Lecture 001
Introduction and Welcome | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=rha64cQRLs8
---
Transcript
[00:00:03] Hi, I'm Chris Potts. I'm a professor in linguistics at Stanford with a courtesy appointment in computer science, and I'm the director of the Stanford Center for the Study of Language and Information, which is an interdisciplinary research center focused on logic, language, decision making, human sentence processing, and computation.

[00:00:19] My undergraduate and graduate degrees are all in linguistics. The work I did for my PhD focused on topics in linguistic pragmatics, which is the study of how language use is shaped by the physical and social context that we're in. At a certain point, I went looking for new ways to support those theories quantitatively, and that began my journey into the world of natural language processing. I now think of myself as being on a mission to help with the sharing of ideas back and forth between linguistics and NLP.

[00:00:42] I've taught Natural Language Understanding nine times at Stanford, with my first year in 2012. In 2012, we were just beginning to see how NLU was going to revolutionize the field and reshape the technology landscape. IBM's Watson had recently won Jeopardy!, Apple's Siri was new, and the other tech giants were on the verge of launching their own intelligent assistants. So there was a widespread perception that NLU was poised to have a transformative impact on the world, and that perception was certainly correct. Since then, NLU has only become more central to the field of NLP, and to all of artificial intelligence more generally, and the progress in the field has been amazing. We have more large NLU datasets than ever before, and the level of innovation in modeling and model analysis is just astounding. As a result, we can tackle more ambitious problems than ever, and there are opportunities to find lots of creative new ways to apply NLU to technology development and scientific inquiry.

[00:01:37] So it's certainly an exciting moment to welcome you to this course. It's an adapted version of the course we teach on campus. The course begins by covering a wide range of models for distributed word representations. From there, we branch out into a series of important NLU topics, including relation extraction, natural language inference, and grounded language understanding. We've chosen these topics because they allow us to highlight many of the central concepts in NLU, which you can then apply more widely.

[00:02:03] One of the special aspects of this course is that it's project-oriented, and with luck the project that you develop will become a professional asset for you. We aim to help you design and conduct a successful research project in the field, and we have an accomplished teaching team to help you with this process. Even the regular assignments are oriented toward building original projects. Each assignment is grounded in a specific topic area, and they all have a common rhythm, in that they ask you to build up some baseline systems and then develop your own original system for solving the task at hand.

[00:02:34] You'll enter each one of these systems into what we call a bake-off, which is an informal competition around data and modeling, and the teaching team will reflect the insights we gain from these bake-off entries back to the whole class. It's also common for bake-off entries to grow into final projects. We hope that all this work spurs you to think creatively and to hone your practical and theoretical skills in NLU. So without further ado, let's get started.
Lecture 002
Course Overview | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=2w_qYPxuzeA
---
Transcript
[00:00:05] So here we go: it's a golden age for natural language understanding.

[00:00:09] Let's start with a little bit of history. Way back when, John McCarthy had assembled a group of top scientists, and he said of this group: we think that a significant advance can be made in artificial intelligence, in one or more of these problems, if a carefully selected group of scientists work on it together for a summer. He had in fact assembled a crack team, but of course there were so many unknown unknowns about working in artificial intelligence that he wildly underestimated, and probably in those three months they just figured out how little they actually knew. And of course, we've been working on those problems that they chartered out at that point ever since.

[00:00:47] NLU has a kind of interesting relationship to this history, because very early on in the history of all of AI, a lot of the research was focused on natural language understanding. Originally, in the '60s, it was done with kind of pattern matching on simple rule sets and things like that. You've seen these things in the form of artifacts like ELIZA. It was oriented towards the things that we now want to work on.

[00:01:10] In the 1970s and '80s, you get a real investment in what I've called linguistically rich, logic-driven, grounded systems. This is a lot of symbolic AI, again oriented toward problems of natural language understanding. Everybody wanted talking robots, and this was the path that they were going to take to achieve them.

[00:01:32] As we all know, in the mid-1990s the field had this revolution of machine learning: statistical NLP was on the rise, and that led to a sharp decrease in natural language understanding work. I think, because of the way that people were understanding how to work with these tools, and understanding the problems that language posed, the field ended up oriented around things that you might think of as parsing problems: much more about structure and much less about communication. As a result, all these really exciting problems from earlier eras kind of fell by the wayside as people worried about part-of-speech tagging and parsing and so forth. So that was like a low period for NLU.

[00:02:11] In the late 2000s, linguistically rich, logic-driven systems re-emerged, but now with learning, and that was a golden era of kind of moving us back into problems of natural language understanding, starting with some basic applications involving semantics.

[00:02:28] And then of course, as you all know, probably from the recent or semi-recent history, in the 2010s NLU took center stage in the field. That's very exciting, right? It's sort of aligned with the rise of deep learning as one of the most prevalent sets of techniques in the field, and as a result, logic-driven systems fell by the wayside.

[00:02:48] This is exciting for us because, of course, this is like the history of our course. When we first started, the problems we focus on in this class were really not central to the field, and now they're the problems that everyone is working on and where all the action is, in the scientific aspects of the field and also in industry.

[00:03:05] As a result of this (and this is kind of provocative), the linguistically grounded, logic-driven systems have again kind of fallen by the wayside, in favor of very large models that have almost no inductive biases of the sort that you see in these earlier systems. What's going to happen in the 2020s, I'm not sure. You might predict that we've seen the last of the linguistically rich, logic-driven systems, but people might have said similar things in the 1990s, and we saw them re-emerge. So I think it's hard to predict where the future will go. But this is an exciting moment, because you can all be part of making that history as you work through problems for this course and on into your careers.
[00:03:46] Let's talk more about some of these really defining moments in this golden age. For me, a very important one was when IBM Watson won Jeopardy!; this was in 2011. It seems like a long time ago now, and it was a really eye-opening event that you would have a machine (Watson, in the middle here) beat two Jeopardy! champions at what is nominally a kind of question answering task.

[00:04:06] For me, an exciting thing about Watson was that it was an NLU system, but it was also a fully integrated system for playing Jeopardy!. It was excellent at pushing the button and responding to the other things that structure the game of Jeopardy!, but at its heart it was a really outstanding question answering system. It was described as drawing on vast amounts of data and doing all sorts of clever things in terms of parsing and distributional analysis of data to become a really good Jeopardy! player, in this case a world-champion Jeopardy! player. For me, it felt different from earlier successes in artificial intelligence, which were about much more structured domains, like chess playing. This was something that seemed to be about communication and language, a very human thing, and we saw this system becoming a champion at it. Certainly an important moment.

[00:04:53] And it's kind of really eye-opening to consider that that was in 2011, and by 2015, teams of academics (like here, led by Jordan Boyd-Graber, who was at the time a professor at the University of Colorado) could beat Ken Jennings, that champion that you saw before, with a system that fit entirely on the laptop that Jordan has there. So in just a few years, we went from requiring a supercomputer to beat the world champion, to beating the world champion with something that could fit manageably on a laptop, and that you all could explore for a final project for this course.
[00:05:27] And that kind of ushered in the era: Watson was 2011, and at right about the same time you started to get things like Siri, the Google Home device, and the Amazon Echo. The more trusting among you might have these devices living in your homes, listening to you all the time and responding to your requests. You know, for me, one aspect of them that's so eye-opening is not necessarily an NLU piece, but rather just the fact that they do such outstanding speech-to-text work, so that, pretty reliably, for many dialects, they do a good job of taking what you said and transcribing it. As we'll see a little bit later, the NLU part often falls down, but there's no doubt that these devices are going to become more and more ubiquitous in our lives, and that's very exciting.

[00:06:14] Here's the promise of these artificial intelligence assistants, setting aside the problems that you all encounter if you do use them. The idea is that you could pose a task-oriented question like "Any good burger joints around here?", and it could proactively say, "I found a number of burger restaurants near you." You could switch your goal: "What about tacos?" At this point, it would kind of remember what you were trying to do, and it would look for Mexican restaurants, to kind of anticipate your needs and in that way collaborate with you to solve this problem. That's the dream of these things, and it involves lots of aspects of natural language understanding. And sometimes it works.
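The context carryover described above (interpreting "what about tacos?" as an update to the remembered restaurant-search goal rather than a fresh request) can be sketched in a few lines. This is a toy illustration, not how any real assistant works; the class, the keyword rules, and the cuisine table are all hypothetical.

```python
# Toy sketch of dialogue-state carryover (all names and rules hypothetical).
# The assistant remembers the current goal, so a fragment like "what about
# tacos?" updates the cuisine slot instead of starting a brand-new request.

CUISINES = {"burger": "burger", "burgers": "burger",
            "taco": "mexican", "tacos": "mexican"}

class ToyAssistant:
    def __init__(self):
        self.state = {}  # dialogue state carried across turns

    def respond(self, utterance):
        words = utterance.lower().replace("?", "").split()
        cuisine = next((CUISINES[w] for w in words if w in CUISINES), None)
        if "joints" in words or "restaurants" in words or "around" in words:
            # A full request: set a fresh goal.
            self.state = {"intent": "find_restaurant", "cuisine": cuisine}
        elif self.state.get("intent") == "find_restaurant" and cuisine:
            # Context carryover: keep the goal, swap only the cuisine slot.
            self.state["cuisine"] = cuisine
        else:
            return "Sorry, I didn't understand."
        return f"I found a number of {self.state['cuisine']} restaurants near you."

bot = ToyAssistant()
print(bot.respond("any good burger joints around here?"))
print(bot.respond("what about tacos?"))  # reuses the remembered goal
```

Real systems replace the keyword rules with learned intent and slot models, but the state object that persists across turns is the essential ingredient.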
[00:06:50] Another exciting and even more recent development is the kind of text generation that you see from models like GPT-3. Again, 15 years ago, the things that we see every day would have seemed like science fiction to me, even as a practitioner. This is an example of somebody having a product that, on the back of GPT-3, can help you write advertising copy, and it does a plausible job of advertising products to specific segments, given specific goals that you give in your prompt.

[00:07:17] And here's an example, actually from a Stanford professor, of a company that he started, where you use GPT-3 to help you write in a particular style that you choose. It's strikingly good at helping you hone in on a style and kind of say what you take yourself to want to say. Although the sense in which these are things that we alone want to say, as opposed to saying them jointly with these devices that we're collaborating with, is something that we're going to really have to think about over the next few years.
[00:07:47] few years image captioning this is another really
[00:07:49] image captioning this is another really exciting breakthrough area that again
[00:07:51] exciting breakthrough area that again seemed like something way beyond what we
[00:07:53] seemed like something way beyond what we could achieve 15 years ago and is now
[00:07:55] could achieve 15 years ago and is now kind of routine where you have images
[00:07:57] kind of routine where you have images like this the image comes in and the
[00:08:00] like this the image comes in and the system does a plausible job of providing
[00:08:03] system does a plausible job of providing fluent natural language captions for
[00:08:05] fluent natural language captions for those images a person riding a
[00:08:07] those images a person riding a motorcycle on a dirt road a group of
[00:08:09] motorcycle on a dirt road a group of young people playing a game of frisbee a
[00:08:11] young people playing a game of frisbee a herd of elephants walking across a dry
[00:08:13] herd of elephants walking across a dry grass these are really good captions for
[00:08:16] grass these are really good captions for these images things that you would have
[00:08:17] these images things that you would have thought only a human could apply but in
[00:08:19] thought only a human could apply but in this case even relatively early in the
[00:08:22] this case even relatively early in the history of these these models you have
[00:08:24] history of these these models you have really fluent
[00:08:25] really fluent captions for the images
[00:08:28] Search: we should remind ourselves that search has become an application of lots of techniques in natural language understanding. When you do a search in Google, you're not just finding the most relevant documents, but rather the most relevant documents as interpreted with your query in the context of things you search for, and other things that Google knows about what people search for. So if you search "sars" here, you'll get a card that kind of anticipates that you're interested in various aspects of the disease SARS. If you search "parasite," it will probably anticipate that you want to know about the movie and not about parasites, although, depending on your search history and your interests and typical goals and so forth, you might see similar behaviors or you might see very different behavior.

[00:09:13] And we should remind ourselves that search at this point is, again, not just searching into a large collection of documents, but this kind of agglomeration of services, many of which depend on natural language understanding as a kind of first pass, where they take your query and do their best to understand what the intent behind the query is: parse it and figure out whether it's a standard search, or a request for directions, or a request to send a message, and so forth and so on. In the background there, a lot of natural language understanding is happening to figure out how to stitch these services together, anticipate your intentions, and essentially collaborate with you on your goals.
[00:09:53] And we can also think beyond just what's happening in the technological space to what's happening internal to our field. So, benchmarks are big tasks that we all collaborate to try to do really well on, with models and innovative ideas and so forth. I've got a few classic benchmarks here: MNIST is for digits, GLUE is a big natural language understanding benchmark, ImageNet of course is finding things in images, SQuAD is question answering, and Switchboard is, typically, speech-to-text transcription in this context.
[00:10:24] in this context along in the spot here along the x-axis
[00:10:26] along in the spot here along the x-axis i have the year from 2000 or actually
[00:10:28] i have the year from 2000 or actually the mid-90s up through the present
[00:10:30] the mid-90s up through the present and along the y-axis i have our distance
[00:10:32] and along the y-axis i have our distance from this black line which is human
[00:10:34] from this black line which is human performance as measured by the people
[00:10:36] performance as measured by the people who develop the data set
[00:10:38] who develop the data set and the striking thing about this plot
[00:10:40] and the striking thing about this plot is that it used to take us a very long
[00:10:42] is that it used to take us a very long time
[00:10:43] time to to reach human level performance
[00:10:45] to to reach human level performance according to this estimate so for mnist
[00:10:48] according to this estimate so for mnist and for switchboard it took more than 15
[00:10:50] and for switchboard it took more than 15 years
[00:10:52] Whereas for more recent benchmarks like ImageNet and SQuAD and, recently, GLUE, we're reaching human performance within a year. And the striking thing about that is not only is this happening much faster, but you might have thought that benchmarks like GLUE were much more difficult than MNIST. MNIST is just recognizing digits that are written out as images, whereas GLUE is really solving a whole host of what looked like very difficult natural language understanding problems. So the fact that we would go from way below human performance to surpassing human performance in just one year is surely eye-opening, and an indication that something has changed.
[00:11:32] Let me give you a few examples of this, just to dive in a little bit. This is the Stanford Question Answering Dataset, or SQuAD, as you saw it here. I'll say a bit more about this task later, but you can think of it as just a question answering task. And the striking thing about the current leaderboard is that you have to go all the way to place 13 to find a system that is worse than the human performance, which they've nicely kept at the top of this leaderboard. Many, many systems are superhuman according to this metric on SQuAD.
[00:12:03] The Stanford Natural Language Inference corpus is similar. Natural language inference is a kind of commonsense reasoning task that we're going to study in detail later in the quarter. In this plot here, I have time along the x-axis and the F1 score, or the performance, along the y-axis, and the red line charts out what we take to be the human estimate of performance on this dataset. And if you just look at systems over time according to the leaderboard, you can see the community very rapidly hill-climbing toward superhuman performance, which happened in 2019. Superhuman systems when it comes to commonsense reasoning with language: that really looks like a startling breakthrough in artificial intelligence quite generally.
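For leaderboards like the ones mentioned here, the F1 score is typically a token-overlap measure comparing a system's answer to the gold answer. A minimal sketch (real evaluation scripts also normalize case, punctuation, and articles):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer string.

    Illustrative sketch only: official evaluation scripts apply extra
    normalization before tokenizing.
    """
    pred_toks = prediction.split()
    gold_toks = gold.split()
    # Count tokens shared between prediction and gold (with multiplicity).
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("john elway", "john elway"))  # 1.0
print(token_f1("elway", "john elway"))       # partial credit
```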
[00:12:48] I mentioned GLUE as another benchmark. The GLUE paper is noteworthy because it says that solving GLUE is beyond the capability of current transfer learning methods. The reason they said that is that at the time, 2018, GLUE looked incredibly ambitious, because the idea was to develop systems that could solve not just one task but 10 somewhat different tasks in the space of natural language understanding. And so they thought they had set up a benchmark that would last a very long time, but it took only about a year for systems to surpass their estimate of human performance. On the current leaderboard, which you see here, you have to go all the way to place 15 to find the GLUE human baselines, with many systems vastly outperforming that estimate of what humans could do.
[00:13:34] estimate of what humans could do super glue was announced as a successor
[00:13:36] super glue was announced as a successor to glue and meant to be even more
[00:13:38] to glue and meant to be even more difficult
[00:13:39] difficult it was launched in
[00:13:41] it was launched in 2019 i believe i'm missing the date but
[00:13:44] 2019 i believe i'm missing the date but it took less than a year this happened
[00:13:46] it took less than a year this happened just a couple of months ago
[00:13:48] just a couple of months ago for a team to beat the human baseline
[00:13:51] for a team to beat the human baseline and now we have two systems that are
[00:13:52] and now we have two systems that are above the level of human performance in
[00:13:54] above the level of human performance in an even tighter window i believe than
[00:13:57] an even tighter window i believe than what happened with the glue benchmark
[00:13:58] what happened with the glue benchmark and remember super glue was meant to
[00:14:00] and remember super glue was meant to have learned the lessons from glue and
[00:14:02] have learned the lessons from glue and pose an even stronger benchmark for the
[00:14:05] pose an even stronger benchmark for the field to try to hill climb on and very
[00:14:07] field to try to hill climb on and very quickly we saw this superhuman
[00:14:08] quickly we saw this superhuman performance
[00:14:10] So what's the takeaway of all this? You might think, wow. Have a look at Nick Bostrom's book called Superintelligence, which tries to imagine, in a philosophical sense, a future in which we have many systems that are incredible at the tasks that we have designed them for, vastly outstripping what humans can achieve. He imagines this kind of very different reality, with lots of unintended side effects. And when you look back on the things that I've just highlighted, you might think that we're on the verge of seeing exactly that kind of superhuman performance that would be so radically transformative for our society and for our planet. That's the sense in which we live, possibly scarily, in this golden age for natural language understanding.
[00:14:53] I mean this to be an optimistic perspective: we should be aware of the power that we might have. And keep in mind that I do think we live in a golden age. But at this point I have to step back, I have to temper this message somewhat, we have to take a peek behind the curtain. Because although that's a striking number of successes, doing things that, again, I think would have looked like science fiction 20 years ago, we should be aware that progress seems to be much more limited than those initial results would have suggested.
[00:15:23] initial results would have suggested i mentioned watson as one of these
[00:15:25] i mentioned watson as one of these striking early successes and it did in
[00:15:27] striking early successes and it did in fact perform
[00:15:29] fact perform in a superhuman way at jeopardy for the
[00:15:31] in a superhuman way at jeopardy for the time
[00:15:32] time but watson also does all sorts of
[00:15:34] but watson also does all sorts of strange things that reveal that it does
[00:15:36] strange things that reveal that it does not deeply understand what it's doing
[00:15:37] not deeply understand what it's doing here's a wonderful example of that
[00:15:40] here's a wonderful example of that remember that jeopardy does this kind of
[00:15:42] remember that jeopardy does this kind of question answer thing backwards so the
[00:15:44] question answer thing backwards so the prompt from the host was grasshoppers
[00:15:46] prompt from the host was grasshoppers eat it
[00:15:47] eat it and what watson said was what is kosher
[00:15:51] and what watson said was what is kosher you might think that's not something
[00:15:52] you might think that's not something that humans do grasshoppers eat it what
[00:15:54] that humans do grasshoppers eat it what is kosher it feels kind of mismatched in
[00:15:57] is kosher it feels kind of mismatched in many respects what's the origin of this
[00:15:59] many respects what's the origin of this very strange response well
[00:16:01] very strange response well primarily watson was a device for
[00:16:03] primarily watson was a device for extracting information from wikipedia
[00:16:06] extracting information from wikipedia and a few wikipedia pages have very
[00:16:09] and a few wikipedia pages have very detailed descriptions of whether various
[00:16:12] detailed descriptions of whether various animals including grasshoppers
[00:16:14] animals including grasshoppers are kosher in the sense of conforming to
[00:16:16] are kosher in the sense of conforming to the laws of the kosher dietary laws
[00:16:19] the laws of the kosher dietary laws and watson had simply mistaken this kind
[00:16:22] and watson had simply mistaken this kind of distributional proximity for a real
[00:16:24] of distributional proximity for a real association
[00:16:25] association and thought that kosher was a reasonable
[00:16:27] and thought that kosher was a reasonable answer to grasshoppers eat it i think
[00:16:29] answer to grasshoppers eat it i think very
[00:16:30] very unhuman certainly and revealing about
[00:16:32] unhuman certainly and revealing about the kinds of superficial techniques it
[00:16:34] the kinds of superficial techniques it was using
[00:16:36] was using here's another example that's even more
[00:16:38] here's another example that's even more revealing of how superficial the
[00:16:40] revealing of how superficial the techniques can be so i i painted this
[00:16:42] techniques can be so i i painted this picture of before of how we imagine siri
[00:16:45] picture of before of how we imagine siri will behave anticipating our needs and
[00:16:47] will behave anticipating our needs and our goals and responding accordingly
[00:16:50] our goals and responding accordingly this is a very funny scene from the
[00:16:51] this is a very funny scene from the colbert show this is stephen colbert and
[00:16:53] colbert show this is stephen colbert and the premise of this is that he's just
[00:16:55] the premise of this is that he's just gotten his first iphone with siri and
[00:16:57] gotten his first iphone with siri and he's been playing with it all day and
[00:16:59] he's been playing with it all day and therefore has failed to write the show
[00:17:01] therefore has failed to write the show that he's now
[00:17:02] that he's now performing in so he says for the love of
[00:17:05] performing in so he says for the love of god the cameras are on give me something
[00:17:07] god the cameras are on give me something now give me something for the show
[00:17:09] now give me something for the show and siri says what kind of place are you
[00:17:11] and siri says what kind of place are you looking for camera stores or churches
[00:17:15] looking for camera stores or churches initially very surprising not something
[00:17:17] initially very surprising not something a human would do and then you realize it
[00:17:19] a human would do and then you realize it has again just done some very
[00:17:20] has again just done some very superficial pattern matching god goes
[00:17:23] superficial pattern matching god goes with churches cameras goes with camera
[00:17:25] with churches cameras goes with camera stores and there is no sense in which it
[00:17:27] stores and there is no sense in which it understands his intentions it has just
[00:17:29] understands his intentions it has just done some pattern matching in a way that
[00:17:32] done some pattern matching in a way that would be very familiar to the designers
[00:17:34] would be very familiar to the designers of eliza way back in the 60s and 70s
[00:17:38] of eliza way back in the 60s and 70s the dialogue continues
[00:17:40] the dialogue continues i don't want to search for anything i
[00:17:41] i don't want to search for anything i want to write the show and in true to
[00:17:43] want to write the show and in true to form siri says searching the web for
[00:17:45] form siri says searching the web for search for anything i want to write the
[00:17:47] search for anything i want to write the shuffle revealing its fallback when it
[00:17:49] shuffle revealing its fallback when it has no idea what has happened in the
[00:17:51] has no idea what has happened in the discourse it just tries to do a web
[00:17:53] discourse it just tries to do a web search a simple trick revealing that it
[00:17:56] search a simple trick revealing that it doesn't deeply understand goals or plans
[00:17:58] doesn't deeply understand goals or plans or intentions or even communicative acts
[00:18:03] I showed you before that GPT-3 can do some striking things. If you've gotten to play around with it, you've seen that it can indeed be very surprising and delightful, but of course it can go horribly wrong. This is a very funny text from Yoav Goldberg; he posted this on Twitter when he was experimenting with the prompts. I encourage you to read this one and be distracted. You don't need to worry too much about this one on the right: this is a case where someone tried to use GPT-3 to get medical advice, and the ultimate response from GPT-3 to the question "should I kill myself?" was "I think you should." This is the really dangerous thing. The text on the left here is, again, more innocent, and just reveals that although GPT-3 has a way of mimicking the kinds of things that we say in certain kinds of discourse, and it often has a strikingly good ear for the kinds of style that we use in these different contexts, it has no idea what it's talking about. So that if you ask it "are cats liquid?", it gives a response that sounds quite erudite, provided that you don't pay any attention to what it's actually saying. What it's actually saying is hilarious.

[00:19:08] I mentioned those image captions before, and I tricked you a little bit, because I showed you from this paper the ones that they regarded as the best captions for those images. But, to their credit, they provided a lot more examples, and as you travel to the right along this diagram you get worse and worse captions. The point, again, is that by the time you've gotten to this right column over here, you have really absurd captions, like this one saying "a refrigerator filled with lots of food and drinks" when this is in fact just a sign with a bunch of stickers on it. The striking thing, again, is that the kinds of mistakes it makes are not the kinds of mistakes that humans would make, and to me they reveal a serious lack of understanding about what the actual task is. What you're seeing seep in here is that even the best of our systems are kind of doing a bunch of superficial pattern matching, and that leads them to do these very surprising and unhuman (hopefully not inhuman, but unhuman) things with their outputs.
[00:20:09] And then, of course, I've showed you before that search can be quite sophisticated and really do a good job of anticipating our intentions and fleshing out what we said, to help us achieve our goals. But it can go horribly wrong, and at this point it doesn't take much searching around with Google to see some really surprising things presented as supposedly curated pieces of information. Like "king of the united states": it has this nice box, making it look like it's some authoritative information, but of course it has badly misunderstood the true state of the world; the associations in its data are misleading it into giving us the wrong answer. Here's another example, "what happened to the dinosaurs": again a nicely curated box that looks like an authoritative response to that question, but it is in fact anything but an authoritative recounting of what happened to the dinosaurs.
[00:21:01] And then we have other charming stories that, again, reveal how superficial this can be. This is a headline from a few years ago: "Does Anne Hathaway news drive Berkshire Hathaway stock?" This was just an article observing that every time Anne Hathaway has a movie come out and people like the movie, it causes a little bump in the Berkshire Hathaway stock, revealing that the systems are just keying in on keywords and typically not attending to the actual context of the mentions of these things, and therefore they're building on what is essentially spurious information.
[00:21:34] This is a more extreme case here, the United Airlines bankruptcy. In 2008, when a newspaper accidentally republished a 2002 bankruptcy story, automated trading systems reacted in seconds, and one billion dollars in market value evaporated within 12 minutes; you can see that sharp drop-off here. Luckily, people intervened, and the market more or less recovered. But the important thing here, again, is just that in attending only to superficial things about the texts they are consuming, these systems miss context. They don't bring any kind of human-level understanding of what's likely to be true and false, and therefore they act in very surprising ways, and in the context of a large system with lots of moving pieces, interacting with other artificial intelligence systems, you get these really surprising outcomes, which we could help correct if we just did a better job designing systems that attend to context and have a more human-like understanding of what the world is likely to be like.
[00:22:34] And we're all, of course, very worried about the way these systems, which are just trained on potentially biased data, might cause us to perpetuate biases, so that not only are we reflecting problematic aspects of our society but also amplifying those biases. In that way, far from achieving a social good, we would actually be contributing to some pernicious things that already exist in our society, and the field is really struggling to come to grips with that kind of dynamic.
[00:23:02] And I also wanted to just dive in a little bit and think about the low-level stuff, the kind of benchmarks that we've set for ourselves. I pointed out that progress on these benchmarks seems to be faster than ever, right? We're getting to superhuman performance more quickly than we ever have before; the speedup is remarkable. However, we should be very careful not to mistake those advances for any kind of claim about what these systems can do with respect to the very human capability of something like answering questions or reasoning in language. And one very powerful thing that's happened in the field, which we're going to talk a lot about this quarter, is so-called adversarial testing, where we try to probe our systems with examples that don't fool humans but cause these systems no end of grief.
[00:23:49] So let's look at one of those cases in a little bit of detail. This is from SQuAD. The way SQuAD is structured is that you're given a passage and a question about that passage, and the goal of the system is to come up with an answer, where you have a guarantee that the answer is a literal string in that passage. So here you have a passage about football and the question "What is the name of the quarterback who was 38 in Super Bowl XXXIII?", and the answer is "John Elway".
[00:24:15] But what Jia and Liang from Stanford observed is that you could very easily fool these systems if you simply appended to that original passage a misleading sentence like "Quarterback Leland Stanford had jersey number 37 in Champ Bowl XXXIV." Humans were not misled; they very easily read past the distracting information and continued to provide the correct answer. However, even the very best systems would reliably be distracted by that new information and respond with "Leland Stanford", changing their predictions.
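The "append" attack just described can be sketched in a few lines. The dict layout and field names below are hypothetical, loosely mirroring a SQuAD-style example rather than reproducing the actual SQuAD JSON schema, and the passage text is paraphrased from the lecture:

```python
# A minimal sketch of the append attack on a SQuAD-style example.
# Field names and passage text are illustrative, not the real dataset.

def append_attack(example, distractor):
    """Return a new example whose passage has the distractor sentence appended."""
    attacked = dict(example)
    attacked["passage"] = example["passage"].rstrip() + " " + distractor
    return attacked

example = {
    "passage": "Broncos quarterback John Elway was 38 in Super Bowl XXXIII.",
    "question": "What is the name of the quarterback who was 38 in Super Bowl XXXIII?",
    "answer": "John Elway",
}

distractor = "Quarterback Leland Stanford had jersey number 37 in Champ Bowl XXXIV."
attacked = append_attack(example, distractor)

# The gold answer is still a literal substring of the attacked passage,
# so a system that truly understood the question should be unaffected.
assert example["answer"] in attacked["passage"]
```

The point of the sketch is that the attack changes nothing about the question or the gold answer; only a system leaning on shallow lexical overlap should be fooled.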
[00:24:47] And you might think, ah, well, this is straightforward, they've already charted a path to the solution, because we should then just train our systems on data where they have these misleading sentences appended, and then they'll overcome this adversarial problem and be back up to doing what humans can do. But Jia and Liang anticipated that response: what happens if you prepend the sentence instead? Even when they're trained on the augmented data with the sentences appended to the end, systems get misled by the prepended examples. And you can just go back and forth like this: train on the prepended examples, well, then an adversary can insert a sentence in the middle and again trick the system, and so forth and so on.
[00:25:28] So this is a worrisome fact, again revealing that we might think we've got a system that truly understands, but actually we have a system that is just benefiting from a lot of patterns in the data.
[00:25:42] Another striking thing I want to point out about the way this adversarial testing played out, which we should have in mind as we think about results like this. These are the original results on SQuAD and the results for the adversaries. Percy Liang has this system called CodaLab, where he hosts all the systems that enter the SQuAD competition, which made it possible for him and his students to rerun all those systems and see how they did on the adversarial dataset they had created. You can see that all the systems really plummet in their performance: from a high of 81 you drop down to about 40. Maybe that's kind of expected, but another really eye-opening thing about their result is that the rank of the systems changed really dramatically. The original top-ranked system went to 5th, the 2nd went to 10th, the 3rd went to 12th. As we did this adversarial thing we didn't see a uniform drop with the best system still being the best, but a real shuffling of the leaderboard, again revealing, I think, that the best systems were overfit and benefiting from relatively low-level facts about the dataset, and not really transformatively different when it comes to being able to answer questions.
[00:26:56] The history of natural language inference problems is very similar. As I said, we're going to look at this problem in a lot of detail later in the course. Here are just a few very simple NLI examples. You've got a premise like "A turtle danced" and a hypothesis "A turtle moved", and one of three relations that can hold between those sentences. So "A turtle danced" entails "A turtle moved"; "Every reptile danced" is neutral with respect to "A turtle ate" (they can be true or false independently of each other); and "Some turtles walked" contradicts "No turtles moved". This is typical of NLI data; the actual corpus sentences tend to be more complicated and involve more nuanced judgments, but that's the framing of the task: it's a three-way classification problem with these labels, and the inputs are pairs of sentences like this.
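The task framing above can be made concrete with a small sketch. The tuple-based representation here is purely illustrative (not the actual SNLI corpus format), and the toy examples are the ones from the lecture:

```python
# A minimal sketch of NLI as three-way classification over sentence pairs:
# inputs are (premise, hypothesis) pairs, outputs are one of three labels.
# The representation is invented for illustration, not the SNLI format.
from collections import Counter

LABELS = ("entailment", "neutral", "contradiction")

examples = [
    ("a turtle danced", "a turtle moved", "entailment"),
    ("every reptile danced", "a turtle ate", "neutral"),
    ("some turtles walked", "no turtles moved", "contradiction"),
]

def majority_baseline(pairs):
    """A trivial classifier: always predict the most frequent gold label."""
    return Counter(label for _, _, label in pairs).most_common(1)[0][0]

# Every gold label is one of the three relations.
assert all(label in LABELS for _, _, label in examples)
```

Even a trivial baseline like this is worth having: it is the floor that any real NLI system has to beat.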
[00:27:40] And as I showed you before, for one of the large benchmarks, the Stanford Natural Language Inference corpus, we reached superhuman performance in 2019. But those same systems really struggle with simple adversarial attacks.
[00:27:55] This is a lovely paper called Breaking NLI, from Glockner et al. What they did is fix a premise like "A little girl kneeling in the dirt crying". The original corpus example was that that entails "A little girl is very sad". And they just had an expectation, sort of adversarially, though this is a very friendly adversary: if I just replace "sad" with "unhappy", I should continue to see the entailment relation predicted; after all, I've just substituted one word for its near synonym. But what they actually saw is that systems very reliably flip this to the contradiction relation, probably because they are keying into the fact that this is a negation, and they have overfit on the idea that the presence of negation is a signal that you're in the contradiction relationship. So that's a sort of distressing thing: again, humans don't make these mistakes, but systems are very prone to them.
[00:28:45] Let me show you one more. This is a slightly different adversarial attack; in this case we're going to modify the premise. The original training example was "A woman is pulling a child on a sled in the snow", which entails "A child is sitting on a sled in the snow". That's pretty clear. For their adversarial attack, they just swapped the subject and the object, so the new premise is "A child is pulling a woman on a sled in the snow". We would expect that to lead to the neutral label for this particular hypothesis. But what Nie et al. observed is that the systems are kind of invariant under this changing of the word order: they continue to predict entailment, revealing that they don't really know what the subject and the object were in the original example; they have done something much fuzzier with the set of words in that premise. Remember, these were at the time the very best systems for solving these problems, and these are very simple, kind of friendly adversaries that they're stumbling on.
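The subject/object swap attack can be sketched mechanically. The helper below is purely illustrative; it assumes lowercase text and the two noun phrases from the lecture's "X is pulling Y" example, not a general-purpose parser:

```python
# A minimal sketch of the word-order ("swap") attack: exchange two known,
# non-overlapping noun phrases in the premise while keeping the hypothesis
# fixed. Lowercase strings are assumed for simplicity.

def swap_phrases(sentence, a, b):
    """Exchange two known phrases in a sentence."""
    placeholder = "\x00"  # temporary marker so the two replacements don't collide
    return (sentence.replace(a, placeholder)
                    .replace(b, a)
                    .replace(placeholder, b))

premise = "a woman is pulling a child on a sled in the snow"
hypothesis = "a child is sitting on a sled in the snow"

swapped = swap_phrases(premise, "a woman", "a child")
# The swapped premise no longer supports the hypothesis: the gold label
# should change from entailment to neutral, yet the systems discussed
# above kept predicting entailment.
assert swapped == "a child is pulling a woman on a sled in the snow"
```

A bag-of-words model sees the original and swapped premises as identical, which is exactly why invariance under this transformation is diagnostic of shallow processing.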
[00:29:43] So this could lead you to have two perspectives. I showed you the Nick Bostrom one before, where we worry about superintelligent systems. But on the other hand, we might be living in a world that's more like the one presented in this lovely book from a roboticist, a practitioner, Daniel H. Wilson, called How to Survive a Robot Uprising, where he gives all sorts of practical advice, like wear clothing that will fool the vision system of the robot, or walk up some stairs, or drench everything in water: very simple adversarial attacks that reveal that these robots are not creatures we should be fearful of. And I feel like I've just shown you a bunch of examples that are the analogues of wearing misleading clothing in the space of natural language processing, revealing that our systems are not superhuman understanders or communicators or anything like that, but rather, still to this day, fairly superficial pattern matchers.
[00:30:36] Why is this all so difficult? It's hard to articulate precisely what is so challenging, because this is probably deeply embedded in the whole human experience, but I think there are some pretty straightforward, superficial things I can show you to make vivid how hard even the simplest tasks are. So here I've got an imagined dialogue of the sort you'd hope Siri would do well with.
[00:30:57] "Where is Black Panther playing in Mountain View?" "Black Panther is playing at the Century 16 theater." "When is it playing there?" "It's playing at 2, 5, and 8." "OK, I'd like one adult and two children for the first show. How much would that cost?" It seems like the most mundane sort of interaction; you would not expect a human to have any problem with any of these utterances.
[00:31:19] But think about how much interesting stuff is happening in this little dialogue. We have domain knowledge that tells us that Mountain View is a place where movies might play and that Black Panther is the name of a movie; that's already very difficult. Then we have anaphora from the third utterance back to the first, "When is it playing there?", and I guess also into the second, where these pronouns require you to figure out what they refer to in the discourse. Then you get the sequence of responses, again with some anaphora back to earlier utterances. And then something really complicated happens with "one adult and two children for the first show": "first show" refers back to the sequence of showtimes that was mentioned earlier, which is very difficult, and "one adult and two children" is not a request for human beings, although that's what the forms would look like, but rather a request for tickets. So somehow, in the context of this discourse, "one adult and two children" refers to tickets and not to people. "How much would that cost?" involves a kind of complicated event description, referring to a hypothetical event of buying some tickets for a particular show; that's the reference of "that" here, which is highly abstract and very difficult, both at the level of resolving it in the discourse and of figuring out what its actual content is.
[00:32:34] And this is the most mundane sort of interaction, to say nothing of the complicated things that, for example, you and I will do when we discuss this material in just a few minutes. So I think this is why we're actually quite far from the superintelligence that Bostrom was worried about.
[00:32:51] Here's our perspective. As I said, this is the most exciting moment ever in history for doing NLU. Why? Because there's incredible interest in the problems, and because we are making incredibly fast progress, doing things and solving problems that we never could have even tackled 15 years ago. On the other hand, you do not have the misfortune of having joined the field at its very end: the big problems remain to be solved.
[00:33:16] So there's a resurgence of interest and an explosion of products. The systems are impressive, but their weaknesses quickly make themselves apparent, and when we observe those weaknesses, it's an opportunity for us to figure out what the problem is, and that could lead to the really big breakthroughs. And you all are now joining us on this journey, if you haven't begun it already, and for your projects you'll make some progress along the path of helping us through these very difficult problems the field is confronting. Even in the presence of all these exciting breakthroughs, NLU is far from solved; the big breakthroughs lie in the future.
[00:33:54] So I hope that's inspiring. Now let me switch gears a little bit and talk about the things that we'll actually be doing in this course to help set you on this journey that we're all on. We'll talk about the assignments, the bake-offs, and the projects. The high-level summary: our topics are listed on the left, and you can also see this reflected on the website. The one thing that I really do like about this particular plan is that it gives you exposure to a lot of different problems in the field, and also helps you with some tools and techniques that will be really useful no matter what problem you undertake for your final project.
[00:34:30] The same thing goes for the assignments. We're going to have three assignments, each with an associated bake-off, which is a kind of competition around data. We're going to talk about word relatedness, cross-domain sentiment analysis, and generating color descriptions, which is a kind of grounded language understanding problem. Again, I think those are good choices because they expose you to a lot of different kinds of systems, techniques, model architectures, and so forth. That should set you up really nicely to do a final project, which has three components: a literature review, an experimental protocol, and then the final paper itself. Our time this quarter is somewhat compressed, so we'll have to make really good use of it, but I think we have a schedule that will allow you to meaningfully invest in this preliminary work and still provide you with some space to do the final projects.
[00:35:19] Let's talk about the assignments and bake-offs themselves. There are three of them. Each assignment culminates in a bake-off, which is an informal competition in which you enter an original model. The original model question is part of the assignment: you do something that you think will be fun or interesting, and then the bake-off essentially involves using that system to make predictions on a held-out test set. The assignments ask you to build baseline systems and then design your original system, as I said. Practically speaking, the way it works is that the assignments earn you 9 of the 10 points, and then you earn your additional point by entering your system into the bake-off, and the winning bake-off entries can receive some extra credit. The rationale for all of this, of course, is that we want to exemplify the best practices for doing research in this space and help you do things like incrementally build up a project with baselines and then, finally, an original system. And I should say it should be possible, and it's actually pretty common, for people to take original systems that they developed as part of one of these assignments and use them for their final project. Each one of the assignments is set up specifically to make that kind of thing possible and productive and rewarding.
[00:36:31] Let me show you briefly what the bake-offs are going to be like. The first one is word relatedness. The focus of that unit is on developing vector representations of words. You're going to start, probably, with big count matrices like the one you see here. This is a word-by-word matrix where the cells give the number of times that these words, or in this case emoticons, co-occurred with each other in a very large corpus of text. The striking thing about this unit is that there is a lot of information about meaning embedded in these large spaces. You will bend and twist and massage these spaces, and maybe bring in your own vector representations or representations you've downloaded from the web, and you will use them to solve a word relatedness problem. Basically, you'll be given pairs of words with a score, and you will develop a system that can make predictions about new pairs; the idea is to come up with scores that correlate with the held-out scores that we have not distributed as part of the test set. You'll upload your entry, we'll give you a score, and then we'll look at what worked and what didn't across all the systems.
[00:37:35] and what didn't across all the systems and the techniques that we'll explore
[00:37:37] and the techniques that we'll explore are many right so we'll talk about
[00:37:38] are many right so we'll talk about re-weighting dimensionality reduction
[00:37:41] re-weighting dimensionality reduction vector comparison you'll have an
[00:37:42] vector comparison you'll have an opportunity if you wish this is a brand
[00:37:44] opportunity if you wish this is a brand new addition to the course to bring in
[00:37:46] new addition to the course to bring in bert if you would like to
[00:37:48] bert if you would like to um so there's lots of inspiring things
[00:37:50] um so there's lots of inspiring things to try building on the latest stuff
[00:37:51] to try building on the latest stuff that's happening in this space
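The three techniques named here (re-weighting, dimensionality reduction, vector comparison) can be sketched end to end with standard choices: PPMI re-weighting, truncated SVD, and cosine similarity. The tiny count matrix below is invented for illustration; the homework's actual matrices and helper functions will differ.

```python
import numpy as np

def ppmi(X):
    """Positive PMI re-weighting of a co-occurrence count matrix."""
    X = np.asarray(X, dtype=float)
    total = X.sum()
    row = X.sum(axis=1, keepdims=True)
    col = X.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((X * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0      # zero counts -> -inf -> clip to 0
    return np.maximum(pmi, 0.0)

def svd_reduce(X, k):
    """Dimensionality reduction via truncated SVD."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

def cosine(u, v):
    """Vector comparison via cosine similarity."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented 3-word x 2-context counts: rows 0 and 1 share a context profile.
counts = np.array([
    [10, 1],
    [9, 2],
    [1, 10],
], dtype=float)
vecs = svd_reduce(ppmi(counts), k=2)
```

After re-weighting and reduction, cosine similarity separates the first two rows (similar distributions) from the third.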
[00:37:55] The second bake-off is called cross-domain sentiment analysis. This is a brand new bake-off, and I'm very excited to see how this goes. Here's how this is going to work. We want to be a little bit adversarial with your system, so there are two datasets involved. There's the Stanford Sentiment Treebank, which is movie review sentences, and we're going to deal with it in its ternary formulation, so it has positive, negative, and neutral labels. That's SST-3. Alongside that, I'm going to introduce a brand new dev/test split, previously unreleased, which is sentences from the restaurant review domain. It has the same kind of labels, but of course it's very different from the movie review domain along many dimensions. So, for the bake-off, you'll have the SST-3 train set; we're going to give that to you, and you are welcome to introduce any other data you would like to introduce as part of this training. That is entirely up to you.
[00:38:47] We're also distributing, for this bake-off, two dev sets: the SST-3 dev set, which is public already, and this brand new one of restaurant review sentences. And you can introduce other development sets if you want as part of tuning your system. And then the bake-off will be conducted as the best that people can do jointly on the SST-3 test set and this new test set, which is, again, held out. The idea here is that you'll not only be doing a really great kind of classification project involving sentiment, but also pushing your systems to adapt to new domains from the ones that they were trained on, although of course part of that could be training in clever ways that do help you anticipate what's in this restaurant review data.
[00:39:33] And then the third bake-off is a natural language generation task, but it's a grounded natural language generation task. So, for this bake-off, you'll be given a color context, like three patches like this, with one of them designated as the target. For the training data, we had people describe the target color in this context, and your task is to develop a system that can perform that same task: produce natural language utterances like this.

[00:39:58] I think this is a really cool problem, because it's grounded in something outside of language, right? You're grounded in a color patch; it's just a numerical representation. And it's also highly context dependent, in that the choices people make for their utterances depend not only on the target, the color they need to describe, but on the target in the context of these other two colors. And you can see that in these descriptions. This one is easy, so the person just said "blue". But in this case, since there are two blues, they said "the darker blue one", kind of keying into the fact that the context contains two blue colors. This is an even more extreme case of that: "dull pink, not the super bright one". The implicit reference is not only to the target but also to other colors that are distractors in that context.
[00:40:43] And then the final two are really extreme cases of this, where, for one and the same color (you know, in terms of its color representation), this person said "purple" and this person said "blue", in virtue of the fact that the distractors are different. So we have this kind of controlled natural language generation task that, as we can see, is highly dependent on the particular context that people are in.

[00:41:06] And again, this will follow the same path. Here, I'm going to give you a whole model architecture as a kind of default. It's an encoder-decoder architecture, where you'll have a machine learning system that consumes color representations and then transfers that into a decoding phase, where it tries to produce an utterance. So here, you've consumed a bunch of colors and produced the description "light blue". But of course you'll be able to explore many variants of this architecture, and really explore different things that are effective in terms of this representation and natural language generation task.

[00:41:39] And again, as a bake-off, it will work the same way: you'll do a bunch of development on training data that we give you, and then you'll be evaluated on the held-out test set, which was produced in the same fashion but involves entirely new colors and utterances.
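To give a feel for the task format (three color patches plus a target index in, a description out), here is a trivial rule-based describer. It is a hypothetical stand-in for the encoder-decoder, with an invented three-color inventory, but it mimics the context-sensitive "darker blue" behavior discussed above.

```python
# Each example: three RGB patches (floats in [0, 1]) and the index of the target.
BASIC = {
    "red": (1.0, 0.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
}

def nearest_name(rgb):
    """Name a color by its nearest basic color (squared Euclidean distance)."""
    return min(BASIC, key=lambda n: sum((a - b) ** 2 for a, b in zip(rgb, BASIC[n])))

def describe(context, target_idx):
    """Describe the target; disambiguate only when a distractor shares its name."""
    target = context[target_idx]
    name = nearest_name(target)
    rivals = [c for i, c in enumerate(context)
              if i != target_idx and nearest_name(c) == name]
    if not rivals:
        return name
    # Same-named distractors present: add a brightness comparative.
    brightness = sum(target) / 3
    comp = "darker" if all(brightness < sum(r) / 3 for r in rivals) else "lighter"
    return f"{comp} {name}"

# Two blues and a red: the blue target needs a comparative, the red one doesn't.
ctx = [(0.1, 0.1, 0.9), (0.5, 0.5, 1.0), (0.9, 0.1, 0.1)]
```

The learned encoder-decoder replaces these hand-written rules, but the input/output contract is the same.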
[00:41:55] A quick note on the original systems. As I said before, the original system is kind of a central piece of each one of the assignments. The homeworks really culminate in this original system, and that becomes your bake-off entry. In terms of grading, this is kind of hard, because we want you to be creative and try lots of things. So the way we're going to value these entries is: any system that performs extremely well on the bake-off will be given full credit, even systems that are very simple, right? We can't argue with success, according to the criteria that we've set up. So if the simplest possible approach to one of these bake-offs turns out to be astoundingly good, and you had to do almost no work to succeed, you of course get full credit.

[00:42:36] But that's not the only thing we value, right? So systems that are creative and well motivated will be given full credit even if they don't perform well on the bake-off data. This is meant to be an explicit encouragement for you to try new things, to be bold and be creative, even if it doesn't numerically lead to the best results. In fact, some of the most inspiring and insightful things we've seen as part of these bake-offs have been systems that didn't perform at the top of the heap but harbored some really interesting insight that we could build on.

[00:43:08] Right, and then, of course, systems that really are minimal: if you do very little and you don't do especially well at the bake-off, you'll receive less credit. The specific criteria will depend on the nature of the assignment and so forth; we'll try to justify that for you. This is the more subjective part. I think one and two really encode the positive part of the kind of values that we're trying to convey to you as part of these original system entries.
[00:43:36] And then you'll have project work. This occupies the entire second half of the course. At that point, the lectures, the notebooks, the readings, and so forth are really focused on things like methods, metrics, best practices, error analysis, model introspection, and other things that will help you enrich the paper that you write.

[00:43:57] The assignments are all project related: they're the literature review, the experimental protocol, and the final paper. In many years we've had a video presentation, which has always been really rewarding, but unfortunately, I feel like, given the compressed time schedule that we're on, we just don't have time for even short videos, so we're going to focus on these three crucial components. And for exceptional final projects from past years that we've selected, you could follow this link. It's access restricted, but if you're enrolled in the course, you should be able to follow the link and see some examples. And there's a lot more guidance on final projects in our course repository: I have a very long write-up of FAQs and other guidance about publishing in the field and writing for this class, and I have what is now a really inspiringly long list of published papers that have grown out of work people have done for this course. So you can check that all out.

[00:44:48] Here are some final words, by way of wrapping up. As I said, this is the most exciting moment ever in history for doing NLU.
[00:44:55] This course will give you hands-on experience with a wide range of challenging problems. I emphasize the hands-on thing; I think this is so important. If you want to acquire a new skill like this, it's all well and good to watch other people doing it, but the way you really acquire the skill is by having hands-on experiences yourself. So everything about the requirements and the materials is pushing you to have those hands-on experiences and become expert at that very fundamental level.

[00:45:24] For the final project, a mentor from the teaching team will guide you through those assignments. We'll be there to help you make choices and set the scope for the project, and maybe push it towards something that you could one day publish. There are many examples of successful publications deriving from this course.

[00:45:42] Our central goal, fundamentally, though, is to make you the best, that is, the most insightful and responsible, NLU practitioner and researcher, wherever you go next: into academia, or just into other classes, or on into industry to leverage these skills. We want to make you, as I said, the most insightful and responsible practitioner we can.
Lecture 003
Homework 1: Word Relatedness | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=egEzcwbej1E
---
Transcript
[00:00:05] Hello, everyone. This screencast is going to be a brief playthrough of Homework 1, on word relatedness. I hope to give you a sense for the problem that you're tackling, and also our expectations around the homework questions and the bake-off. So let's dive in here.

[00:00:18] The overview is just explaining the character of this problem, which is essentially that we're going to give you a development set of word pairs with scores, and the scores reflect relatedness. The scores were produced by humans, and we've just scaled them into zero to one, where a larger score means more related in this human sense. And your task is essentially to develop a system that will predict scores that are highly correlated with those human scores, according to the Spearman correlation coefficient, which is the traditional metric in this space.
[00:00:47] So this is just some setup stuff for the environment, and then we introduce the development set itself, which is a pandas DataFrame; it's loaded in from your data folder. And it looks like this: it's got a bunch of word pairs, each with scores. As I said before, these are the human-provided scores. This is a development dataset in the sense that you can make whatever use you want of it: you can train systems, you can explore your results, and so forth, because, as you'll see, for the actual bake-off we have a fresh test set that you'll make predictions on.

[00:01:16] There are about 5,000 words in this development set. You can train on any subset, and you can expand the dataset with other things if you want to include them as well. It's really up to you to decide what you want to do, because this is all about making predictions on that brand new test set, as you'll see later.
[00:01:33] And I will just say that the test set has 1,500 word pairs with scores of the same type. And in terms of the overlap, I'll also tell you: no word pair in this development set is in the test set, so it's disjoint at the level of these pairs, but some of the individual words are repeated in the test set, so you do have some vocabulary overlap.
[00:01:53] In this code here, we load the full vocabulary for this thing, which is, you know, all the words appearing in all the pairs. The vocabulary for the bake-off test set is different; it's partially overlapping with the above, as I said. Now, if you wanted to make sure ahead of time that your system has a representation for every word in both the dev and the test sets, then you can check against the vocabularies in any of the vector space models that we've distributed with this unit. So, for example, if you run this code, you get the full task vocabulary, and if you have a representation for every word in there, then you're in good shape when it comes to the test set.
[00:02:31] It's also useful to look at the score distribution. This will give you a sense for what kind of space you're making predictions into, and I'll give you the hint that the test distribution looks an awful lot like this dev set distribution.
[00:02:43] It's also worth being aware that there are some repeated pairs in the training set: some pairs that have different scores associated with them and are repeated. Therefore, what I've done here is just provide you with some code that will allow you to rank pairs of words by the variance in their scores, so you could decide for yourself what you want to do about these minor inconsistencies. You could filter the dataset, or keep all of these examples in; it's entirely up to you. I will just say that the test set does not force you to confront this issue: it has no repeated pairs in it.
[00:03:15] all right and then we come to the
[00:03:17] all right and then we come to the evaluation topic here so there's a
[00:03:18] evaluation topic here so there's a central function you'll be using a lot
[00:03:20] central function you'll be using a lot in the homework and the bake off word
[00:03:22] in the homework and the bake off word related and disevaluation
[00:03:24] related and disevaluation and so there are some instructions about
[00:03:26] and so there are some instructions about how the interface works let me just give
[00:03:28] how the interface works let me just give a brief illustration
[00:03:29] a brief illustration in this cell i'm loading in one of our
[00:03:31] in this cell i'm loading in one of our count matrices it's the giga5 matrix
[00:03:35] count matrices it's the giga5 matrix and i'm going to evaluate that directly
[00:03:36] and i'm going to evaluate that directly so you can see in the next cell that
[00:03:39] so you can see in the next cell that word relatedness evaluation takes in our
[00:03:41] word relatedness evaluation takes in our development data for our test data
[00:03:44] development data for our test data and whatever vector space model you've
[00:03:46] and whatever vector space model you've developed as its two arguments and it
[00:03:48] developed as its two arguments and it returns a new version of this input here
[00:03:51] returns a new version of this input here with a column for the predictions you
[00:03:52] with a column for the predictions you made as well as this value here which is
[00:03:55] made as well as this value here which is the spearman rank correlation
[00:03:57] the spearman rank correlation coefficient that's our primary metric
[00:03:59] coefficient that's our primary metric for this unit
[00:04:01] for this unit here
[00:04:02] here is the score that i achieved not so good
[00:04:04] is the score that i achieved not so good i'm sure you'll be able to do better and
[00:04:06] i'm sure you'll be able to do better and here's a look at the count data
[00:04:07] here's a look at the count data frame with that new prediction column of
[00:04:09] frame with that new prediction column of predictions inserted into it
[00:04:13] predictions inserted into it and this is just another baseline here a
[00:04:15] and this is just another baseline here a truly random system that just predicts a
[00:04:16] truly random system that just predicts a random score and that's uh even worse
[00:04:18] random score and that's uh even worse than the simple count baseline again
[00:04:20] than the simple count baseline again you'll be able to do much better without
[00:04:22] you'll be able to do much better without much effort
[00:04:24] much effort error analysis i've provided you with
[00:04:26] error analysis i've provided you with some functions that will allow you to
[00:04:27] some functions that will allow you to look at what your system is doing in
[00:04:29] look at what your system is doing in terms of the best predictions in terms
[00:04:31] terms of the best predictions in terms of comparing against the human scores
[00:04:33] of comparing against the human scores and the worst predictions and i am
[00:04:35] and the worst predictions and i am imagining that this might help you
[00:04:36] imagining that this might help you figure out where you're doing well and
[00:04:38] figure out where you're doing well and where you're doing poorly and then you
[00:04:39] where you're doing poorly and then you can iterate on that basis
[00:04:42] can iterate on that basis that brings us to the homework questions
[00:04:44] that brings us to the homework questions so what we're trying to do here is help
[00:04:46] so what we're trying to do here is help you establish some baseline systems get
[00:04:48] you establish some baseline systems get used to the code and also think in new
[00:04:51] used to the code and also think in new and creative ways about the underlying
[00:04:52] and creative ways about the underlying problem
[00:04:54] problem our first one is positive point wise
[00:04:56] our first one is positive point wise mutual information as a baseline as
[00:04:58] mutual information as a baseline as you've seen in the materials for this
[00:05:00] you've seen in the materials for this unit point wise mutual information is a
[00:05:02] unit point wise mutual information is a very strong baseline for lots of
[00:05:04] very strong baseline for lots of different applications and it also
[00:05:06] different applications and it also embodies a kind of core insight that we
[00:05:09] embodies a kind of core insight that we see running through a lot of the methods
[00:05:10] see running through a lot of the methods that we've covered
[00:05:12] that we've covered so it's a natural and pretty strong
[00:05:13] so it's a natural and pretty strong baseline and what we're asking you to do
[00:05:14] baseline and what we're asking you to do here is simply establish that baseline
[00:05:18] here is simply establish that baseline here and throughout all of the work for
[00:05:20] here and throughout all of the work for this
[00:05:21] this course we're going to ask you to
[00:05:22] course we're going to ask you to implement things and in general we will
[00:05:24] implement things and in general we will provide you with test functions that
[00:05:26] provide you with test functions that will help you make sure you have
[00:05:27] will help you make sure you have iterated towards the solution that we're
[00:05:29] iterated towards the solution that we're looking for so you can rest assured
[00:05:32] looking for so you can rest assured that if you have meaningfully passed
[00:05:34] that if you have meaningfully passed this test
[00:05:35] this test then you'll do well in terms of the
[00:05:36] then you'll do well in terms of the overall evaluation and your code is
[00:05:38] overall evaluation and your code is functioning as expected
[00:05:41] next question is similar so again now
[00:05:43] next question is similar so again now we're exploring latent semantic analysis
[00:05:45] we're exploring latent semantic analysis and in particular we're asking you to
[00:05:47] and in particular we're asking you to build up some code that will allow you
[00:05:49] build up some code that will allow you to test different dimensionalities for a
[00:05:51] to test different dimensionalities for a given vector space input and try to get
[00:05:53] given vector space input and try to get a feel for which one is best
[00:05:55] a feel for which one is best so again you have to implement a
[00:05:57] so again you have to implement a function and then there's a test that
[00:05:58] function and then there's a test that will help you make sure that you've
[00:06:00] will help you make sure that you've implemented the correct function in case
[00:06:01] implemented the correct function in case there's any uncertainty in the
[00:06:03] there's any uncertainty in the instructions here
[00:06:06] instructions here next question as i mentioned in the
[00:06:08] next question as i mentioned in the lectures t-test re-weighting is a very
[00:06:10] lectures t-test re-weighting is a very powerful re-weighting scheme it has some
[00:06:12] powerful re-weighting scheme it has some affinities with point-wise mutual
[00:06:14] affinities with point-wise mutual information but it is different
[00:06:16] information but it is different uh and this question is just asking you
[00:06:18] uh and this question is just asking you to implement that re-weighting function
[00:06:20] to implement that re-weighting function we've given the instructions here you
[00:06:21] we've given the instructions here you might also look in vsm.py the module at
[00:06:24] might also look in vsm.py the module at the implementation of pointwise mutual
[00:06:26] the implementation of pointwise mutual information because you could adopt some
[00:06:28] information because you could adopt some of the same techniques
[00:06:30] of the same techniques i want to emphasize that you don't need
[00:06:31] i want to emphasize that you don't need the fastest possible implementation any
[00:06:33] the fastest possible implementation any working implementation will get full
[00:06:35] working implementation will get full credit but the code in vsm.py is really
[00:06:38] credit but the code in vsm.py is really nicely optimized in terms of its
[00:06:40] nicely optimized in terms of its implementation so you might want to push
[00:06:41] implementation so you might want to push yourself
[00:06:42] yourself to do something similarly efficient
[00:06:45] to do something similarly efficient but again
[00:06:46] but again as long as your function t test here
[00:06:48] as long as your function t test here passes this test you're in good shape
[00:06:52] passes this test you're in good shape and you don't need to evaluate this
[00:06:53] and you don't need to evaluate this function we're just asking you to
[00:06:54] function we're just asking you to implement it but we're assuming since
[00:06:56] implement it but we're assuming since i've said that this is a good
[00:06:57] i've said that this is a good re-weighting scheme that you'll be
[00:06:58] re-weighting scheme that you'll be curious about how it performs in the
[00:07:00] curious about how it performs in the context of the system you're
[00:07:02] context of the system you're developing
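The t-test scheme discussed here weights each cell by the gap between the observed joint probability and the probability expected under independence, scaled by the square root of the expected value. A dense, unoptimized sketch (the homework asks for your own version, and vsm.py's vectorized style is worth imitating for speed):

```python
import numpy as np

def ttest_reweight(X):
    # t-test re-weighting: (P(w,c) - P(w)P(c)) / sqrt(P(w)P(c)).
    # Assumes every row and column has at least one nonzero count,
    # so the expected values are never zero.
    X = np.asarray(X, dtype=float)
    total = X.sum()
    rowp = X.sum(axis=1, keepdims=True) / total   # P(w)
    colp = X.sum(axis=0, keepdims=True) / total   # P(c)
    expected = rowp * colp
    return (X / total - expected) / np.sqrt(expected)

X = np.array([[10.0, 0.0],
              [0.0, 10.0]])
T = ttest_reweight(X)
```

Like PMI, it rewards above-chance co-occurrence, but the square-root scaling changes how heavily rare events are weighted, which is the affinity-with-a-difference mentioned above.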
[00:07:04] developing all right for the final two questions
[00:07:05] all right for the final two questions we're asking you to think further afield
[00:07:07] we're asking you to think further afield pooled bert representations is drawing
[00:07:09] pooled bert representations is drawing on the material in this notebook here
[00:07:12] on the material in this notebook here which is just an exploration of the
[00:07:14] which is just an exploration of the ideas from bommasani et al 2020 on how
[00:07:17] ideas from bommasani et al 2020 on how to derive static representations from
[00:07:19] to derive static representations from models like bert
[00:07:20] models like bert and so what we've got here is some
[00:07:22] and so what we've got here is some starter code for you and a kind of
[00:07:24] starter code for you and a kind of skeleton for implementing your own
[00:07:26] skeleton for implementing your own version of that solution
[00:07:28] version of that solution again we're hoping that this is a
[00:07:30] again we're hoping that this is a foundation for further exploration for
[00:07:32] foundation for further exploration for you
[00:07:33] you we've got the implementation to do here
[00:07:35] we've got the implementation to do here and then a test that you can pass to
[00:07:36] and then a test that you can pass to make sure that you've implemented things
[00:07:38] make sure that you've implemented things according to the design specification
[00:07:42] according to the design specification the final question is also really
[00:07:43] the final question is also really exploratory it's called learned distance
[00:07:45] exploratory it's called learned distance functions the idea here is that much of
[00:07:47] functions the idea here is that much of the code in this notebook pushes you to
[00:07:49] the code in this notebook pushes you to think about distance in terms of things
[00:07:51] think about distance in terms of things like cosine distance or euclidean
[00:07:54] like cosine distance or euclidean distance
[00:07:55] distance but we should have in mind that the only
[00:07:56] but we should have in mind that the only formal requirement is that you have some
[00:07:59] formal requirement is that you have some function that will map a pair of vectors
[00:08:01] function that will map a pair of vectors into a real valued score
[00:08:03] into a real valued score as soon as you see things from that
[00:08:04] as soon as you see things from that perspective you realize that a whole
[00:08:06] perspective you realize that a whole world of options opens up to you and
[00:08:08] world of options opens up to you and what this question is asking you to do
[00:08:09] what this question is asking you to do is train a k nearest neighbors model on
[00:08:11] is train a k nearest neighbors model on the development data that will learn to
[00:08:13] the development data that will learn to predict scores and then you can use that
[00:08:16] predict scores and then you can use that in place of cosine or euclidean we've
[00:08:18] in place of cosine or euclidean we've walked you through how to implement that
[00:08:19] walked you through how to implement that there's a bunch of guidance here and a
[00:08:21] there's a bunch of guidance here and a few tests for the sub components if you
[00:08:23] few tests for the sub components if you follow our design
[00:08:25] follow our design again if the test pass you should be in
[00:08:26] again if the test pass you should be in good shape we're not asking you to
[00:08:28] good shape we're not asking you to evaluate this directly
[00:08:29] evaluate this directly but we're hoping that this is a
[00:08:31] but we're hoping that this is a foundation for exploring what could
[00:08:33] foundation for exploring what could be a quite productive avenue of
[00:08:36] be a quite productive avenue of solutions in this space
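To make the perspective concrete, here is a toy stand-in for the learned scorer: a hand-rolled k-nearest-neighbors regressor that memorizes dev pairs and predicts the average score of the nearest ones. The homework's own design (built on something like sklearn's KNeighborsRegressor) is the real starting point; this only illustrates the "any function from a pair of vectors to a real-valued score" idea.

```python
import numpy as np

class KNNScorer:
    # Learns pair -> relatedness score from labeled dev pairs by
    # averaging the scores of the k nearest memorized pairs.
    def __init__(self, k=3):
        self.k = k

    def fit(self, pairs, scores):
        # Represent each (u, v) pair as one concatenated feature vector.
        self.X = np.array([np.concatenate([u, v]) for u, v in pairs])
        self.y = np.asarray(scores, dtype=float)
        return self

    def predict_one(self, u, v):
        x = np.concatenate([u, v])
        dists = np.linalg.norm(self.X - x, axis=1)
        nearest = np.argsort(dists)[: self.k]
        return self.y[nearest].mean()

# toy dev data: identical vectors are maximally related
pairs = [(np.array([1.0, 0.0]), np.array([1.0, 0.0])),
         (np.array([1.0, 0.0]), np.array([0.0, 1.0])),
         (np.array([0.0, 1.0]), np.array([0.0, 1.0]))]
scores = [1.0, 0.0, 1.0]
model = KNNScorer(k=1).fit(pairs, scores)
```

Once trained, model.predict_one can simply be dropped in wherever cosine or euclidean distance was used before.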
[00:08:38] solutions in this space and then finally the original system
[00:08:40] and then finally the original system this is worth three points this is a big
[00:08:42] this is worth three points this is a big deal here you can piece together any
[00:08:44] deal here you can piece together any part of what you've done previously all
[00:08:46] part of what you've done previously all that stuff is fair game you can think in
[00:08:48] that stuff is fair game you can think in entirely original and new ways you can
[00:08:50] entirely original and new ways you can do something simple you can do something
[00:08:52] do something simple you can do something complex
[00:08:53] complex what we'd like you to do here is not
[00:08:54] what we'd like you to do here is not only provide the implementation in the
[00:08:56] only provide the implementation in the scope of this conditional so that it
[00:08:58] scope of this conditional so that it doesn't cause the autograder to fail if
[00:09:00] doesn't cause the autograder to fail if you have special requirements and so
[00:09:02] you have special requirements and so forth but we're also looking for a
[00:09:03] forth but we're also looking for a textual description
[00:09:05] textual description uh and a report on what your highest
[00:09:07] uh and a report on what your highest development set score was
[00:09:10] development set score was the idea here is that at the end of the
[00:09:11] the idea here is that at the end of the bake off the teaching team will create a
[00:09:13] bake off the teaching team will create a report that kind of analyzes across all
[00:09:16] report that kind of analyzes across all the different submissions and reflects
[00:09:18] the different submissions and reflects back to you all what worked and what
[00:09:20] back to you all what worked and what didn't and as part of that effort
[00:09:22] didn't and as part of that effort these system descriptions and
[00:09:24] these system descriptions and development scores can really help us
[00:09:25] development scores can really help us understand um how things played out
[00:09:29] understand um how things played out and that brings us to the bake off so
[00:09:30] and that brings us to the bake off so for the bake off what you really need to
[00:09:32] for the bake off what you really need to do is just run this function create bake
[00:09:34] do is just run this function create bake off submission
[00:09:35] off submission on
[00:09:36] on your vector space model here it's my
[00:09:39] your vector space model here it's my simple one count df that i loaded before
[00:09:41] simple one count df that i loaded before and as a reminder that this is an
[00:09:43] and as a reminder that this is an important piece you also need to
[00:09:44] important piece you also need to specify some distance function
[00:09:47] specify some distance function so the idea is that here my bake off
[00:09:49] so the idea is that here my bake off would be the simple submission where i'm
[00:09:50] would be the simple submission where i'm just doing a count data frame and
[00:09:53] just doing a count data frame and euclidean as my distance and when i run
[00:09:55] euclidean as my distance and when i run this function it creates a file cs224u
[00:09:59] this function it creates a file cs224u word relatedness bakeoff entry and
[00:10:01] word relatedness bakeoff entry and you'll just upload that to gradescope
[00:10:03] you'll just upload that to gradescope we'll give some instructions about that
[00:10:04] we'll give some instructions about that later on and that will be evaluated by
[00:10:07] later on and that will be evaluated by an automatic system
[00:10:08] an automatic system and that's it
Lecture 004
High-level Goals & Guiding Hypotheses | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=RiQgRJKqEhE
---
Transcript
[00:00:04] hello everyone welcome to the very first
[00:00:06] hello everyone welcome to the very first screencast of the very first unit of our
[00:00:08] screencast of the very first unit of our course we're going to be talking about
[00:00:09] course we're going to be talking about distributed word representations or
[00:00:11] distributed word representations or vector representations of words
[00:00:14] vector representations of words and for this screencast i'm just going
[00:00:15] and for this screencast i'm just going to cover some high-level goals we have
[00:00:17] to cover some high-level goals we have for this unit
[00:00:18] for this unit as well as discuss the guiding
[00:00:19] as well as discuss the guiding hypotheses not only for this unit but
[00:00:21] hypotheses not only for this unit but also hypotheses that will be with us
[00:00:23] also hypotheses that will be with us throughout the quarter
[00:00:26] what i've depicted on the slide here is
[00:00:28] what i've depicted on the slide here is our starting point both conceptually and
[00:00:30] our starting point both conceptually and computationally this is a small fragment
[00:00:32] computationally this is a small fragment of a very large word by word
[00:00:34] of a very large word by word co-occurrence matrix so along the rows
[00:00:37] co-occurrence matrix so along the rows here you have a large vocabulary of
[00:00:38] here you have a large vocabulary of words the first few are emoticons at
[00:00:40] words the first few are emoticons at least word like objects
[00:00:42] least word like objects exactly that same vocabulary is repeated
[00:00:44] exactly that same vocabulary is repeated across the columns
[00:00:46] across the columns and the cell values here give the number
[00:00:48] and the cell values here give the number of times that each row word appeared
[00:00:49] of times that each row word appeared with each column word in a very large
[00:00:52] with each column word in a very large text corpus
[00:00:53] text corpus i think the big idea that you want to
[00:00:55] i think the big idea that you want to start getting used to is that there
[00:00:56] start getting used to is that there could be meaning latent in such
[00:00:58] could be meaning latent in such co-occurrence patterns
[00:01:00] co-occurrence patterns it's not obvious to mere mortals that we
[00:01:02] it's not obvious to mere mortals that we could extract anything about meaning
[00:01:03] could extract anything about meaning from such an abstract space but we're
[00:01:05] from such an abstract space but we're going to see time and time again this is
[00:01:07] going to see time and time again this is actually a very powerful basis for
[00:01:09] actually a very powerful basis for developing meaning representations
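To make the object on the slide concrete, here is a tiny sketch of how such counts could be gathered, using sentence-level co-occurrence (one of several design choices discussed later in this unit):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(sentences):
    # Count how often each pair of words appears in the same sentence.
    counts = defaultdict(int)
    for sent in sentences:
        for w1, w2 in combinations(sent, 2):
            counts[w1, w2] += 1
            counts[w2, w1] += 1   # keep the matrix symmetric
    return counts

# a toy two-sentence corpus
corpus = [["superb", "movie"], ["superb", "acting", "movie"]]
counts = cooccurrence_counts(corpus)
```

Rows and columns of the matrix on the slide are just these counts laid out over the full vocabulary, computed from a corpus many orders of magnitude larger.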
[00:01:14] to start building intuition let's do a
[00:01:16] to start building intuition let's do a small thought experiment so imagine that
[00:01:19] small thought experiment so imagine that i give you a small lexicon of words each
[00:01:22] i give you a small lexicon of words each one of them labeled as either negative
[00:01:23] one of them labeled as either negative or positive in the sense of sentiment
[00:01:25] or positive in the sense of sentiment analysis
[00:01:26] analysis now that might be a useful resource but
[00:01:28] now that might be a useful resource but i've called this a hopeless learning
[00:01:30] i've called this a hopeless learning scenario because if i give you four new
[00:01:32] scenario because if i give you four new anonymous words to make predictions on
[00:01:35] anonymous words to make predictions on this resource over here is not useful at
[00:01:37] this resource over here is not useful at all for making predictions in fact you
[00:01:39] all for making predictions in fact you have essentially no information to go on
[00:01:42] have essentially no information to go on about what these anonymous words should
[00:01:43] about what these anonymous words should be labeled
[00:01:46] be labeled contrast that with a situation in which
[00:01:48] contrast that with a situation in which i give you that labeled lexicon but in
[00:01:50] i give you that labeled lexicon but in addition
[00:01:51] addition i give you the number of times that each
[00:01:53] i give you the number of times that each lexicon word co-occurs in some large
[00:01:55] lexicon word co-occurs in some large text corpus with the two words excellent
[00:01:58] text corpus with the two words excellent and terrible
[00:01:59] and terrible i think with that information with those
[00:02:01] i think with that information with those columns from the word by word matrix you
[00:02:04] columns from the word by word matrix you can see that you have a lot of
[00:02:05] can see that you have a lot of predictive power in fact a really simple
[00:02:08] predictive power in fact a really simple classifier or even decision rule will be
[00:02:11] classifier or even decision rule will be able to do really well at predicting
[00:02:12] able to do really well at predicting these labels if a word co occurs more
[00:02:15] these labels if a word co occurs more often with terrible than excellent call
[00:02:17] often with terrible than excellent call it negative
[00:02:18] it negative if a word co occurs with excellent more
[00:02:20] if a word co occurs with excellent more often than terrible call it positive
[00:02:22] often than terrible call it positive that's a good predictive model and now
[00:02:24] that's a good predictive model and now if i give you four new anonymous words
[00:02:27] if i give you four new anonymous words and in addition you're allowed to
[00:02:29] and in addition you're allowed to collect some co-occurrence information
[00:02:30] collect some co-occurrence information about them with respect to excellent and
[00:02:32] about them with respect to excellent and terrible then your same rule will be
[00:02:34] terrible then your same rule will be able to make really good predictions
[00:02:36] able to make really good predictions about these new anonymous words
[00:02:38] about these new anonymous words that's the sense in which we've moved to
[00:02:40] that's the sense in which we've moved to a very promising learning scenario and
[00:02:42] a very promising learning scenario and it's just a glimpse of how we could
[00:02:44] it's just a glimpse of how we could extract latent information about meaning
[00:02:46] extract latent information about meaning from these co-occurrence patterns and
[00:02:48] from these co-occurrence patterns and now just play it forward and think the
[00:02:50] now just play it forward and think the vector space models that we'll be
[00:02:52] vector space models that we'll be building will have not just two
[00:02:53] building will have not just two dimensions but hundreds or even
[00:02:55] dimensions but hundreds or even thousands of dimensions and there's no
[00:02:57] thousands of dimensions and there's no telling how much information we'll find
[00:02:59] telling how much information we'll find latent in such a high dimensional space
[00:03:03] latent in such a high dimensional space so that brings me to these high-level
[00:03:04] so that brings me to these high-level goals here first we want to begin
[00:03:06] goals here first we want to begin thinking about how these vectors could
[00:03:08] thinking about how these vectors could encode meanings of linguistic units get
[00:03:10] encode meanings of linguistic units get more used to that idea that i just
[00:03:12] more used to that idea that i just introduced you to
[00:03:14] introduced you to these are foundational concepts that
[00:03:16] these are foundational concepts that we'll be discussing not only for our
[00:03:18] we'll be discussing not only for our unit on vector space models which are
[00:03:20] unit on vector space models which are also called embeddings in modern
[00:03:22] also called embeddings in modern parlance but in fact these are
[00:03:24] parlance but in fact these are foundational concepts for all of the
[00:03:25] foundational concepts for all of the more sophisticated deep learning models
[00:03:27] more sophisticated deep learning models that we'll be discussing later on in the
[00:03:29] that we'll be discussing later on in the quarter
[00:03:30] quarter and of course i'm really hoping that
[00:03:32] and of course i'm really hoping that this material is valuable to you
[00:03:35] this material is valuable to you throughout the assignments that you do
[00:03:37] throughout the assignments that you do and also valuable for the original
[00:03:38] and also valuable for the original project work that you do in the second
[00:03:40] project work that you do in the second half of the course
[00:03:43] some guiding hypotheses let's start with
[00:03:45] some guiding hypotheses let's start with the literature i would be remiss in a
[00:03:47] the literature i would be remiss in a lecture like this if i didn't quote j.r
[00:03:49] lecture like this if i didn't quote j.r firth you shall know a word by the
[00:03:50] firth you shall know a word by the company it keeps this is a glimpse at
[00:03:54] company it keeps this is a glimpse at the kind of nominalist position that
[00:03:56] the kind of nominalist position that firth took about how to do linguistic
[00:03:58] firth took about how to do linguistic analysis he's really saying that we
[00:04:00] analysis he's really saying that we should trust distributional information
[00:04:02] should trust distributional information uh his
[00:04:04] uh his zelic harris a linguist working at
[00:04:06] zelic harris a linguist working at around the same time has an even purer
[00:04:08] around the same time has an even purer statement of this hypothesis harris said
[00:04:10] statement of this hypothesis harris said distributional statements can cover all
[00:04:12] distributional statements can cover all of the material of a language without
[00:04:14] of the material of a language without requiring support from other types of
[00:04:16] requiring support from other types of information
[00:04:18] information harris really only trusted usage
[00:04:20] harris really only trusted usage information i think we don't need to be
[00:04:22] information i think we don't need to be so extreme in our position but we can
[00:04:24] so extreme in our position but we can certainly align with harris in thinking
[00:04:26] certainly align with harris in thinking that there could be a lot about language
[00:04:28] that there could be a lot about language latent in these distributional
[00:04:30] latent in these distributional statements that is in co-occurrence
[00:04:32] statements that is in co-occurrence patterns
[00:04:33] patterns we might as well quote wittgenstein the
[00:04:35] we might as well quote wittgenstein the meaning of a word is its use in the
[00:04:37] meaning of a word is its use in the language i think that's a nice
[00:04:38] language i think that's a nice connection that wittgenstein might have
[00:04:40] connection that wittgenstein might have in mind might be a point of alignment
[00:04:42] in mind might be a point of alignment for him with firth and harris i'm not
[00:04:44] for him with firth and harris i'm not sure
[00:04:45] sure but finally here's a kind of direct
[00:04:47] but finally here's a kind of direct operationalization
[00:04:48] operationalization of our high-level hypothesis this is
[00:04:50] of our high-level hypothesis this is from one of the recommended readings by
[00:04:52] from one of the recommended readings by turney and pantel and they say if
[00:04:54] turney and pantel and they say if units of text have similar vectors in a
[00:04:56] units of text have similar vectors in a text frequency matrix like the
[00:04:58] text frequency matrix like the co-occurrence matrix i showed you before
[00:05:00] co-occurrence matrix i showed you before then they tend to have similar meanings
[00:05:02] then they tend to have similar meanings if we buy that hypothesis then we're
[00:05:04] if we buy that hypothesis then we're kind of licensed to build these
[00:05:06] kind of licensed to build these co-occurrence matrices and then make
[00:05:08] co-occurrence matrices and then make inferences about at least similarity of
[00:05:10] inferences about at least similarity of meaning on the basis of those objects
[00:05:12] meaning on the basis of those objects we've constructed
[00:05:15] to finish here under the heading of
[00:05:17] to finish here under the heading of great power a great many design choices
[00:05:20] great power a great many design choices i think one of the difficult things
[00:05:21] i think one of the difficult things about working in this space is that
[00:05:22] about working in this space is that there are a lot of moving pieces the
[00:05:24] there are a lot of moving pieces the first choice you'll have to make is your
[00:05:26] first choice you'll have to make is your matrix design i've talked about the word
[00:05:28] matrix design i've talked about the word by word matrix but of course word by
[00:05:30] by word matrix but of course word by document word by search proximity
[00:05:33] document word by search proximity adjective by modified noun these are all
[00:05:35] adjective by modified noun these are all different ways that you could construct
[00:05:37] different ways that you could construct your rows and your columns in one of
[00:05:38] your rows and your columns in one of these matrices and that's going to be
[00:05:40] these matrices and that's going to be really fundamental you'll capture very
[00:05:42] really fundamental you'll capture very different distributional facts depending
[00:05:44] different distributional facts depending on what kind of matrix design you choose
[00:05:47] on what kind of matrix design you choose and in a way that's not even the first
[00:05:48] and in a way that's not even the first choice that you need to make because in
[00:05:50] choice that you need to make because in constructing this matrix you'll make a
[00:05:52] constructing this matrix you'll make a lot of choices about how to tokenize
[00:05:54] lot of choices about how to tokenize whether to annotate what to do whether
[00:05:56] whether to annotate what to do whether to do part of speech tagging for further
[00:05:57] do part of speech tagging for further distinctions parsing feature selection
[00:06:00] distinctions parsing feature selection and so forth and so on you also have to
[00:06:01] and so forth and so on you also have to decide how you're going to group your
[00:06:03] decide how you're going to group your texts is your notion of co-occurrence
[00:06:05] texts is your notion of co-occurrence going to be based on the sentence or the
[00:06:07] going to be based on the sentence or the document or maybe documents clustered by
[00:06:10] document or maybe documents clustered by date or author or discourse context all
[00:06:12] date or author or discourse context all of those things will give you very
[00:06:14] of those things will give you very different notions of what it means to
[00:06:16] different notions of what it means to co-occur and that will feed into your
[00:06:18] co-occur and that will feed into your matrix design
[00:06:20] matrix design having made all of those difficult
[00:06:22] having made all of those difficult choices now you're probably going to
[00:06:23] choices now you're probably going to want to take your count matrix and as
[00:06:25] want to take your count matrix and as we'll say re-weight it that is adjust
[00:06:28] we'll say re-weight it that is adjust the values by stretching and bending the
[00:06:30] the values by stretching and bending the space in order to find more latent
[00:06:33] space in order to find more latent information about meaning we're going to
[00:06:34] information about meaning we're going to talk about a lot of methods for doing
[00:06:36] talk about a lot of methods for doing that and then you might furthermore want
[00:06:39] that and then you might furthermore want to do some kind of dimensionality
[00:06:40] to do some kind of dimensionality reduction which is a step you could take
[00:06:42] reduction which is a step you could take to capture even more higher order
[00:06:44] to capture even more higher order notions of co-occurrence beyond the
[00:06:47] notions of co-occurrence beyond the simple co-occurrences that you see
[00:06:48] simple co-occurrences that you see evident in the original matrix that's a
[00:06:51] evident in the original matrix that's a powerful step there are a lot of choices
[00:06:53] powerful step there are a lot of choices you could make there
[00:06:54] you could make there and then finally what's your notion of
[00:06:56] and then finally what's your notion of similarity going to be for us we'll
[00:06:58] similarity going to be for us we'll operationalize that as a vector
[00:07:00] operationalize that as a vector comparison method like euclidean
[00:07:01] comparison method like euclidean distance cosine distance jacquard
[00:07:04] distance cosine distance jacquard distance and so forth
[00:07:06] distance and so forth depending on previous choices that
[00:07:07] depending on previous choices that you've made the choice of vector
[00:07:09] you've made the choice of vector comparison method might have a real
[00:07:11] comparison method might have a real impact on what you regard as similar and
[00:07:13] impact on what you regard as similar and different in your vector space
[00:07:15] different in your vector space so this is a kind of dizzying array of
[00:07:17] so this is a kind of dizzying array of choices that you might have to make
[00:07:19] choices that you might have to make there is a glimmer of hope though so
[00:07:22] there is a glimmer of hope though so models like glove and word vec purport
[00:07:25] models like glove and word vec purport to offer packaged solutions at least to
[00:07:28] to offer packaged solutions at least to the design weighting and reduction steps
[00:07:30] the design weighting and reduction steps here
[00:07:31] here so they'll tell you for instance if you
[00:07:32] so they'll tell you for instance if you use glove that it needs to be word by
[00:07:34] use glove that it needs to be word by word and then glove will simultaneously
[00:07:36] word and then glove will simultaneously perform these two steps and furthermore
[00:07:39] perform these two steps and furthermore for these methods since they tend to
[00:07:40] for these methods since they tend to deliver vectors that are pretty well
[00:07:42] deliver vectors that are pretty well scaled in terms of their individual
[00:07:44] scaled in terms of their individual values the choice of vector comparison
[00:07:46] values the choice of vector comparison might not matter so much so models like
[00:07:49] might not matter so much so models like glove and workvec are a real step
[00:07:51] glove and workvec are a real step forward in terms of taming this space
[00:07:53] forward in terms of taming this space here
[00:07:54] here and we can add further that more recent
[00:07:56] and we can add further that more recent contextual embedding models dictate even
[00:07:59] contextual embedding models dictate even more of the design choices possibly all
[00:08:01] more of the design choices possibly all the way back to how you tokenize and so
[00:08:03] the way back to how you tokenize and so they could be um thought of as even more
[00:08:06] they could be um thought of as even more unified solutions to the great many
[00:08:08] unified solutions to the great many design choices that you have here so
[00:08:10] design choices that you have here so that's kind of conceptually a real
[00:08:13] that's kind of conceptually a real breakthrough i will say though that
[00:08:16] breakthrough i will say though that baseline models constructed from the
[00:08:17] baseline models constructed from the simple things that i have in these
[00:08:19] simple things that i have in these tables here are often competitive with
[00:08:21] tables here are often competitive with more these more advanced models but of
[00:08:23] more these more advanced models but of course which combination is something
[00:08:25] course which combination is something that you'll probably have to discuss
[00:08:27] that you'll probably have to discuss to have to discover empirically
Lecture 005
Matrix Designs | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=ladnEW0ntEM
---
Transcript
[00:00:04] Hello everyone, and welcome to part two of our series of screencasts on distributed word representations. The focus of this screencast will be on matrix designs.

[00:00:14] Let's start with the word-by-word design that we concentrated on in part one. So here again we have our vocabulary along the rows; that same vocabulary is repeated along the columns; and the cell values capture the number of times that each row word co-occurred with each column word in some large collection of texts.
[00:00:33] texts this matrix will have two properties
[00:00:34] this matrix will have two properties that i think make it noteworthy for
[00:00:36] that i think make it noteworthy for developing semantic representations the
[00:00:38] developing semantic representations the first is it will be very dense and as we
[00:00:41] first is it will be very dense and as we bring in more data from ever larger
[00:00:43] bring in more data from ever larger corporate will get denser and denser in
[00:00:45] corporate will get denser and denser in virtue of the fact that more words will
[00:00:47] virtue of the fact that more words will tend to co-occur with more other words
[00:00:49] tend to co-occur with more other words in these this ever larger collection of
[00:00:51] in these this ever larger collection of documents
[00:00:53] documents the second is that it kind of has the
[00:00:54] the second is that it kind of has the nice property that its dimensionality
[00:00:56] nice property that its dimensionality will remain fixed even as we bring in
[00:00:58] will remain fixed even as we bring in more data as long as we decide on the
[00:01:00] more data as long as we decide on the vocabulary ahead of time all we'll be
[00:01:02] vocabulary ahead of time all we'll be doing is incrementing individual cell
[00:01:04] doing is incrementing individual cell values and so we can bring in as much
[00:01:05] values and so we can bring in as much data as we want without changing the
[00:01:08] data as we want without changing the fundamental design of the object
[00:01:10] fundamental design of the object both of those other things are points of
[00:01:12] both of those other things are points of contrast with another common design that
[00:01:14] contrast with another common design that you see in the literature especially in
[00:01:16] you see in the literature especially in information retrieval and that is the
[00:01:18] information retrieval and that is the word by document design
[00:01:20] word by document design for this design again i have words along
[00:01:22] for this design again i have words along the rows but my columns are now
[00:01:24] the rows but my columns are now individual documents and the cell values
[00:01:27] individual documents and the cell values capture the number of times that each
[00:01:28] capture the number of times that each word occurs in each one of those
[00:01:30] word occurs in each one of those documents
[00:01:31] documents as you can imagine this is a very sparse
[00:01:34] as you can imagine this is a very sparse matrix in contrast to the word by word
[00:01:36] matrix in contrast to the word by word one that we just looked at in virtue of
[00:01:38] one that we just looked at in virtue of the fact that most words don't appear in
[00:01:40] the fact that most words don't appear in most documents
[00:01:41] most documents it will also have the property that as
[00:01:43] it will also have the property that as we bring in more data in the form of
[00:01:45] we bring in more data in the form of more documents
[00:01:46] more documents the shape of the matrix will change
[00:01:48] the shape of the matrix will change we'll be adding column dimensions for
[00:01:50] we'll be adding column dimensions for each new document that we bring into the
[00:01:52] each new document that we bring into the space and that could really affect the
[00:01:55] space and that could really affect the kind of complications that we can do the
[00:01:57] kind of complications that we can do the only thing that balances against the
[00:01:59] only thing that balances against the ever increasing size of this matrix is
[00:02:01] ever increasing size of this matrix is that because it is so sparse we might
[00:02:02] that because it is so sparse we might have some easy and efficient ways of
[00:02:04] have some easy and efficient ways of storing it efficiently putting it on par
[00:02:06] storing it efficiently putting it on par with a much more compact but dense word
[00:02:09] with a much more compact but dense word by word matrix that i showed you before
[00:02:12] by word matrix that i showed you before now those are two very common designs
[00:02:14] now those are two very common designs that you see in the literature but i
[00:02:15] that you see in the literature but i want you to think creatively and kind of
[00:02:17] want you to think creatively and kind of align your matrix design with it at
[00:02:19] align your matrix design with it at whatever problem you're trying to solve
[00:02:21] whatever problem you're trying to solve so let me show you one that's really
[00:02:22] so let me show you one that's really radically different
[00:02:24] radically different this is what i've called the word by
[00:02:25] this is what i've called the word by discourse context matrix
[00:02:27] discourse context matrix i derived this from the switchboard
[00:02:29] i derived this from the switchboard dialogue act corpus which is the
[00:02:31] dialogue act corpus which is the switchboard corpus where each dialogue
[00:02:33] switchboard corpus where each dialogue act has been annotated by an expert
[00:02:35] act has been annotated by an expert annotator with the sort of dialogue act
[00:02:37] annotator with the sort of dialogue act or speech act that was performed by that
[00:02:39] or speech act that was performed by that utterance
[00:02:41] utterance what that allows us to do is collect
[00:02:43] what that allows us to do is collect a matrix where the rows are again words
[00:02:45] a matrix where the rows are again words but the columns are those individual
[00:02:47] but the columns are those individual labels that annotators assigned i think
[00:02:50] labels that annotators assigned i think this is a really interesting matrix i
[00:02:52] this is a really interesting matrix i think if you appear even at this small
[00:02:53] think if you appear even at this small fragment you can see some interesting
[00:02:55] fragment you can see some interesting information emerging so for example
[00:02:57] information emerging so for example absolutely occurs a lot in acceptance
[00:03:00] absolutely occurs a lot in acceptance dialogue acts whereas more hedged words
[00:03:02] dialogue acts whereas more hedged words like actually in any way are more common
[00:03:05] like actually in any way are more common in things like rejecting part of a
[00:03:07] in things like rejecting part of a previous utterance and i'm sure there
[00:03:09] previous utterance and i'm sure there are lots of other interesting patterns
[00:03:10] are lots of other interesting patterns in this matrix
[00:03:12] in this matrix and of course that's just a glimpse of
[00:03:14] and of course that's just a glimpse of the many other design choices that you
[00:03:16] the many other design choices that you could make again think creatively you
[00:03:18] could make again think creatively you could have something like adjective by
[00:03:19] could have something like adjective by modified noun this would probably
[00:03:21] modified noun this would probably capture some very local syntactic
[00:03:23] capture some very local syntactic information or collocational information
[00:03:26] information or collocational information we could generalize that a bit to word
[00:03:28] we could generalize that a bit to word by syntactic context to explicitly try
[00:03:30] by syntactic context to explicitly try to model how words associate with
[00:03:33] to model how words associate with specific syntactic structures
[00:03:35] specific syntactic structures be very different from our usual
[00:03:37] be very different from our usual semantic goals for this course
[00:03:39] semantic goals for this course word by search query might be a design
[00:03:41] word by search query might be a design that you use information retrieval we
[00:03:43] that you use information retrieval we don't even have to limit this to
[00:03:44] don't even have to limit this to linguistic objects word by person could
[00:03:46] linguistic objects word by person could capture the number of times that each
[00:03:48] capture the number of times that each person
[00:03:49] person purchased a specific set of products and
[00:03:52] purchased a specific set of products and then we could cluster people or products
[00:03:54] then we could cluster people or products on that basis
[00:03:56] on that basis we could also mix linguistic and
[00:03:57] we could also mix linguistic and non-linguistic things so word by person
[00:03:59] non-linguistic things so word by person might capture different usage patterns
[00:04:01] might capture different usage patterns for individual speakers and again again
[00:04:03] for individual speakers and again again allow us to doing some kind of
[00:04:05] allow us to doing some kind of interesting clustering of words or of
[00:04:07] interesting clustering of words or of people
[00:04:08] people we could also break out of two
[00:04:10] we could also break out of two dimensions we could have something like
[00:04:11] dimensions we could have something like word by word by pattern or verb by
[00:04:14] word by word by pattern or verb by subject by object many of the methods
[00:04:16] subject by object many of the methods that we cover in this unit are easily
[00:04:19] that we cover in this unit are easily generalized to more than two dimensions
[00:04:21] generalized to more than two dimensions so you could have that in mind and of
[00:04:22] so you could have that in mind and of course as i said think creatively and
[00:04:24] course as i said think creatively and think in particular about how your
[00:04:26] think in particular about how your matrix design is aligned with whatever
[00:04:29] matrix design is aligned with whatever modeling goal you have or whatever
[00:04:31] modeling goal you have or whatever hypothesis you're pursuing
[00:04:33] hypothesis you're pursuing another connection that i want to make
[00:04:35] another connection that i want to make is that even though this feels like a
[00:04:36] is that even though this feels like a kind of modern idea in nlp
[00:04:39] kind of modern idea in nlp vector representations of words are
[00:04:41] vector representations of words are actually are of objects are actually
[00:04:44] actually are of objects are actually pervasive not only throughout machine
[00:04:46] pervasive not only throughout machine learning but also throughout science
[00:04:48] learning but also throughout science right so think back to older modes of
[00:04:50] right so think back to older modes of nlp where we would write a lot of
[00:04:52] nlp where we would write a lot of feature functions we'll be exploring
[00:04:54] feature functions we'll be exploring such techniques they can be quite
[00:04:55] such techniques they can be quite powerful even though they feel very
[00:04:58] powerful even though they feel very different from the distributional
[00:04:59] different from the distributional hypotheses that we've been pursuing in
[00:05:01] hypotheses that we've been pursuing in fact they also represent individual data
[00:05:04] fact they also represent individual data points as vectors so for example given
[00:05:06] points as vectors so for example given the text like the movie was horrible i
[00:05:09] the text like the movie was horrible i might reduce that with my feature
[00:05:10] might reduce that with my feature functions to a vector that looks like
[00:05:12] functions to a vector that looks like this and i might know as a human that
[00:05:14] this and i might know as a human that four captures the number of words
[00:05:17] four captures the number of words zero captures the number of proper names
[00:05:19] zero captures the number of proper names and one over four captures the
[00:05:21] and one over four captures the percentage of negative words according
[00:05:23] percentage of negative words according to some sentiment lexicon that's a human
[00:05:26] to some sentiment lexicon that's a human level understanding of this in fact
[00:05:28] level understanding of this in fact those dimensions will acquire a meaning
[00:05:30] those dimensions will acquire a meaning to the extent that they assemble them
[00:05:32] to the extent that they assemble them into a vector space model
[00:05:34] into a vector space model and the column wise elements are
[00:05:36] and the column wise elements are compared with each other
[00:05:38] compared with each other so even though the origins of the data
[00:05:40] so even though the origins of the data are very different in fact this is just
[00:05:42] are very different in fact this is just like
[00:05:43] like vector representations of words in the
[00:05:45] vector representations of words in the way we've been discussing it the same
[00:05:47] way we've been discussing it the same thing happens in experimental sciences
[00:05:49] thing happens in experimental sciences where you might have an experimental
[00:05:50] where you might have an experimental subject come in and perform some act in
[00:05:52] subject come in and perform some act in the lab they do a complicated physical
[00:05:54] the lab they do a complicated physical and human thing and you reduce it down
[00:05:56] and human thing and you reduce it down to a couple of numbers like a choice
[00:05:58] to a couple of numbers like a choice they made or a reaction time and a
[00:06:01] they made or a reaction time and a choice and so forth
[00:06:03] choice and so forth we might model entire humans or entire
[00:06:05] we might model entire humans or entire entire organisms with a vector of
[00:06:07] entire organisms with a vector of numbers representing their physical
[00:06:10] numbers representing their physical characteristics and perspectives and
[00:06:12] characteristics and perspectives and outlooks and so forth again we might
[00:06:14] outlooks and so forth again we might know what these individual column
[00:06:16] know what these individual column dimensions mean but they acquire a
[00:06:18] dimensions mean but they acquire a meaning when we're doing modeling only
[00:06:20] meaning when we're doing modeling only to the extent that they are embedded in
[00:06:22] to the extent that they are embedded in a matrix and can be compared to each
[00:06:24] a matrix and can be compared to each other across the columns
[00:06:26] other across the columns and so forth there are many other
[00:06:28] and so forth there are many other examples of this where essentially
[00:06:30] examples of this where essentially fundamentally all of our representations
[00:06:32] fundamentally all of our representations are vector representations so maybe the
[00:06:34] are vector representations so maybe the far out idea for this unit is just that
[00:06:37] far out idea for this unit is just that we can gather interesting vector
[00:06:39] we can gather interesting vector representations without all of the hand
[00:06:41] representations without all of the hand built work that goes into the examples
[00:06:43] built work that goes into the examples on the slide right now
[00:06:46] A final technical point: a question that you should ask, and that's kind of separate from your particular matrix design, is what is going to count as co-occurrence. I think there are at least two design choices that are really important when answering this question. To illustrate them, let's use this small example. I have this text, "from swerve of shore to bend of bay, brings", and imagine that our focus word at our particular point of analysis is this token of the word "to". The indices here indicate, going left and right, the distance in words from that particular focus word.
[00:07:22] from that particular focus word the first question that you want to
[00:07:23] the first question that you want to decide is what your window of
[00:07:25] decide is what your window of co-occurrence is going to be so for
[00:07:27] co-occurrence is going to be so for example if you set your window to 3
[00:07:29] example if you set your window to 3 then the things that are within three
[00:07:31] then the things that are within three distance of your focus word will
[00:07:33] distance of your focus word will co-occur with that word and everything
[00:07:35] co-occur with that word and everything falling outside of that window will not
[00:07:37] falling outside of that window will not co-occur with that word according to
[00:07:39] co-occur with that word according to your analysis
[00:07:41] your analysis if you make your window really big it
[00:07:42] if you make your window really big it might encompass the entire document if
[00:07:44] might encompass the entire document if you make it very small it might
[00:07:46] you make it very small it might encompass only very local kind of
[00:07:48] encompass only very local kind of collocational information so you can bet
[00:07:51] collocational information so you can bet that that's going to be meaningful
[00:07:53] that that's going to be meaningful there's a separate choice that you can
[00:07:55] there's a separate choice that you can make falling under the heading of
[00:07:56] make falling under the heading of scaling i think a default choice for
[00:07:59] scaling i think a default choice for scaling is to just call it flat so what
[00:08:01] scaling is to just call it flat so what you're saying there is something is
[00:08:03] you're saying there is something is going to co-occur once with your focus
[00:08:05] going to co-occur once with your focus word if it's in the window that you've
[00:08:07] word if it's in the window that you've specified and that would kind of equally
[00:08:09] specified and that would kind of equally weight all of the things that are in the
[00:08:11] weight all of the things that are in the window you could also decide to scale
[00:08:13] window you could also decide to scale them a common scaling pattern would be
[00:08:15] them a common scaling pattern would be one over n where n is the distance by
[00:08:18] one over n where n is the distance by word from your focus word that would
[00:08:21] word from your focus word that would have the effect that things occurred
[00:08:22] have the effect that things occurred that occur close to the word of interest
[00:08:24] that occur close to the word of interest a co-occur with it more than things that
[00:08:26] a co-occur with it more than things that are at the edges that are near the end
[00:08:28] are at the edges that are near the end of the window
[00:08:32] Those choices are going to have really profound effects on the kinds of representations that you develop. Here are some generalizations I could offer. Larger, flatter windows will capture more semantic information; as the window gets very large, to encompass for example the entire document, you'll be capturing essentially topical information. In contrast, if you make your window very small and scaled, you'll tend to capture more syntactic or collocational information.

[00:08:59] Independently of these choices, you could decide how text boundaries are going to be involved. A text boundary at the level of a sentence or a paragraph or a document or a corpus could be a hard boundary that's independent of your window, or you could decide that you're going to allow your window to go across the different notions of segment that you have. That's really up to you, and again I think it will have major consequences for downstream tasks involving the representations that you've created.
[00:09:24] representations that you've created to help you begin exploring this space
[00:09:26] to help you begin exploring this space the associated code released for this
[00:09:28] the associated code released for this course the associated notebooks provide
[00:09:31] course the associated notebooks provide you with four word by word matrices they
[00:09:34] you with four word by word matrices they have a few things that allow you to do
[00:09:36] have a few things that allow you to do comparisons first there are two matrices
[00:09:38] comparisons first there are two matrices that were developed from the yelp
[00:09:40] that were developed from the yelp academic data set which is a lot of
[00:09:42] academic data set which is a lot of reviews of products and services and
[00:09:44] reviews of products and services and there are two matrices that come from
[00:09:46] there are two matrices that come from gigaword which is newswire text so
[00:09:48] gigaword which is newswire text so there's fundamentally a real difference
[00:09:50] there's fundamentally a real difference in the genre of text involved
[00:09:53] in the genre of text involved in addition for each of those pairs of
[00:09:54] in addition for each of those pairs of corp
[00:09:55] corp for each of those corpora we have two
[00:09:57] for each of those corpora we have two different designs
[00:09:58] different designs window size of five and scaling of one
[00:10:01] window size of five and scaling of one over n which ought to by my hypotheses
[00:10:03] over n which ought to by my hypotheses deliver a lot of kind of collocational
[00:10:05] deliver a lot of kind of collocational or syntactic information
[00:10:08] or syntactic information and window size of 20 and scaling a flat
[00:10:10] and window size of 20 and scaling a flat a very large window lots of things
[00:10:13] a very large window lots of things co-occurring with lots of other things
[00:10:15] co-occurring with lots of other things that might be a better basis for
[00:10:16] that might be a better basis for semantics and you have those two points
[00:10:18] semantics and you have those two points of variation both for yelp and for giga
[00:10:20] of variation both for yelp and for giga word and i'm hoping that that kind of
[00:10:22] word and i'm hoping that that kind of gives you a sense for how these design
[00:10:24] gives you a sense for how these design choices affect the representations that
[00:10:27] choices affect the representations that you're able to develop
[00:10:28] you're able to develop with methods that we're going to cover
[00:10:30] with methods that we're going to cover in later parts of the screencast series
Lecture 006
Vector Comparison | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=eKvbYOc2rOs
---
Transcript
[00:00:05] Welcome back everyone. This is part three in our series on distributed word representations. We're going to be talking about vector comparison methods. To try to make this discussion pretty intuitive, I'm going to ground things in this running example. On the left I have a very small vector space model: we have three words, A, B, and C, and you can imagine that we've measured two dimensions, dx and dy; you could think of them as documents if you wanted.

[00:00:28] There are two perspectives that you might take on this vector space model. The first is just at the level of raw frequency: B and C seem to be united, in that they are frequent in both the x and the y dimension, whereas A is comparatively infrequent along both those dimensions. That's the first perspective. The second perspective, though, is more subtle. You might observe that if we correct for the overall frequency of the individual words, then it's actually A and B that are united, because they both have a bias, in some sense, for the dy dimension, whereas by comparison C has a bias for the dx dimension, again thinking proportionally. Both of those are perspectives that we might want to capture, and different notions of distance will key into one or the other of them.
[00:01:16] the other of them one more preliminary uh i think it's
[00:01:18] one more preliminary uh i think it's very intuitive to depict these vector
[00:01:20] very intuitive to depict these vector spaces and in only two dimensions that's
[00:01:22] spaces and in only two dimensions that's pretty easy you can imagine that this is
[00:01:24] pretty easy you can imagine that this is the dx dimension along the x-axis and
[00:01:26] the dx dimension along the x-axis and this is the d-y dimension along the
[00:01:28] this is the d-y dimension along the y-axis and then i have placed these
[00:01:30] y-axis and then i have placed these individual points in that plane and then
[00:01:33] individual points in that plane and then you can see graphically that b and c are
[00:01:35] you can see graphically that b and c are pretty close together and a is kind of
[00:01:37] pretty close together and a is kind of lonely down here in the corner the
[00:01:39] lonely down here in the corner the infrequent one
[00:01:42] Let's start with Euclidean distance, a very common notion of distance in these spaces, and quite intuitive. We can measure the Euclidean distance between vectors u and v, if they share the same dimension n, by calculating the sum of the squared element-wise absolute differences and then taking the square root of that. That's the math here; let's look at it in terms of this space. Here we have our vector space depicted graphically, a, b, and c, and Euclidean distance is measuring the length of these lines. I've annotated them with the full calculations, but the intuition is just that we are measuring the length of these lines, the most direct path between these points in our high-dimensional space. And you can see that Euclidean distance is capturing the first perspective that we took on the vector space, which unites the frequent items b and c as against the infrequent one, a.
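A minimal NumPy sketch of this calculation on the running example. The counts for a, b, and c are hypothetical stand-ins, since the slide's actual values aren't in the transcript:

```python
import numpy as np

def euclidean(u, v):
    # Square root of the sum of squared element-wise differences.
    return np.sqrt(np.sum((u - v) ** 2))

# Hypothetical counts: a is infrequent on both dimensions,
# while b and c are frequent on both.
a = np.array([2.0, 4.0])
b = np.array([10.0, 15.0])
c = np.array([14.0, 10.0])

print(euclidean(b, c))  # the frequent words b and c come out close
print(euclidean(a, b))  # the infrequent a is far from both
```

`np.linalg.norm(u - v)` computes the same quantity in one call.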
[00:02:37] As a stepping stone toward cosine distance, which will behave quite differently, let's talk about length normalization. Given a vector u of dimension n, the L2 length of u is the sum of the squared values in that vector, and then we take the square root. That's our normalization quantity, and the actual normalization of the original vector u involves taking each one of its elements and dividing it by that fixed quantity, the L2 length.
[00:03:05] Let's look at what happens to our little illustrative example. On the left here I have the original count matrix, and in this column I've given the L2 length as a quantity. When we take that quantity and divide each one of the values in that vector to get its L2 norm, you can see that we've done something significant to the space: they're all kind of united on the same scale, and a and b are now close together, whereas b and c are comparatively far apart. So that is capturing the second perspective that we took on the matrix, where a and b have something in common as against c. And that has come entirely from the normalization step: if we measured Euclidean distance in this space, just the length of the lines between these points, we would again be capturing that a and b are alike and b and c are comparatively different.
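L2 normalization can be sketched as follows, again with hypothetical counts standing in for the slide's values:

```python
import numpy as np

def l2_normalize(u):
    # Divide each element by the L2 length, sqrt of the sum of squared values.
    return u / np.linalg.norm(u)

a = np.array([2.0, 4.0])    # hypothetical counts, as before
b = np.array([10.0, 15.0])
c = np.array([14.0, 10.0])

# After norming, a and b (both biased toward dy) become the close pair:
d_ab = np.linalg.norm(l2_normalize(a) - l2_normalize(b))
d_bc = np.linalg.norm(l2_normalize(b) - l2_normalize(c))
print(d_ab, d_bc)
```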
[00:03:54] Cosine kind of does that all in one step. So the cosine distance, or approximately a distance, as you'll see, between two vectors u and v of shared dimension n: this calculation has two parts. This is the similarity calculation, cosine similarity, and it is the dot product of the two vectors divided by the product of their L2 lengths. And then to get something like a distance, we just take one and subtract out that similarity.
[00:04:20] Again, let's ground this in our example. Here we have the original count vector space model, and what we do with cosine distance is essentially measure the angles between these lines that I've drawn from the origin point. And so you can see that cosine distance is capturing the fact that a and b are close together as measured by this angle, whereas b and c are comparatively far apart. So again, with cosine we're abstracting away from frequency information and keying into that abstract notion of similarity that connects a and b as against c.
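The two-part calculation, sketched in NumPy with the same hypothetical counts as before:

```python
import numpy as np

def cosine_distance(u, v):
    # One minus cosine similarity: dot product over product of L2 lengths.
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - sim

a = np.array([2.0, 4.0])    # hypothetical counts, as before
b = np.array([10.0, 15.0])
c = np.array([14.0, 10.0])

print(cosine_distance(a, b))  # small angle: a and b point the same way
print(cosine_distance(b, c))  # larger angle, despite similar frequency
```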
[00:04:58] Another perspective that you could take is to observe that if we first normalize the vectors via the L2 norm and then apply the cosine calculation, we change the space as I showed you before, so they're all up here, kind of on the unit sphere. And notice that the actual values that we get out are the same whether or not we did that L2-norming step, and that is because cosine is building the effects of L2 norming directly into the normalization in the denominator.
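That invariance is easy to check numerically; the vectors here are hypothetical stand-ins:

```python
import numpy as np

def cosine_distance(u, v):
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - sim

u = np.array([2.0, 4.0])     # hypothetical vectors
v = np.array([10.0, 15.0])

# Cosine gives the same answer on raw and L2-normed vectors, because the
# denominator already divides out each vector's length.
raw = cosine_distance(u, v)
normed = cosine_distance(u / np.linalg.norm(u), v / np.linalg.norm(v))
print(raw, normed)
```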
[00:05:33] There are a few other methods, or classes of methods, that we could think about. I think we don't need to get distracted by the details, but I thought I would mention them in case they come up when you're reading the research. The first class are what I've called matching-based methods. They're all kind of based in this matching coefficient, and then Jaccard, Dice, and overlap are terms that you might see in the literature. These are often defined only for binary vectors, but here I've given their generalizations to the real-valued vectors that we're talking about.
[00:06:00] The other class of methods that you might see come up are probabilistic methods, which tend to be grounded in this notion of KL divergence. KL divergence is essentially a way of measuring the distance between two probability distributions, or to be more precise, from a reference distribution p to some other probability distribution q. And it has symmetric relatives: symmetric KL, and Jensen-Shannon distance, which is another symmetric notion that's based in KL divergence. Again, these are probably appropriate measures to choose if the quantities that you're thinking of are appropriately thought of as probability values.
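These probabilistic measures can be sketched with their standard textbook definitions (the transcript doesn't give the slide's exact formulas, so take these as the usual versions; zeros are assumed absent from the distributions):

```python
import numpy as np

def kl_divergence(p, q):
    # KL divergence from reference distribution p to q (assumes q has no zeros).
    return np.sum(p * np.log2(p / q))

def js_distance(p, q):
    # Jensen-Shannon distance: sqrt of the JS divergence, which averages the
    # KL divergence of p and q from their mixture m; symmetric in p and q.
    m = 0.5 * (p + q)
    return np.sqrt(0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
print(kl_divergence(p, q), kl_divergence(q, p))  # asymmetric
print(js_distance(p, q), js_distance(q, p))      # symmetric
```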
[00:06:41] Now, I've alluded to the fact that the cosine distance measure that I gave you before is not quite what's called a proper distance metric. Let me expand on that a little bit. To qualify as a proper distance metric, a vector comparison method has to have three properties. It needs to be symmetric, that is, it needs to give the same value for (x, y) as it does for (y, x); KL divergence actually fails that first rule. It needs to assign zero to identical vectors. And crucially, it needs to satisfy what's called the triangle inequality, which says that the distance between x and z is less than or equal to the distance between x and y plus the distance between y and z.
[00:07:23] Cosine distance, as I showed it to you before, fails to satisfy the triangle inequality, and this is just a simple example that makes that intuitive: it just happens that this distance here is actually greater than the sum of these two values, which is a failure of the triangle inequality.
[00:07:40] Now, this is relatively easily corrected, but this is also kind of a useful framework for all the different choices that we could make. Of all the options for vector comparison, suppose we decided to favor the ones that counted as true distance metrics. Then that would at least push us to favor Euclidean distance, Jaccard (for binary vectors only), and Jensen-Shannon distance if we were talking about probabilistic spaces, and we would further amend the definition of cosine distance to the more careful one that I've given here, which satisfies the triangle inequality as well as the other two criteria. By this way of dividing up the world, we would also reject matching, Jaccard, Dice, overlap, KL divergence, and symmetric KL divergence as ones that fail to be proper distance metrics. And so that might be a useful framework for thinking about choices in this space.
[00:08:33] about choices in this space one other point in relation to this this
[00:08:36] one other point in relation to this this is obviously a more involved calculation
[00:08:38] is obviously a more involved calculation than the one that i gave you before and
[00:08:40] than the one that i gave you before and in truth it is probably not worth the
[00:08:42] in truth it is probably not worth the effort here's an example of just a bunch
[00:08:44] effort here's an example of just a bunch of vectors that i sampled from one of
[00:08:46] of vectors that i sampled from one of our vector space models and i've
[00:08:48] our vector space models and i've compared the improper cosine distance
[00:08:50] compared the improper cosine distance that i showed you before on the x-axis
[00:08:52] that i showed you before on the x-axis with the proper cosine distance measure
[00:08:55] with the proper cosine distance measure that i just showed you
[00:08:56] that i just showed you and the correlation between the two is
[00:08:58] and the correlation between the two is almost perfect so there is essentially
[00:09:00] almost perfect so there is essentially no difference between these two
[00:09:02] no difference between these two different ways of measuring cosine
[00:09:05] different ways of measuring cosine and i think that they are probably
[00:09:06] and i think that they are probably essentially identical up to ranking
[00:09:08] essentially identical up to ranking which is often the quantity that we care
[00:09:10] which is often the quantity that we care about when we're doing these comparisons
[00:09:12] about when we're doing these comparisons so probably stick with a simpler and
[00:09:14] so probably stick with a simpler and less involved calculation would be my
[00:09:16] less involved calculation would be my advice
[00:09:18] Let's close with some generalizations and relationships. First, Euclidean, as well as Jaccard and Dice with raw count vectors, will tend to favor raw frequency over other distributional patterns, like that more abstract one that I showed you with our illustrative example. Euclidean with L2-normed vectors is equivalent to cosine when it comes to ranking, which is just to say that if you want to use Euclidean and you first L2-norm your vectors, you're probably just doing something that might as well just be the cosine calculation.
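That rank equivalence is easy to verify numerically; the vectors here are randomly generated stand-ins, not course data:

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.random((20, 5)) + 0.1   # stand-in count vectors
target = rng.random(5) + 0.1

# Euclidean distance after L2-norming every vector...
normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
t_normed = target / np.linalg.norm(target)
euc = np.linalg.norm(normed - t_normed, axis=1)

# ...versus plain cosine distance on the raw vectors:
cos = 1.0 - vecs @ target / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(target))

print((np.argsort(euc) == np.argsort(cos)).all())  # → True: same ranking
```

On unit vectors, squared Euclidean distance is 2(1 − cosine similarity), a monotone transform, which is why the rankings coincide.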
[00:09:49] Jaccard and Dice are equivalent with regard to ranking; that's something to keep in mind. And then this is maybe a more fundamental point that you'll see recurring throughout this unit. Both L2 norming, and also a related calculation that would just create probability distributions out of the rows, can be useful steps, as we've seen, but they can obscure differences in the amount or strength of evidence that you have, which can in turn affect the reliability of, for example, cosine, normed Euclidean, or KL divergence. These shortcomings might be addressed through weighting schemes, though.
[00:10:20] But here's the bottom line: there is valuable information in raw frequency. If we abstract away from it, some other information might come to the surface, but we also might lose that important frequency information in distorting the space in that way, and it can be difficult to balance these competing pressures.
[00:10:39] Finally, I'll just close with some code snippets. Our course repository has lots of handy utilities for doing these distance calculations, and also for length-norming your vectors and so forth. It also has this function called neighbors in the vsm module. It allows you to pick a target word and supply a vector space model, and then it will give you a full ranking of the entire vocabulary in that vector space with respect to your target word, starting with the ones that are closest. So here are the results for "bad" using cosine distance, in cell 12, and Jaccard distance as well, in cell 13, and I would just like to say that these neighbors don't look especially intuitive to me. It does not look like this analysis is revealing really interesting semantic information. But don't worry: we're going to correct this. We're going to start to massage and stretch and bend our vector space models, and we will see much better results for these neighbor functions and everything else as we go through that material.
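The real neighbors utility lives in the course repository's vsm module; the idea can be sketched as follows, assuming for simplicity that the vector space model is just a dict from word to vector (the toy words and counts here are invented for illustration):

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def neighbors(word, vsm, distfunc=cosine_distance):
    # Simplified sketch of the course's vsm.neighbors: rank the entire
    # vocabulary by distance to the target word, closest first.
    target = vsm[word]
    dists = {w: distfunc(vec, target) for w, vec in vsm.items()}
    return sorted(dists.items(), key=lambda kv: kv[1])

toy = {
    "bad":   np.array([10.0, 2.0, 1.0]),
    "awful": np.array([9.0, 3.0, 1.0]),
    "good":  np.array([1.0, 2.0, 9.0]),
}
print(neighbors("bad", toy))  # "bad" itself first, then "awful", then "good"
```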
Lecture 007
Basic Reweighting | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=dv559tVBQRk
---
Transcript
[00:00:05] Welcome back, everyone. This is part four in our series on distributed word representations. We're going to be talking about basic re-weighting schemes. Essentially, I feel like we've been faithful to the underlying counts in our matrices for too long; it's time to start messing with them.
[00:00:18] Here are some high-level goals that we have for re-weighting. We would like, in these matrices, to amplify the associations that are important and trustworthy and unusual, while correspondingly de-emphasizing the things that are mundane or quirky or reflect errors or idiosyncrasies in the data that we use. Now, of course, absent a defined objective function in the machine learning sense, this is going to remain a fuzzy goal, but we do have some quantitative hooks, I think. We have this guiding intuition that we would like to move away from raw counts, because frequency alone is generally a poor proxy for the kind of semantic information that we hope to extract. So we can ask, for each of the re-weighting schemes that we consider, first: how does it compare to the underlying raw count values? If the scheme is just rescaling the underlying counts, it's probably not worth the effort. On the other hand, if it gives us a very different distribution, then at least we know that we're cooking with fire when it comes to moving away from raw frequency.
[00:01:12] There's a related question that I would like us to have in mind: what is the overall distribution of values that the re-weighting scheme delivers? Count distributions are very skewed, in a way that can make them difficult to deal with for lots of analytic and machine learning methods. So we might hope that re-weighting, in addition to capturing things that are important and de-emphasizing things that are mundane, would also give us an overall distribution of values that is more tractable for these downstream applications.
[00:01:38] And then finally, I personally have a goal that we do no feature selection based on counts or outside resources like stop-word dictionaries. I don't want to be filtering off parts of the vocabulary a priori, because for all I know, something that's a boring stop word for one genre is actually an important content word for another. We would like the method to make that decision.
[00:02:02] so let's start with the most basic kind
[00:02:03] so let's start with the most basic kind of scheme and this is a scheme that will
[00:02:05] of scheme and this is a scheme that will pay attention only to the row context
[00:02:07] pay attention only to the row context this is normalization so this is
[00:02:09] this is normalization so this is actually a repeat from the lecture on
[00:02:11] actually a repeat from the lecture on vector comparison l2 norming we have
[00:02:13] vector comparison l2 norming we have calculate the l2 length as a fixed
[00:02:15] calculate the l2 length as a fixed quantity for each row vector and then
[00:02:17] quantity for each row vector and then the length normalization of that row
[00:02:19] the length normalization of that row vector is just taking each value in the
[00:02:21] vector is just taking each value in the original vector and dividing it by that
[00:02:23] original vector and dividing it by that fixed quantity the l2 length
[00:02:27] There's a related and perhaps more familiar notion, which I've called probability distribution, where we follow the same logic: we just replace that normalizing constant, the L2 length, with the sum of all the elements in the vector. But again, we do element-wise division by that fixed quantity to normalize the vector into a probability distribution.
[00:02:46] I think both of these methods can be powerful, but the shame of them is that they pay attention only to the row context. For a given cell (i, j), we're looking just across the row i; we're not considering the context that could come from the column j. So let's begin to correct that omission.
[00:03:02] correct that omission here's kind of the star of our show in a
[00:03:04] here's kind of the star of our show in a quiet sense this is the first scheme
[00:03:06] quiet sense this is the first scheme we'll look at that pays attention to
[00:03:07] we'll look at that pays attention to both row and column context this is
[00:03:09] both row and column context this is observed over expected let's just go
[00:03:12] observed over expected let's just go through this notation here we have the
[00:03:13] through this notation here we have the row sum i think that's intuitive
[00:03:15] row sum i think that's intuitive correspondingly the column sum the sum
[00:03:17] correspondingly the column sum the sum of all values along the column and then
[00:03:19] of all values along the column and then the sum for some matrix x is just the
[00:03:21] the sum for some matrix x is just the sum of all the cell values in that
[00:03:23] sum of all the cell values in that matrix
[00:03:24] Those are the raw materials for calculating what's called the expected value. The expected value, given a matrix X, for cell (i, j) is the row sum times the column sum as the numerator, divided by the sum of all the values in the matrix. This is an expected quasi-count: it gives us the number we would expect if the row and column were independent of each other in the statistical sense, and that's the sense in which this is an expectation. The observed over expected value simply divides the observed value, in the numerator, by that expected value.
[00:03:58] numerator by that expected value so in a bit more detail here's how the
[00:04:00] so in a bit more detail here's how the calculations work we've got this tiny
[00:04:02] calculations work we've got this tiny little count matrix here let's look at
[00:04:04] little count matrix here let's look at cell xa it's got a count of 34. that's
[00:04:07] cell xa it's got a count of 34. that's our observed count over here in the
[00:04:08] our observed count over here in the numerator the denominator is the product
[00:04:11] numerator the denominator is the product of the row sum and the column sum 45 by
[00:04:14] of the row sum and the column sum 45 by 81 divided by the sum of all the values
[00:04:17] 81 divided by the sum of all the values in this matrix which is 99. we repeat
[00:04:20] in this matrix which is 99. we repeat that calculation for all the other cells
[00:04:22] that calculation for all the other cells making the corresponding adjustments and
[00:04:24] making the corresponding adjustments and that gives us a completely reweighted
[00:04:26] that gives us a completely reweighted matrix
[00:04:28] matrix here's the intuition that was the
[00:04:30] here's the intuition that was the calculation let's think about why we
[00:04:32] calculation let's think about why we might want to do this so i've got here a
[00:04:34] might want to do this so i've got here a highly idealized little count matrix and
[00:04:37] highly idealized little count matrix and the conceit of this example is that keep
[00:04:39] the conceit of this example is that keep tabs in english is an idiom and
[00:04:41] tabs in english is an idiom and otherwise the word tabs alone doesn't
[00:04:43] otherwise the word tabs alone doesn't appear with many other words it's kind
[00:04:45] appear with many other words it's kind of constrained to this idiomatic context
[00:04:48] of constrained to this idiomatic context so we get a really high count for keep
[00:04:50] so we get a really high count for keep tabs and a relatively low count for
[00:04:52] tabs and a relatively low count for enjoy tabs again because tabs doesn't
[00:04:54] enjoy tabs again because tabs doesn't really associate with the word enjoy
[00:04:57] On the right here, I've got the expected calculation, and it comes out just as we would hope. The expected count for 'keep tabs' is merely 12.48; compare that with the observed count of 20. 'Keep tabs' is over-represented relative to our expectations, in virtue of the fact that the independence assumption built into the expected calculation is just not met here, because of the collocational effect. Similarly, the expected count for 'enjoy tabs' is 8.5; that's much larger than our observation, again because these are kind of disassociated with each other, in virtue of the restricted distribution of 'tabs'.
[00:05:33] tabs and that brings us to the really the
[00:05:35] and that brings us to the really the star of our shown in fact the star of a
[00:05:36] star of our shown in fact the star of a lot of the remainder of this unit this
[00:05:38] lot of the remainder of this unit this is point y is mutual information or pmi
[00:05:42] is point y is mutual information or pmi pmi is simply observed over expected in
[00:05:44] pmi is simply observed over expected in log space where we stipulate that the
[00:05:46] log space where we stipulate that the log of zero is zero in a bit more detail
[00:05:49] log of zero is zero in a bit more detail for a matrix x given cell i j the pmi
[00:05:52] for a matrix x given cell i j the pmi value is the log
[00:05:54] value is the log of the observed count over the expected
[00:05:56] of the observed count over the expected count and that's it
[00:05:58] Many people find it more intuitive to think of this in probabilistic terms; that's what I've done over here on the right. It's equivalent numerically, but for this kind of calculation we first form a joint probability table by just dividing all the cell values by the sum of the values in all the cells. That gives us the joint probability table, and then the row probability and the column probability are just the sums across the row and the column, respectively. And again, we multiply them, and that's kind of nice, because then you can see we really are testing an independence assumption. It's as though we say we can multiply these probabilities because they're independent; if the distribution is truly independent, the observed and expected values will match, and of course discrepancies are the things that these matrices will highlight.
[00:06:40] Let's look at an example. There's one thing that I want to track as we work through this example, and that's the cell down here, this lonely little 1. So this is a count matrix; I've got this as a word-by-document matrix, but this is a very flexible method and we'll apply it to lots of matrix designs. Over here I form the joint probability table, and I've got here the column sum and the row sum, corresponding to the column and row probabilities. These are the raw ingredients for the PMI matrix, which is derived down here by applying this calculation to all these values.
[00:07:10] Notice what's happened: that lonely 1 down here, because it's in a very infrequent row and a relatively infrequent column, has the largest PMI value in the resulting matrix. Now, that could be good, because this could be a very important event, in which case we want to amplify it. On the other hand, NLP being what it is, this could be just a mistake in the data or something, and then this exaggerated value could turn out to be problematic. It's difficult to strike this balance, but it's worth keeping in mind as you work with this method that it could amplify not only important things but also idiosyncratic things.
[00:07:45] Positive PMI is an important variant of PMI, so important in fact that I would like to think of it as the kind of default view that we take on PMI, for the following reason: PMI is actually undefined where the count is 0, because we need to take the log of zero. So we had to stipulate that the log of zero was zero for this calculation.
[00:08:04] However, that's arguably not coherent if you think about what the underlying matrix represents. What we're saying with PMI is that larger-than-expected values get a large PMI, and smaller-than-expected values get a smaller PMI. That's good, but when we encounter a zero, we place it right in the middle, and that's just strange, because a zero isn't evidence of anything larger or smaller; it doesn't deserve to be in the middle of this scale. If anything, we just don't know what to do with the zero values. So this is arguably sort of incoherent, and the standard response is to simply turn all of the negative values into zeros. That's positive PMI, as defined here: we simply lop off all the negative values by mapping them to zero. And that at least restores the overall coherence of the claims: all we're doing is reflecting the fact that larger-than-expected counts have large positive PMI, and the rest are put at zero.
[00:08:58] at zero let's look briefly at a few other
[00:08:59] let's look briefly at a few other re-weighting schemes starting with the
[00:09:01] re-weighting schemes starting with the t-test the t-test is something that
[00:09:02] t-test the t-test is something that you'll work with on the first assignment
[00:09:04] you'll work with on the first assignment you'll implement it it turns out to be a
[00:09:05] you'll implement it it turns out to be a very good re-rating scheme and i like it
[00:09:08] very good re-rating scheme and i like it because it obviously reflects many of
[00:09:09] because it obviously reflects many of the same intuitions that guide the pmi
[00:09:11] the same intuitions that guide the pmi and observed over expected calculations
[00:09:14] TF-IDF is quite different. This is typically performed on word-by-document matrices, in the context of information retrieval. Given some corpus of documents D, we're going to say that the term frequency for a given cell is that value divided by the sum of all the values in the column, giving us a kind of probability of the word given the document that we're in. And then the IDF value is the log of this quantity here: the number of documents in our corpus, that is, the column dimensionality, divided by the number of documents that contain the target word. And again, we map log of zero to zero. The TF-IDF value is the product of those two values.
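Those two definitions can be sketched as follows. This assumes a word-by-document matrix in which every word appears in at least one document, and it is illustrative rather than the course's own code:

```python
import numpy as np

def tfidf(X):
    """X is a word-by-document count matrix (rows are words, columns are documents)."""
    # term frequency: each cell over its column sum, i.e. P(word | document)
    tf = X / X.sum(axis=0, keepdims=True)
    # inverse document frequency: log(num documents / num documents containing the word);
    # a word appearing in every document gets log(1) = 0, zeroing out its whole row
    # (assumes every word occurs somewhere, so the log's argument is never 0)
    n_docs = X.shape[1]
    df = (X > 0).sum(axis=1, keepdims=True)
    idf = np.log(n_docs / df)
    return tf * idf

X = np.array([[1.0, 1.0],   # appears in both documents -> idf = 0
              [2.0, 0.0]])  # appears in one of two documents -> idf = log 2
```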
[00:09:52] I think this can be an outstanding method for very large, sparse matrices like the word-by-document one. Conversely, it is typically not well behaved for very dense matrices like the word-by-word ones that we're favoring in this course. The reason is this IDF value: it's very unlikely that you would have a word that appeared in literally every document. However, in the context of very dense word-by-word matrices, it is possible for some words to co-occur with every single other word, in which case you get an IDF value of zero, which is probably not the intended outcome for something that's high frequency but might nonetheless be important in the context of individual documents. So I'd probably steer away from TF-IDF unless you're working with a sparse matrix design.
[00:10:39] And then, even further afield from the things we've discussed, you might explore using, for example, pairwise distance matrices, where I calculate the cosine distance between every pair of words along the rows and form a matrix on that basis. Really different in its approach, and probably in its outcomes, but it could be very interesting.
[00:10:58] Let's return to our central questions. Remember, for each one of these re-weightings we want to ask: how does it compare to the raw count values, and what overall distribution of values does it deliver? So let's do a bit of an assessment of that. I'm working with the giga5 matrix that you can load as part of the course materials; that's Gigaword with a window of 5 and scaling of 1/n.
[00:11:19] Up here on the left I have the raw counts, with the cell value along the x-axis and the number of cells that have that value along the y-axis, and you can see that for raw counts it's a very difficult distribution. First of all, this goes all the way up to about a hundred million, and, starting from zero, most things have values that are close to zero; then you have this very long, thin tail of things that are very high frequency. This kind of highly skewed distribution is difficult for many machine learning methods, both in terms of the skew toward zero and very low values, and in terms of the range of these x-axis values. So we would like to move away from it; that's one motivating reason.
[00:11:59] When we look at L2 norming and probability distributions, they do kind of the same thing: they constrain the cell values to be between 0 and 1, or roughly between 0 and 1, but they still have a heavy skew toward things that are very small in their adjusted values.
[00:12:17] Observed over expected is more extreme in that regard, as is TF-IDF. The observed-over-expected values range quite high, up to almost 50,000, which is somewhat better than the raw counts but still very large in terms of spread, and we still have that heavy skew toward zero. TF-IDF solves the range problem down here, because it's highly constrained in its set of values, but it still has a very heavy skew, looking a lot like the raw count distribution.
[00:12:45] From this perspective, it looks like PMI and positive PMI are really steps forward. First of all, for PMI, the distribution of cell values has this nice, sort of normal shape, and the values themselves are pretty constrained, to about -10 to 10. And then for positive PMI, we simply lop off all the negative values and map them to zero, so it's more skewed toward zero, but not nearly as skewed as all these other methods that we're looking at. So PMI and PPMI are looking like good choices here, just from the point of view of departing from the raw counts and giving a tractable distribution.
[00:13:22] Here's another perspective, where we directly compare, in these matrices, the co-occurrence count on a log scale (so it's viewable) with the resulting reweighted cell value. What we're looking for here, presumably, is an overall lack of correlation. I think we find that L2 norming and probabilities are pretty good on this score: they have fairly low correlations, and they make good use of a large part of the scale that they operate on.
[00:13:48] Observed over expected has a low correlation with the cell counts, which looks good initially, but it has this problem that the cell values are kind of strangely distributed, and this correlation value might not even be especially meaningful given that we have a few outliers and then a whole lot of things that are close to zero. And TF-IDF is frankly similar: a low correlation, but maybe not so trustworthy in terms of that correlation value. Fundamentally, again, these look like difficult distributions of values to work with. Again, PMI and positive PMI look really good.
[00:14:19] Relatively low correlation, so we've done something meaningful, and both of these are making meaningful use of a substantial part of the overall space that they operate in. We have lots of different combinations of cell values and underlying co-occurrence counts; there's something of a correlation, but that could be good, and we're not locked into that correlation, so we've done something meaningful.
[00:14:41] To wrap up, let's go through some relationships and generalizations, just as reminders here. A theme running through nearly all of these schemes is that we want to reweight a cell value relative to the values we expect given the row and the column, and we would like to make use of both of those notions of context. The magnitude of the counts might be important: just think about how 1 out of 10 as a bit of evidence and 1,000 out of 10,000 as a bit of evidence might be very different situations in terms of the evidence that you have gathered. Creating probability distributions and length-normalizing will obscure that difference, and that might be something that you want to dwell on.
[00:15:21] PMI and observed over expected will amplify the values of counts that are tiny relative to the rows and columns they're in. That could be good, because that might be what you want to do: find the things that are really important and unusual. Unfortunately, with language data, we have to watch out that they might be noise.
[00:15:39] And finally, TF-IDF severely punishes words that appear in many documents. It behaves oddly for dense matrices, which can include the word-by-word matrices that we're working with, so you might proceed with caution with that particular re-weighting scheme in the context of this course.
[00:15:54] Finally, some code snippets. I'm just showing off that our vsm module in the course repository makes it really easy to do these re-weighting schemes: all the ones that we've talked about, and more, in fact.
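For readers who want the idea in code: here's a minimal sketch of positive PMI re-weighting in plain NumPy. This is my own illustration of the scheme discussed above, not the vsm module's implementation.

```python
import numpy as np

def ppmi(X):
    """Positive pointwise mutual information re-weighting.

    X is a nonnegative word-by-context count matrix. Cells with zero
    counts, and cells whose PMI is negative, are mapped to 0.
    """
    X = np.asarray(X, dtype=float)
    total = X.sum()
    # Expected co-occurrence under independence of row and column.
    expected = np.outer(X.sum(axis=1), X.sum(axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(X * total / expected)
    pmi[~np.isfinite(pmi)] = 0.0      # zero counts -> log(0) -> clamp to 0
    return np.maximum(pmi, 0.0)       # the "positive" part: clip negatives
```

On a toy matrix where two words only ever co-occur with themselves, the associated cells get boosted above zero while the unassociated cells stay at zero.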
[00:16:05] And returning to the end of our vector comparison screencast, you might recall that I looked at the neighbors of 'bad' in this Yelp-5 matrix, and it really didn't look good: it does not look especially semantically coherent. When I take those underlying counts and just adjust them by positive PMI, I start to see something that looks quite semantically coherent, and I think we're starting to see the promise of these methods. And this is really just the beginning in terms of surfacing semantically coherent and interesting information from these underlying counts.
Lecture 008
Dimensionality Reduction | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=5Bx5UhrJbJI
---
Transcript
[00:00:05] Hello everyone, welcome back. This is part five in our series on distributed word representations. We're going to be talking about dimensionality reduction techniques. We saw in the previous screencast that re-weighting is a powerful tool for finding latent semantic information in count matrices. We're going to push that even further: the promise of dimensionality reduction techniques is that they can capture higher-order notions of co-occurrence, corresponding to even deeper sorts of semantic relatedness.
[00:00:33] There's a wide world of these dimensionality reduction techniques. I've chosen three that we're going to focus on as interesting representatives of a much larger space. We'll look at latent semantic analysis, which is a classic linear method. Then we'll talk about autoencoders, a newer, powerful deep learning method for learning reduced-dimensional representations. And then finally GloVe, which is a simple yet very powerful method that, as you'll see, has a deep connection to pointwise mutual information. And I'm going to close by talking briefly about visualization, which is another kind of dimensionality reduction technique that we might use for very different purposes.
[00:01:10] So let's begin with latent semantic analysis, a classic method. The paper is due to Deerwester et al. 1990. That's a classic paper that really made a splash; LSA is now one of the oldest and most widely used dimensionality reduction techniques, not only in scientific research but also in industry. I think it was really eye-opening for people at the time of the paper's appearance to see just how powerful this technique could be, especially in contexts involving information retrieval. The method is also known as truncated singular value decomposition, and I'll explain why that is in a second. The final thing I want to say at this high level is just that LSA remains a very powerful baseline, especially when part of a pipeline with other re-weighting methods, so it should probably be in your results table, and it's often very difficult to beat.
[00:01:59] Now, I think we can't, in the time allotted to us, cover all of the technical details surrounding latent semantic analysis. In my experience, this would be kind of the culmination of a full course in linear algebra. But I do think I can convey the guiding intuitions, and that will help you with responsible use of the method.
[00:02:15] So let's imagine that we have this simple two-dimensional vector space model. I've got four points, A, B, C, and D, arrayed in this two-dimensional space. I think we're all familiar with fitting linear models, which capture the largest source of variation in the data; that's this orange line here. The perspective I would encourage you to take is that we can think of that linear regression model as performing dimensionality reduction, in that it encourages us to project points like B and C down onto that line. And in projecting them down onto that line, essentially abstracting away from their variation along the y-axis, we can see the sense in which they are abstractly similar: they're close together in this reduced-dimensional space.
[00:02:57] Now, with the linear model, we captured the source of greatest variation in this little dataset. In a higher-dimensional space, we could continue fitting lines to other sources of variation in the data, other axes of variation. So here's a blue line that captures the next dimension, and we could again project points like A and C down onto that line, and that would capture the abstract sense in which A and C, although very spread out along the x dimension, are very close together along the y dimension. And of course, if we had more dimensions in this vector space model, we could continue to perform these cuts and dimensionality reductions, capturing ever more abstract notions of similarity along these different axes. And that is, in essence, what LSA is going to do for us in our really large matrices.
[00:03:42] The fundamental method, as I said, is singular value decomposition. This is a theorem from linear algebra that says any matrix A of dimension m-by-n can be decomposed into the product of three matrices, T, S, and D, with the dimensions given. Here's a more concrete example. Start with this matrix of dimension three-by-four. We have the term matrix, which is full of length-normalized orthogonal vectors. We have this matrix of singular values along the diagonal; they are organized from largest to smallest, corresponding to the greatest to least source of variation in the data. And then we have the document, or column-wise, matrix, which is also length-normalized and orthogonal in its space. And the theorem here is that we can reconstruct A from these three matrices. Of course, we don't want to precisely reconstruct A; that probably wouldn't accomplish very much for us. But what we can do is use this to learn reduced-dimensional representations of A, by being selective about which term and singular value dimensions we include in the model.
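The decomposition claim is easy to check directly in NumPy; the small matrix below is a made-up example, not the one on the slide.

```python
import numpy as np

# A small word-by-document matrix (3 x 4), invented for illustration.
A = np.array([[2., 0., 1., 0.],
              [0., 3., 0., 1.],
              [1., 0., 2., 0.]])

# SVD: A = T @ diag(s) @ D, with orthonormal columns in T and rows in D.
T, s, D = np.linalg.svd(A, full_matrices=False)

# The singular values come back ordered from largest to smallest.
assert np.all(np.diff(s) <= 0)

# The theorem: the product reconstructs A (up to floating-point error).
A_hat = T @ np.diag(s) @ D
assert np.allclose(A, A_hat)
```

Truncation, discussed next, simply drops the trailing entries of `s` and the corresponding columns of `T` before taking the product.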
[00:04:46] Let me walk you through an example of how that happens, and first let me motivate this a little bit with an idealized linguistic case. So I've got up here a word-by-document matrix. Its vocabulary is 'gnarly', 'wicked', 'awesome', 'lame', and 'terrible', and the conceit of my example is that both 'gnarly' and 'wicked' are positive terms, so they tend to co-occur with 'awesome' and not co-occur with 'lame' and 'terrible'. However, 'gnarly' and 'wicked' never occur in the same document. The idea is that 'gnarly' is a slang positive term associated with the west coast of the United States, and 'wicked' is a slang term associated with the east coast of the United States. In virtue of that idealized dialect split, they never occur in the same document, but nonetheless they have similar neighbors in this vector space, and that's the kind of abstract notion of co-occurrence that we want to capture.
[00:05:34] If we simply use our standard distance measures and re-weighting techniques and so forth, we will not capture that more abstract notion of co-occurrence. Here are the distances in this raw vector space for 'gnarly', 'awesome', 'terrible', and 'wicked': 'wicked' is farther away from 'gnarly' even than 'terrible' is. So we've got a sentiment confusion, and really just not the result we were shooting for.
[00:05:55] So we perform singular value decomposition into these three matrices, and then the 'truncated' part is that we're going to consider just the first two dimensions of the term matrix, corresponding to these two singular values, capturing the top two sources of variation in the data. So we multiply those together, and we get this reduced-dimensional matrix down here, two by the size of the vocabulary. And if we do distance measures in that space, just as we were hoping, 'gnarly' and 'wicked' are now neighbors: the method has captured that more abstract notion of having the same neighbors as the other word.
[00:06:34] In the previous lecture, I encouraged you to think about what you're doing to a matrix when you perform some kind of re-weighting scheme. Let's extend that to these dimensionality reduction techniques. So here's a picture of what LSA does, starting with a raw count distribution over here. If I just run LSA on that raw count distribution, I get what looks also like a very difficult distribution of values. The values are very spread out, and they have a lot of the mass centered around zero here, corresponding to the peak in the raw counts over here near zero as well. So that doesn't look like we've done very much in terms of taming the kind of intractable, skewed distribution we started with.
[00:07:13] However, if instead we take the raw counts and first feed them through PMI, which as we saw before gives us this nice distribution of values, highly constrained along the x-axis, and then we run LSA, we retain a lot of those good properties. The values are somewhat more spread out, but still nicely distributed. This looks like a much happier input to downstream analytic methods than the top version here, and I think this is beginning to show that it can be powerful to pipeline re-weighting and dimensionality reduction techniques.
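A compact sketch of this pipelining idea, with invented counts: the point is that PPMI compresses the range of a badly skewed count matrix before LSA is applied. Both helper functions are illustrative versions, not the vsm module's code.

```python
import numpy as np

def ppmi(X):
    # Positive PMI re-weighting (illustrative version).
    total = X.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(X * total / np.outer(X.sum(axis=1), X.sum(axis=0)))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

def lsa(X, k):
    # Truncated SVD giving k-dimensional row representations.
    T, s, _ = np.linalg.svd(X, full_matrices=False)
    return T[:, :k] * s[:k]

# A deliberately skewed count matrix: mostly small counts, one huge one.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(40, 30)).astype(float) + 1.0
counts[0, 0] = 10_000.0

reweighted = ppmi(counts)        # values now live on a modest log scale
reduced = lsa(reweighted, k=5)   # pipeline: counts -> PPMI -> LSA
```

Running LSA directly on `counts` would let that one giant value dominate the top singular dimension; feeding PPMI output in instead gives the downstream step a far better-behaved input.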
[00:07:44] Another note I would want to make: how do you choose the dimensionality for LSA? It has this variable k, corresponding to the number of dimensions that you keep. If you read the literature on LSA, they often imagine what I've called the dream scenario, where you plot the singular values and you see that a lot of them are very large and then there's a sudden drop-off. If you do see this, then it's obvious that you should pick the point of the sudden drop-off as your k; so here you would pick k = 20, and you'd be confident that you had captured almost all the variation in your data in the reduced-dimensional space you were creating. Unfortunately, for the kinds of matrices and problems that we're looking at, I really never see the dream scenario. What I see looks something much more like this, where you have kind of a sudden drop-off early, then a long decline, and maybe a sudden drop-off at the end, and it's basically totally unclear where in the space you should pick k. And the result is that k is often chosen kind of empirically, as a hyperparameter tuned against whatever problem you're actually trying to solve. If in doing this work you do see the dream scenario, please do write to me; it would be very exciting to see that happen.
[00:08:51] LSA is just one of a large family of matrix decomposition methods. Here's a list of a few of them. A lot of them are implemented in scikit-learn, in its decomposition library, and I would encourage you to try them out and just see how they perform on problems that you're trying to solve.
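A quick sketch of trying a few of scikit-learn's decomposition methods on a stand-in count matrix; the particular methods and settings here are just examples, chosen because they share the same fit_transform interface and are easy to swap in and out.

```python
import numpy as np
from sklearn.decomposition import NMF, PCA, TruncatedSVD

# Stand-in for a word-by-context count matrix (random, for illustration).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(60, 30)).astype(float)

models = [
    TruncatedSVD(n_components=10),                        # LSA-style SVD
    NMF(n_components=10, init="nndsvda", max_iter=500),   # nonnegative
    PCA(n_components=10),                                 # centered SVD
]
for model in models:
    Z = model.fit_transform(X)        # 10-dimensional row representations
    assert Z.shape == (60, 10)
```

Because they all expose the same interface, you can drop any of them into the same evaluation loop and compare results directly.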
[00:09:09] And finally, here's a little bit of code: vsm.lsa, with k set to 100, gives me back a reduced-dimensional version of that matrix, keeping the same vocabulary, of course, but now with only 100 column dimensions.
[00:09:24] Let's move to autoencoders. This will be a point of contrast with LSA, because this is a much more powerful method. So here's the overview: autoencoders are a flexible class of deep learning architectures for learning reduced-dimensional representations. If you want to hear much more about this class of models, I would encourage you to read chapter 14 of the Goodfellow et al. book Deep Learning; it has a lot of details and a lot of variations on this theme.
[00:09:48] Here is the basic autoencoder model. The input would be, say, the rows in our matrices, so this could be the counts or something that you've done to the counts. Those are fed through a hidden layer of representation, and then the goal of this model is to try to literally reconstruct the input. Now, that might be trivial if h had the same dimensionality as x, but the whole idea here is that you're going to feed the input through a very narrow pipe and then try to reconstruct the input. Given that you're feeding it through a potentially very, very narrow pipe, it's unlikely that you'll be able to fully reconstruct the inputs, but the idea is that the model will learn to reconstruct the important sources of variation in performing this autoencoding step.
[00:10:35] And then, when we use these models for representation learning in the mode that we've been in for this unit, the representation that we choose is this hidden unit here. We typically don't care about what was reconstructed on the output, but rather only about the hidden, reduced-dimensional representation that the model learned. This slide has a bunch of other annotations on it, and the reason I included them is that the course repository includes a reference implementation of an autoencoder, and all the other deep learning models that we cover, in pure NumPy, and so if you wanted to understand all of the technical details of how the model was constructed and optimized, you could use this as a kind of cheat sheet to understand how the code works.
[00:11:13] I think the fundamental idea that you want to have is simply that the model is trying to reconstruct its inputs. The error signal that we get is the difference between the reconstructed and actual input, and that error signal is what we use to update the parameters of the model.
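Here's a minimal NumPy autoencoder capturing exactly that idea: reconstruct the input through a narrow hidden layer, and use the reconstruction error to update the parameters. This is my own toy sketch, not the course's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))        # toy inputs: 20 examples, 8 features
d, k = 8, 3                         # squeeze 8 dimensions through 3

# Parameters of h = tanh(x W1 + b1) and x_hat = h W2 + b2.
W1 = rng.normal(scale=0.1, size=(d, k)); b1 = np.zeros(k)
W2 = rng.normal(scale=0.1, size=(k, d)); b2 = np.zeros(d)

def forward(X):
    H = np.tanh(X @ W1 + b1)        # hidden, reduced-dimensional code
    return H, H @ W2 + b2           # code and reconstruction

initial_mse = np.mean((forward(X)[1] - X) ** 2)

lr = 0.1
for _ in range(2000):
    H, X_hat = forward(X)
    err = X_hat - X                 # error signal: reconstruction minus input
    # Gradient steps on the squared reconstruction error, backpropagated
    # through the linear output layer and the tanh hidden layer.
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

H, X_hat = forward(X)               # H is the representation we keep
final_mse = np.mean((X_hat - X) ** 2)
```

The reconstructions `X_hat` are discarded once training is done; `H` is the reduced-dimensional representation we actually use.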
[00:11:29] The final thing I would mention here is that it could be very difficult for this model if you feed in the raw count vectors: they have very high dimensionality, and their distribution is highly skewed, as we've seen. So it can be very productive to do a little bit of re-weighting, and maybe even dimensionality reduction with LSA, before you start feeding inputs into this model. Of course, it could still be meaningful, even if you've done LSA as a pre-processing step, to learn a hidden-dimensional representation, because this model is presumably capable of learning even more abstract notions than LSA is, in virtue of its non-linearity at this hidden layer.
[00:12:05] And here's a bit of code just showing how this works, using both the reference implementation that I mentioned as well as a faster and more flexible torch autoencoder, which is also included in the course repository. I think the only interface thing to mention here is that these models have a fit method, like all the other machine learning models, but the fit method returns the hidden-dimensional representation, the target for our learning in this context. That's a bit non-standard, but it's the intended application for this kind of representation.
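To make that interface concrete, here is a minimal sketch in plain NumPy. This is a hypothetical toy, not the course's reference or torch implementation: one tanh hidden layer, a linear decoder, and a `fit` method that, as described, returns the hidden representation rather than `self`:

```python
import numpy as np

class TinyAutoencoder:
    """Toy shallow autoencoder (hypothetical sketch, not the course code).
    `fit` returns the hidden representation, mirroring the slightly
    non-standard contract described in the lecture."""

    def __init__(self, hidden_dim=2, max_iter=500, eta=0.01, seed=42):
        self.hidden_dim = hidden_dim
        self.max_iter = max_iter
        self.eta = eta
        self.rng = np.random.RandomState(seed)

    def fit(self, X):
        n, d = X.shape
        h = self.hidden_dim
        # Small random initialization for encoder and decoder weights.
        W1 = self.rng.uniform(-0.1, 0.1, (d, h))
        W2 = self.rng.uniform(-0.1, 0.1, (h, d))
        for _ in range(self.max_iter):
            H = np.tanh(X @ W1)      # hidden representation
            X_hat = H @ W2           # linear reconstruction
            err = X_hat - X          # gradient of squared error w.r.t. X_hat
            # Backpropagate through the decoder and the tanh encoder:
            grad_W2 = H.T @ err
            grad_W1 = X.T @ ((err @ W2.T) * (1 - H ** 2))
            W1 -= self.eta * grad_W1 / n
            W2 -= self.eta * grad_W2 / n
        self.W1, self.W2 = W1, W2
        return np.tanh(X @ W1)       # the learned hidden representation

# Usage: fit returns the hidden matrix directly, one row per input row.
X = np.array([[1., 0., 0., 1.],
              [1., 0., 0., 1.],
              [0., 1., 1., 0.]])
H = TinyAutoencoder(hidden_dim=2).fit(X)
```

The only point this sketch is meant to illustrate is the contract: `fit` hands back the hidden matrix, because that matrix is the object we actually want.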
[00:12:37] The other thing I would mention: let's see how well the autoencoder is performing. These are the raw distances in the giga5 count matrix for "finance"; it doesn't look great. If we run the autoencoder directly on the count matrix, it looks a little better, but it's still not excellent. If we think of this as part of a pipeline, where we've first done positive pointwise mutual information, then LSA at dimension 100, and then the autoencoding step, it starts to look like a really good and interesting semantic space, and I think that's pointing out the power of including the autoencoder in a larger pipeline of pre-processing on the count matrices.
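As a rough sketch of that pipeline (with hypothetical helper names; the course's vsm utilities may differ in detail), the PPMI and LSA steps that precede the autoencoder might look like:

```python
import numpy as np

def ppmi(X):
    """Positive PMI re-weighting of a count matrix (a standard sketch,
    not necessarily identical to the course implementation)."""
    total = X.sum()
    row = X.sum(axis=1, keepdims=True)
    col = X.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((X * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0     # zero counts give -inf; clamp them
    return np.maximum(pmi, 0.0)      # keep only the positive values

def lsa(X, k):
    """Truncated SVD: top-k left singular vectors, scaled by singular values."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

# Hypothetical tiny count matrix; in practice this would be giga5 etc.
counts = np.array([[10., 0., 2.],
                   [ 8., 1., 0.],
                   [ 0., 9., 7.]])
# counts -> PPMI -> LSA; the autoencoder would then be fit on `reduced`.
reduced = lsa(ppmi(counts), k=2)
```

The point of the ordering is that the autoencoder then receives well-scaled, lower-dimensional inputs rather than raw skewed counts.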
[00:13:12] Okay, let's turn to GloVe, or Global Vectors, for the final major unit of this screencast. Here's a brief overview. GloVe was introduced by Jeffrey Pennington, Richard Socher, and Chris Manning, a Stanford team, in 2014. Roughly speaking, the guiding idea is that we want to learn vectors for words such that the dot product of those vectors is proportional to the log probability of co-occurrence for those words, and I'll elaborate on that in a second.
[00:13:39] For doing computational work, we can rely on the implementation torch_glove.py, which is in the course repo. I'll mention that there's also a reference implementation in vsm.py; it's very slow, but it transparently implements the core GloVe algorithm, so it could be interesting to inspect. And then, if you're doing practical work with really large corpora and really large vocabularies, I would encourage you to use the GloVe team's own implementation: it's an outstanding software artifact that will allow you to learn lots of good representations quickly.
[00:14:08] And that kind of brings me to my last point: I just want to mention that the GloVe team was among the first teams in NLP to release not just data and code but pre-trained model parameters. Everyone does that these days, but it was rare at the time, and I think this team was really forward-thinking in seeing the value of releasing these centralized resources, and a lot of really interesting work happened with GloVe vectors as a foundation.
[00:14:35] All right, so let's think about the technical aspects of this model. This is the GloVe objective, and you're going to see pointwise mutual information kind of creep into this picture in an interesting way. This is equation 6 from the paper; it's kind of an idealized objective for the GloVe model, and it says what I said before: we have a row vector and a column vector, w_i and w_k. We're going to take their dot product, and the goal is to learn to have that dot product be proportional to the log of the probability of co-occurrence of word i and word k, where the probability of co-occurrence is defined in the way we defined it before, when we were talking about row normalization; it's just done in log space. This is the co-occurrence count, this is the sum of all the counts along that row, and basically, in log space, we're just dividing the first value by the second.
[00:15:21] So keep that in mind. Now, the reason they have only the row represented is that, in the paper, they're assuming that the rows and columns in the underlying count matrix are identical, so we don't need to include both. However, if we did allow that the row and context could be different, we would just elaborate equation 6 to have a slightly different denominator: we would have the product of the row sum and the column sum, take the log of that, and subtract it out, and that would be our goal for learning these dot products. But aha, this is where PMI sneaks in, because that simply is the PMI objective: we stated it as the log of the probability of co-occurrence divided by the product of the row and column probabilities, and here they've just stated exactly that calculation in log space. These are numerically equivalent by the equivalence of log of x over y with log of x minus log of y. So that's the deep connection I was highlighting between GloVe and PMI, and I think that's really interesting, because it shows that, fundamentally, we're testing a very similar hypothesis, using very similar notions of row and column context.
[00:16:27] Now, the GloVe team doesn't just stop there. The GloVe objective is actually much more interesting, as an elaboration of that core PMI idea, but it's worth having PMI in mind, because it's there throughout this presentation.
[00:16:40] In the paper, they state this as a kind of idealized objective: we're going to have the dot product, as I said before, and two bias terms, and the goal will be to make that equivalent to the log of the co-occurrence count. That has some undesirable properties from the point of view of machine learning, so they propose, in the end, a weighted version of that objective. You can see we still have the dot product of the row and column vectors and two bias terms; we subtract out the log of the co-occurrence count and take the square of that, and the result is weighted by f of the co-occurrence count.
[00:17:11] Here, f is a function that you can define by hand, and what they do in the paper is give it two parameters, x_max and alpha. Any count at or above x_max gets weight 1, to kind of flatten out all the really large counts. Everything below x_max we take as a proportion of x_max, with some exponential scaling specified by alpha; that's the function there. Typically, alpha is set to 0.75 and x_max to 100, but I encourage you to be critical in thinking about both those choices and how they relate to your data. I'll return to that in a second.
[00:17:46] So GloVe really has these three hyperparameters: the dimensionality of the learned representations; x_max, which is going to have this flattening effect; and alpha, which is going to scale the values. Here's an example of how x_max and alpha interact. If I start with the vector [199, 75, 10, 1], the function f, as we specified it, is going to flatten that out into [1, 0.81, 0.18, 0.03]. You should just be aware that that kind of flattening is happening.
[00:18:19] So, GloVe learning. It's kind of interesting to think analytically about how GloVe manages to learn interesting representations, and one thing that might be on your mind is the question: can it actually learn higher-order notions of co-occurrence? That's been the major selling point of this lecture. I gave that example involving "gnarly" and "wicked" with LSA; is GloVe going to be able to do that? We can just pose that as a question.
[00:18:42] So let's try that and see what happens. Here's how this works. The loss calculations for GloVe: this is a kind of simplified version of the derivative of the model, and we're going to show how GloVe manages to pull "gnarly" and "wicked" toward "awesome" in that little idealized space I used before. I'm going to leave out the bias terms for simplicity, but we could bring those in. Here's how this will proceed. What I've done, just for this idealized example, is begin in a GloVe space where "wicked" and "gnarly" are as far apart as I could make them, so as different as I could possibly make them. I've got "awesome" and "terrible", and "awesome" is kind of close to "gnarly" already. What you'll see is that, after just one iteration of the model, "wicked" and "gnarly" have been pulled toward "awesome", and that's just the kind of effect that we wanted. That's the sense in which GloVe can capture these higher-order notions of co-occurrence.
[00:19:33] Just in a little more detail, you might want to study this on your own, but the high-level overview of exactly how that learning happens proceeds as follows. We start from these counts up here, and the crucial assumption I'm making is that "wicked" and "gnarly" never co-occur, but they both occur a lot with "awesome". "Awesome" will be kind of the gravitational pull that makes "gnarly" and "wicked" look similar. Keep in mind that, because of that function f, by and large with GloVe we're dealing not with the raw counts but rather with the re-weighted matrix. That preserves the property that the two words never co-occurred; it just gives differently scaled values for the rest of the co-occurrence, or pseudo-co-occurrence, probabilities.
[00:20:12] All right, and here's what we're going to track. These are the "gnarly" and "wicked" vectors at iteration zero, and you can see I've made them as far apart as I could; they're kind of opposed to each other. We're going to see how they get pulled toward "awesome" in the context vectors. So this is that loss calculation; I have just plugged in all the values here, and you can see that we get this initial set of losses. After one iteration, we update the weight matrices and perform one more round of learning, and you can see that, for both of these words, the values here are getting larger, corresponding to their getting pulled closer and closer toward "awesome". You can see that happening graphically over here in these plots on the left, and as I do more iterations of the GloVe model, this effect is just going to strengthen, corresponding to "wicked" and "gnarly" getting pulled toward "awesome" and away from "terrible" as a result of these underlying counts. I take this as good evidence that GloVe, like the other methods we've discussed, is capable of capturing those higher-order notions of co-occurrence that we're so interested in pursuing with these methods.
[00:21:13] Let's close the loop also on those central questions: what is GloVe doing to our underlying spaces? With GloVe, because of the design of the model, we have to begin from word-by-word co-occurrence matrices of counts. So we begin with these raw count values, and GloVe is one-stop shopping: it's going to take us all the way to these reduced-dimensional representations. And boy, by the criteria we've set up, does GloVe do an outstanding job. This is the result of running GloVe at dimension 50, and you can see that the values are extremely well scaled, between negative two and two, and nicely normally distributed. This is an outstanding input to modern machine learning models, and I think this is probably a non-trivial aspect of why GloVe has been so successful as a kind of pre-trained basis for a lot of subsequent machine learning architectures.
[00:22:04] And then here's a little bit of code just showing you how you can work with these interfaces using our code base. The one thing I wanted to call out is that I'm trying to be careful: I have defined this function, percentage of non-zero values above x_max. You can set x_max and just feed in a matrix, and study what percentage of the values in that matrix are going to get flattened out to 1 as a result of the x_max you've chosen. This value really varies by the design of the matrix: if I feed in yelp5, only about 5% of the values get flattened out, but if I feed in yelp20, which is much denser and has much higher counts, 20% of the values get flattened out to 1. If this number gets too high, the matrix might become completely homogeneous, so we should really be aware of how the setting of x_max is affecting the kind of learning we could even be performing with GloVe. It might turn out that this is even more important than the number of iterations or the dimensionality of the representations that you learn.
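A minimal sketch of such a diagnostic (hypothetical name and details; the course repo's own helper may differ):

```python
import numpy as np

def pct_nonzero_above(X, xmax=100):
    """Percentage of the non-zero values in count matrix X that are at or
    above xmax, and hence get flattened to weight 1 by GloVe's f."""
    nonzero = X[X > 0]
    return 100.0 * (nonzero >= xmax).sum() / nonzero.size

# Hypothetical small count matrix: 8 non-zero cells, 3 of them >= 100.
X = np.array([[500.,  2., 50.],
              [120.,  0.,  7.],
              [  1., 99., 300.]])
pct_nonzero_above(X, xmax=100)   # -> 37.5
```

Running this before training is a cheap way to check whether your chosen x_max will flatten out so much of the matrix that GloVe has little signal left to fit.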
[00:23:04] Once you've made those choices, though, the interface is pretty clear, and the fit method, as with the autoencoder, returns the matrix of learned representations that we want to use for the current purposes. And then, finally, I've included a score method, and the score method is literally just testing how well the vectors you've learned correspond to the GloVe objective of having the dot products be proportional to the log of the co-occurrence probabilities. You can get a score for that, and we're doing pretty well here, let's say, for a large empirical matrix.
[00:23:38] Final section: let's just say a bit about visualization. This is a dimensionality reduction technique in the sense that the whole point is to flatten out a very high-dimensional space into possibly two or three dimensions. You have to recognize that, inevitably, this will involve a lot of compromises; it's just impossible to capture all the sources of variation in your underlying matrix in just a few dimensions. But nonetheless this can be productive. I think it's especially valuable if you pair it with some kind of hands-on qualitative exploration, using something like vsm.neighbors, to understand at a low level what your matrix encodes; the high-level visualizations can then be a kind of counterpart to that.
[00:24:18] There are many visualization techniques, and a lot of them are implemented in the scikit-learn manifold package, so I encourage you to use them. I'm going to show you some results from t-SNE, which stands for t-distributed Stochastic Neighbor Embedding. There are lots of user guides you can study for more details; let me just give you the high level. This is t-SNE run on our giga20 matrix, and I think it's typical of pretty good t-SNE output. What we're seeing here is some pockets of high density; those are areas of local coherence. Globally, we should be careful not to over-interpret this entire diagram, because as you rerun the model with different random seeds, you'll see that it gets kind of reoriented, and different parts end up close to different other parts. But what you can count on pretty reliably is that these local pockets of coherence correspond to coherent parts of the space you've defined, and if you zoom in on them, you can assess what the model has uncovered.
[00:25:15] So, for this giga20 matrix, for example, I think we see prominent clusters corresponding to things like cooking and conflict. If we do the same thing for our Yelp matrix, again, this looks pretty good in terms of having some substructure we could analyze, and if we zoom in, we do see clusters like positive terms and negative terms, corresponding to the evaluative setting of these Yelp reviews. So this is all very encouraging, and it suggests that the underlying spaces have some really interesting structure that might be useful for subsequent analysis.
[00:25:47] might be useful for subsequent analysis and here are some code snippets we have
[00:25:48] and here are some code snippets we have this simple wrapper around the scikit
[00:25:51] this simple wrapper around the scikit disney implementation that will allow
[00:25:53] disney implementation that will allow you to flexibly work with this stuff
[00:25:55] you to flexibly work with this stuff using the
[00:25:56] using the count matrices from our unit and i'm
[00:25:59] count matrices from our unit and i'm just mentioning here that it's pretty
[00:26:00] just mentioning here that it's pretty easy if you wanted to color code the
[00:26:02] easy if you wanted to color code the words in your vocabulary say according
[00:26:04] words in your vocabulary say according to a sentiment lexicon or some other
[00:26:06] to a sentiment lexicon or some other kind of lexicon that could be a way for
[00:26:08] kind of lexicon that could be a way for you to reveal exactly what structure
[00:26:10] you to reveal exactly what structure your model has been able to uncover with
[00:26:12] your model has been able to uncover with respect to those underlying labels and
[00:26:14] respect to those underlying labels and that can be useful
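The course wrapper itself isn't reproduced in this transcript, but the workflow it describes can be sketched directly with scikit-learn's TSNE. In this sketch the random count matrix and the tiny sentiment lexicon are made-up stand-ins for the giga20/yelp matrices and the lexicon-based color-coding idea mentioned above:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for a word-by-context count matrix: 12 "words", 20 columns.
vocab = ["word%d" % i for i in range(12)]
X = rng.poisson(3.0, size=(12, 20)).astype(float)

# perplexity must be smaller than the number of samples.
emb = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
print(emb.shape)  # (12, 2)

# Color-coding idea: look words up in a (hypothetical) sentiment lexicon
# and use the label to pick a plot color for each point.
lexicon = {"word0": "positive", "word1": "negative"}
colors = ["red" if lexicon.get(w) == "negative" else "gray" for w in vocab]
```

As the lecture warns, rerunning with a different `random_state` will reorient the global layout, so only the local cluster structure should be interpreted.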
Lecture 009
Retrofitting | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=2dVdZ4GPQIk
---
Transcript
[00:00:04] hello everyone welcome to part six in
[00:00:06] hello everyone welcome to part six in our series on distributed word
[00:00:07] our series on distributed word representations this can be considered
[00:00:09] representations this can be considered an optional part but it's on the
[00:00:10] an optional part but it's on the irresistibly cool idea of retrofitting
[00:00:13] irresistibly cool idea of retrofitting vectors to knowledge
[00:00:14] vectors to knowledge graphs here are central goals on the one
[00:00:17] graphs here are central goals on the one hand as we've seen distributional
[00:00:19] hand as we've seen distributional representations are powerful and also
[00:00:21] representations are powerful and also easy to obtain
[00:00:22] easy to obtain but they tend to reflect only relatively
[00:00:25] but they tend to reflect only relatively primitive semantic notions like
[00:00:26] primitive semantic notions like similarity or synonymy or connotation or
[00:00:29] similarity or synonymy or connotation or relatedness
[00:00:30] relatedness so that might feel limiting
[00:00:32] so that might feel limiting on the other hand structured resources
[00:00:34] on the other hand structured resources like knowledge graphs while sparse and
[00:00:36] like knowledge graphs while sparse and kind of hard to obtain
[00:00:39] kind of hard to obtain support really rich learning of very
[00:00:41] support really rich learning of very diverse semantic distinctions so the
[00:00:44] diverse semantic distinctions so the question naturally arises can we have
[00:00:46] question naturally arises can we have the best aspects of both of these and
[00:00:48] the best aspects of both of these and the inspiring answer given by
[00:00:50] the inspiring answer given by retrofitting is yes we can combine them
[00:00:53] retrofitting is yes we can combine them the original method for doing this is
[00:00:54] due to this lovely paper faruqui et al
[00:00:56] due to this lovely paper faroukian all 2015 which i'm going to be giving a
[00:00:59] 2015 which i'm going to be giving a brief summary of in this screencast
[00:01:02] brief summary of in this screencast so here is the retrofitting model it
[00:01:04] so here is the retrofitting model it consists of two sums and they constitute
[00:01:07] consists of two sums and they constitute kind of opposing forces
[00:01:09] kind of opposing forces imagine that we have an existing
[00:01:11] imagine that we have an existing embedding space like glove or some
[00:01:13] embedding space like glove or some embedding space that you built yourself
[00:01:14] embedding space that you built yourself that's q hat
[00:01:16] that's q hat and we're learning these q i's and q j's
[00:01:19] and we're learning these q i's and q j's the term on the left is basically saying
[00:01:21] the term on the left is basically saying remain faithful to those original
[00:01:23] remain faithful to those original vectors as you learn these new vectors q
[00:01:25] vectors as you learn these new vectors q i try not to be too dissimilar from
[00:01:27] i try not to be too dissimilar from where you started
[00:01:29] where you started that pressure is balanced against the
[00:01:31] that pressure is balanced against the pressure on the right which is saying
[00:01:33] pressure on the right which is saying make representations that look more like
[00:01:36] make representations that look more like the neighbors
[00:01:37] the neighbors for the current node in the knowledge
[00:01:40] for the current node in the knowledge graph which is you know
[00:01:41] graph which is you know defined by the set of
[00:01:43] defined by the set of relations e
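Written out, the two sums being described form the Faruqui et al. 2015 objective, where the q̂_i are the original vectors, E is the knowledge-graph edge set, and the α_i and β_ij are the balancing weights discussed just below:

```latex
\Psi(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \,\lVert q_i - \hat{q}_i \rVert^2
        + \sum_{(i,j) \in E} \beta_{ij} \,\lVert q_i - q_j \rVert^2 \Big]
```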
[00:01:45] relations e so two opposing pressures on the one
[00:01:47] so two opposing pressures on the one hand we're saying be faithful to the
[00:01:48] hand we're saying be faithful to the original on the other hand we're saying
[00:01:50] original on the other hand we're saying look more like your neighbors in the
[00:01:51] look more like your neighbors in the knowledge graph
[00:01:53] knowledge graph if we set alpha to one and beta to one
[00:01:55] if we set alpha to one and beta to one over the out degree for the node that
[00:01:57] over the out degree for the node that we're targeting
[00:01:58] we're targeting then we have basically balance these two
[00:02:01] then we have basically balance these two pressures
[00:02:02] pressures if we set alpha really large we'll
[00:02:04] if we set alpha really large we'll mostly want to stay faithful to the
[00:02:05] mostly want to stay faithful to the original vectors if we set beta
[00:02:08] original vectors if we set beta comparatively very large then we'll
[00:02:10] comparatively very large then we'll mostly want to look like the neighbors
[00:02:11] mostly want to look like the neighbors in the knowledge graph and we won't
[00:02:13] in the knowledge graph and we won't remain so tethered to the original
[00:02:15] remain so tethered to the original embedding space that we started with
[00:02:17] embedding space that we started with this illustration kind of nicely depicts
[00:02:19] this illustration kind of nicely depicts what happens in the model the gray
[00:02:21] what happens in the model the gray vectors are the original embedding space
[00:02:23] vectors are the original embedding space we have this knowledge graph that
[00:02:25] we have this knowledge graph that connects the associated nodes and
[00:02:27] connects the associated nodes and because they're connected in the
[00:02:29] because they're connected in the retrofitting space which is given in
[00:02:30] retrofitting space which is given in white these nodes are kind of pulled
[00:02:32] white these nodes are kind of pulled together and look more
[00:02:34] together and look more similar there's a bunch of code for
[00:02:36] similar there's a bunch of code for doing retrofitting in the course
[00:02:38] doing retrofitting in the course repository and i'll just show you a few
[00:02:39] repository and i'll just show you a few quick illustrations using that code
[00:02:41] quick illustrations using that code let's start with a simple case we have a
[00:02:43] let's start with a simple case we have a very simple knowledge graph where node
[00:02:45] very simple knowledge graph where node zero is connected to node one and node
[00:02:48] zero is connected to node one and node zero is connected to node two just
[00:02:50] zero is connected to node two just directionally
[00:02:52] directionally what happens when we run the
[00:02:52] what happens when we run the retrofitting model is that zero is
[00:02:54] retrofitting model is that zero is pulled equally close to one and to two
[00:02:57] pulled equally close to one and to two kind of equidistant between them and
[00:02:59] kind of equidistant between them and closer to both than it was in the
[00:03:00] closer to both than it was in the original embedding
[00:03:03] original embedding here's a
[00:03:03] here's a situation in which every
[00:03:05] situation in which every node is connected to every other node
[00:03:07] node is connected to every other node that's represented on the left here
[00:03:08] that's represented on the left here that's where we start and as a result of
[00:03:10] that's where we start and as a result of running the retrofitting model with
[00:03:12] running the retrofitting model with alpha and beta set in their default
[00:03:13] alpha and beta set in their default parameters what happens is that triangle
[00:03:15] parameters what happens is that triangle just gets smaller in a kind of fully
[00:03:17] just gets smaller in a kind of fully symmetric way as the
[00:03:19] symmetric way as the nodes become more similar to each other
[00:03:21] nodes become more similar to each other because of the graph structure
[00:03:24] because of the graph structure here's a kind of degenerate solution if
[00:03:25] here's a kind of degenerate solution if i set alpha to 0 i have no pressure to
[00:03:28] i set alpha to 0 i have no pressure to be faithful to the original vectors all
[00:03:30] be faithful to the original vectors all i care about is looking like my
[00:03:31] i care about is looking like my neighbors from the term on the right and
[00:03:33] neighbors from the term on the right and as a result all these vectors shrink
[00:03:35] as a result all these vectors shrink down to be the same point after the
[00:03:36] down to be the same point after the model is run for a few iterations
[00:03:39] model is run for a few iterations if instead i had done the opposite of
[00:03:41] if instead i had done the opposite of made alpha really large comparative to
[00:03:43] made alpha really large comparative to beta then basically nothing would have
[00:03:45] beta then basically nothing would have happened in the learning a triangle
[00:03:46] happened in the learning a triangle would remain its original size
[00:03:50] would remain its original size it's worth considering some extensions
[00:03:51] it's worth considering some extensions so i think one fundamental
[00:03:53] so i think one fundamental limitation of this model is that it is
[00:03:55] limitation of this model is that it is kind of assuming right there in its
[00:03:57] kind of assuming right there in its objective that to have an edge
[00:03:59] objective that to have an edge between nodes is to say that they are
[00:04:00] between nodes is to say that they are similar but of course the whole point
[00:04:02] similar but of course the whole point might be that your knowledge graph has
[00:04:04] might be that your knowledge graph has very rich edge relations corresponding
[00:04:06] very rich edge relations corresponding to different linguistic notions like
[00:04:08] to different linguistic notions like autonomy
[00:04:09] autonomy and we certainly wouldn't want to treat
[00:04:10] and we certainly wouldn't want to treat synonymy and autonomy as the same
[00:04:13] synonymy and autonomy as the same relation and just assume that it meant
[00:04:14] relation and just assume that it meant similarity in our model
[00:04:16] similarity in our model uh so there are various extensions i
[00:04:18] uh so there are various extensions i think the most general extension that
[00:04:20] think the most general extension that i've seen is from a paper that i was
[00:04:22] i've seen is from a paper that i was involved with led by ben lengerich which
[00:04:24] involved with led by ben lengerich which is called functional retrofitting which
[00:04:26] is called functional retrofitting which allows you to very flexibly learn
[00:04:28] allows you to very flexibly learn different retrofitting modes for
[00:04:30] different retrofitting modes for different edge semantics
[00:04:33] different edge semantics and once you start down that road you
[00:04:34] and once you start down that road you have a really natural connection with
[00:04:36] have a really natural connection with the literature on graph embedding that
[00:04:37] the literature on graph embedding that is learning distributional
[00:04:38] is learning distributional representations for nodes in knowledge
[00:04:41] representations for nodes in knowledge graphs and this paper led by will
[00:04:43] graphs and this paper led by will hamilton is an outstanding overview of
[00:04:45] hamilton is an outstanding overview of methods in that space and then you have
[00:04:47] methods in that space and then you have this nice synergy between nlp methods
[00:04:49] this nice synergy between nlp methods and methods that are more associated
[00:04:51] and methods that are more associated with work on knowledge graphs and social
[00:04:53] with work on knowledge graphs and social networks and so forth
[00:04:56] networks and so forth and finally here are some code snippets
[00:04:57] and finally here are some code snippets just showing some simple illustrations
[00:04:59] just showing some simple illustrations of the sort that i showed you
[00:05:01] of the sort that i showed you earlier in the screencast i would just
[00:05:03] earlier in the screencast i would just mention at the end here if you would
[00:05:04] mention at the end here if you would like to apply these methods to wordnet
[00:05:07] like to apply these methods to wordnet which could be a powerful ingredient for
[00:05:09] which could be a powerful ingredient for the first assignment and bake off i
[00:05:11] the first assignment and bake off i would encourage you to check out this
[00:05:12] would encourage you to check out this notebook vsm_03_retrofitting because it
[00:05:15] notebook vsm_03_retrofitting because it walks through all the steps for doing
[00:05:17] walks through all the steps for doing that
Lecture 010
Static Representations | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=K7wM6FUV0ds
---
Transcript
[00:00:04] hello everyone welcome to our final
[00:00:06] hello everyone welcome to our final screencast in our unit on distributed
[00:00:08] screencast in our unit on distributed word representations our topic is going
[00:00:10] word representations our topic is going to be deriving static representations
[00:00:12] to be deriving static representations from contextual models that might sound
[00:00:14] from contextual models that might sound awfully specific but as you'll see i
[00:00:15] awfully specific but as you'll see i think this could be really empowering
[00:00:17] think this could be really empowering for you as you work on your original
[00:00:19] for you as you work on your original system for the assignment and the
[00:00:21] system for the assignment and the associated bake off
[00:00:23] associated bake off so let's dive in a question on your
[00:00:25] so let's dive in a question on your minds might be how can i use bert or
[00:00:27] minds might be how can i use bert or related models like roberta or xlnet
[00:00:29] related models like roberta or xlnet or electra in the context of deriving
[00:00:32] or electra in the context of deriving good static representations of words you
[00:00:34] good static representations of words you probably have heard about these models
[00:00:36] probably have heard about these models and heard that they lift all boats and
[00:00:37] and heard that they lift all boats and the question is how can you take
[00:00:39] the question is how can you take advantage of those benefits
[00:00:42] advantage of those benefits but there's a tension here we've been
[00:00:43] but there's a tension here we've been developing static representations but
[00:00:45] developing static representations but these models like bert are
[00:00:47] these models like bert are designed to deliver contextual
[00:00:49] designed to deliver contextual representations of words and i'll return
[00:00:51] representations of words and i'll return to what that means in a second but that
[00:00:53] to what that means in a second but that is the central tension between static
[00:00:55] is the central tension between static and contextual so the question is are
[00:00:57] and contextual so the question is are there good methods for deriving static
[00:00:59] there good methods for deriving static representations from the contextual ones
[00:01:02] representations from the contextual ones that these models offer
[00:01:04] that these models offer and the answer from bommasani et al is
[00:01:06] and the answer from bommasani et al is yes there are effective methods for
[00:01:08] yes there are effective methods for doing this and it's those methods that
[00:01:10] doing this and it's those methods that will be the focus of this screencast i
[00:01:12] will be the focus of this screencast i really want to do two things though for
[00:01:13] really want to do two things though for this lecture i would like to get hands
[00:01:15] this lecture i would like to get hands on a little bit with a high level
[00:01:18] on a little bit with a high level overview of models like bert we're going
[00:01:20] overview of models like bert we're going to look later in the quarter in much
[00:01:22] to look later in the quarter in much more detail at how these models work so
[00:01:24] more detail at how these models work so for now we're just going to treat them
[00:01:25] for now we're just going to treat them as kind of black boxes just like you
[00:01:27] as kind of black boxes just like you might look up a glove representation of
[00:01:29] might look up a glove representation of a word and just get back that
[00:01:31] a word and just get back that representation and use it so too here we
[00:01:33] representation and use it so too here we can think of these models as devices for
[00:01:35] can think of these models as devices for feeding in sequences and getting back
[00:01:37] feeding in sequences and getting back lots and lots of representations that we
[00:01:39] lots and lots of representations that we might use and later in the quarter we'll
[00:01:42] might use and later in the quarter we'll come to a deeper understanding of
[00:01:43] come to a deeper understanding of precisely where those representations
[00:01:45] precisely where those representations come from
[00:01:47] come from and in addition of course i want to give
[00:01:48] and in addition of course i want to give you an overview of these exciting
[00:01:50] you an overview of these exciting methods from bommasani et al in the
[00:01:52] methods from bommasani et al in the hopes that they are useful to you in
[00:01:54] hopes that they are useful to you in developing your original system
[00:01:57] developing your original system so let's start with the structure of
[00:01:59] so let's start with the structure of bert
[00:02:00] bert bert processes sequences here i've got a
[00:02:02] bert processes sequences here i've got a sequence the class token the day broke
[00:02:04] sequence the class token the day broke sep class and sep are designated tokens
[00:02:07] sep class and sep are designated tokens the class token typically starts the
[00:02:09] the class token typically starts the sequence and then sep ends the sequence
[00:02:11] sequence and then sep ends the sequence and can be also used internally in
[00:02:12] and can be also used internally in sequences to mark boundaries within the
[00:02:14] sequences to mark boundaries within the sequence that you're processing
[00:02:16] sequence that you're processing but the fundamental thing is that we
[00:02:18] but the fundamental thing is that we have this the short sentence the day
[00:02:20] have this the short sentence the day broke
[00:02:20] broke bert processes those into an embedding
[00:02:23] bert processes those into an embedding layer and then a lot of additional
[00:02:25] layer and then a lot of additional layers you know here i've depicted 4 but
[00:02:27] layers you know here i've depicted 4 but it could be 12 or even 24 layers
[00:02:30] it could be 12 or even 24 layers what we're seeing here the
[00:02:32] what we're seeing here the rectangles represent vectors they are
[00:02:34] rectangles represent vectors they are the outputs of each layer in the network
[00:02:38] the outputs of each layer in the network a lot of computation goes into computing
[00:02:41] a lot of computation goes into computing those output vector representations at
[00:02:43] those output vector representations at each layer we're going to set that
[00:02:44] each layer we're going to set that computation aside for now so that we can
[00:02:46] computation aside for now so that we can just think of this as a grid of vector
[00:02:49] just think of this as a grid of vector representations
[00:02:51] representations here's the crucial thing that makes bert
[00:02:52] here's the crucial thing that makes bert contextual for different sequences that
[00:02:55] contextual for different sequences that we process we will get very different
[00:02:57] we process we will get very different representations in fact
[00:03:00] representations in fact individual tokens
[00:03:02] individual tokens occurring in different sequences we'll
[00:03:03] occurring in different sequences we'll get very different representations i've
[00:03:05] get very different representations i've tried to signal that with the colors
[00:03:07] tried to signal that with the colors here so these two sequences both
[00:03:09] here so these two sequences both contain the word the and the word broke
[00:03:12] contain the word the and the word broke but in virtue of the fact that they have
[00:03:13] but in virtue of the fact that they have different surrounding material and
[00:03:15] different surrounding material and different positions in the sequence
[00:03:17] different positions in the sequence almost all of the representations will
[00:03:19] almost all of the representations will be different um
[00:03:20] be different um the class and sep tokens might have the
[00:03:22] the class and sep tokens might have the same embedding but through all of these
[00:03:24] same embedding but through all of these layers because of the way all these
[00:03:26] layers because of the way all these tokens are going to interact with each
[00:03:27] tokens are going to interact with each other when we derive the representations
[00:03:29] other when we derive the representations everything will be different we do not
[00:03:31] everything will be different we do not get a static representation out of these
[00:03:33] get a static representation out of these models and i've specified that even in
[00:03:36] models and i've specified that even in the embedding layer if the positions of
[00:03:38] the embedding layer if the positions of the words vary one and the same token
[00:03:40] the words vary one and the same token will get different representations the
[00:03:42] will get different representations the reason for that is that this embedding
[00:03:44] reason for that is that this embedding layer is actually hiding two components
[00:03:46] layer is actually hiding two components we do at the very center of this
[00:03:49] we do at the very center of this model have a fixed static embedding
[00:03:51] model have a fixed static embedding where we can look up individual word
[00:03:53] where we can look up individual word sequences
[00:03:54] sequences but for this thing that i've called the
[00:03:55] but for this thing that i've called the embedding layer that static
[00:03:57] embedding layer that static representation is combined with a
[00:03:58] representation is combined with a separate positional encoding from a
[00:04:00] separate positional encoding from a separate embedding space and that
[00:04:02] separate embedding space and that delivers what i've called the embedding
[00:04:04] delivers what i've called the embedding layer here and that means that even at
[00:04:06] layer here and that means that even at this first layer because for example the
[00:04:09] this first layer because for example the occurs in different points in the
[00:04:10] occurs in different points in the sequence it will get different
[00:04:12] sequence it will get different representations even in the embedding
[00:04:14] representations even in the embedding space
[00:04:15] space and from there of course as we travel
[00:04:16] and from there of course as we travel through these layers we expect even more
[00:04:18] through these layers we expect even more things to change about the
[00:04:20] things to change about the representations
[00:04:23] a second important preliminary is to
[00:04:26] a second important preliminary is to give some attention to how bert and
[00:04:28] give some attention to how bert and models like it tokenize sequences
[00:04:30] models like it tokenize sequences and here i've given you a bit of code in
[00:04:32] and here i've given you a bit of code in the hopes that you can get hands-on and
[00:04:33] the hopes that you can get hands-on and get a feel for how these tokenizers
[00:04:35] get a feel for how these tokenizers behave i'm taking advantage of the
[00:04:37] behave i'm taking advantage of the hugging face library i have loaded a
[00:04:39] hugging face library i have loaded a bert tokenizer and i load that from a
[00:04:41] bert tokenizer and i load that from a pre-trained model
[00:04:43] pre-trained model in cell 3 you can see that i've called
[00:04:45] in cell 3 you can see that i've called the tokenize function on the sentence
[00:04:46] the tokenize function on the sentence this isn't too surprising and the result
[00:04:48] this isn't too surprising and the result is a pretty normal looking sequence of
[00:04:50] is a pretty normal looking sequence of tokens you see some punctuation has been
[00:04:53] tokens you see some punctuation has been separated off but you also see a lot of
[00:04:54] separated off but you also see a lot of words when you get down to cell 4 though
[00:04:57] words when you get down to cell 4 though for the sequence encode me this is a bit
[00:05:00] for the sequence encode me this is a bit surprising the word encode in the input
[00:05:02] surprising the word encode in the input has been broken apart into two sub-word
[00:05:04] has been broken apart into two sub-word tokens
[00:05:05] tokens en and then code with these boundary
[00:05:07] en and then code with these boundary markers on it bert has broken that apart
[00:05:10] markers on it bert has broken that apart into two subword sequences and if i feed
[00:05:13] into two subword sequences and if i feed in a sequence that has a really
[00:05:15] in a sequence that has a really unfamiliar set of
[00:05:17] unfamiliar set of tokens in it it will do a lot of
[00:05:19] tokens in it it will do a lot of breaking apart of that sequence as you
[00:05:21] breaking apart of that sequence as you can see in cell five for the input
[00:05:22] can see in cell five for the input snuffleupagus where a lot of these
[00:05:25] snuffleupagus where a lot of these pieces have come out this is the
[00:05:27] pieces have come out this is the essential piece for why bert is able to
[00:05:29] essential piece for why bert is able to have such a small vocabulary only about
[00:05:31] have such a small vocabulary only about 30 000 words compare that with the 400
[00:05:33] 30 000 words compare that with the 400 000 words that are in the glove space
[00:05:36] 000 words that are in the glove space the reason it can get away with that is
[00:05:37] the reason it can get away with that is that it does a lot of breaking apart of
[00:05:39] that it does a lot of breaking apart of words into sub word tokens
[00:05:42] words into sub word tokens and of course because the model is
[00:05:43] and of course because the model is contextual we have an expectation that
[00:05:46] contextual we have an expectation that for example when it encounters code here
[00:05:48] for example when it encounters code here in the context of en at some conceptual
[00:05:51] in the context of en at some conceptual level the model will recognize that it has
[00:05:52] level the model will recognize that it has processed the word encode even though it
[00:05:54] processed the word encode even though it was two tokens underlyingly
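The behavior in cells 4 and 5 comes from WordPiece tokenization. Here is a toy version of the greedy longest-match-first algorithm, with a made-up five-entry vocabulary standing in for BERT's roughly 30,000-entry one; continuation pieces carry the same `##` boundary marker the lecture shows:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization, WordPiece-style.

    `vocab` is a toy stand-in for BERT's real ~30k-entry vocabulary.
    Pieces that continue a word are prefixed with '##', as in BERT.
    """
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate  # longest matching piece from `start`
                break
            end -= 1
        if piece is None:          # no subword matches at all
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"the", "day", "broke", "en", "##code"}
print(wordpiece_tokenize("encode", vocab))  # ['en', '##code']
```

Because "encode" is not in this vocabulary but "en" and "##code" are, the word is split exactly as in the transcript's cell 4; a string none of whose pieces match falls back to the unknown token.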
[00:05:59] let's flesh this out a bit by looking at
[00:06:00] let's flesh this out a bit by looking at the full interface for dealing with
[00:06:02] the full interface for dealing with these models i'm again taking advantage
[00:06:03] these models i'm again taking advantage of hugging face i'm going to load a bert
[00:06:05] of hugging face i'm going to load a bert model and a bert tokenizer it's
[00:06:08] model and a bert tokenizer it's important that they use the same
[00:06:09] important that they use the same pre-trained weights which hugging face
[00:06:11] pre-trained weights which hugging face will download for you from the web and
[00:06:13] will download for you from the web and so those are tied in and i set up the
[00:06:14] so those are tied in and i set up the tokenizer and the model if i call
[00:06:16] tokenizer and the model if i call tokenizer.encode on a sequence it would
[00:06:19] tokenizer.encode on a sequence it would give me back a list of indices and those
[00:06:21] give me back a list of indices and those indices will be used as a lookup to
[00:06:23] indices will be used as a lookup to start the process of computing this
[00:06:24] start the process of computing this entire sequence in cell 6 i actually use
[00:06:27] entire sequence in cell 6 i actually use the model to derive that grid of
[00:06:29] the model to derive that grid of representations hugging face is giving us
[00:06:31] representations hugging face is giving us an object that has a lot of attributes
[00:06:33] an object that has a lot of attributes if i call
[00:06:34] if i call output hidden states equals true when i
[00:06:36] output hidden states equals true when i use the model here then i can call dot
[00:06:39] use the model here then i can call dot hidden states and get that full grid of
[00:06:41] hidden states and get that full grid of representations that i showed you before
[00:06:43] representations that i showed you before so this is a sequence with 13 layers
[00:06:45] so this is a sequence with 13 layers that's one embedding layer plus 12 of
[00:06:47] that's one embedding layer plus 12 of the additional layers
[00:06:49] the additional layers and if i key into one of like the first
[00:06:51] and if i key into one of like the first layer that will be the embedding you can
[00:06:53] layer that will be the embedding you can see that its shape is one by five by 768
[00:06:56] see that its shape is one by five by 768 this is a batch of one example it has
[00:06:59] this is a batch of one example it has five tokens the three that we can see
[00:07:01] five tokens the three that we can see here plus the class and set tokens and
[00:07:04] here plus the class and set tokens and each one of those tokens in the
[00:07:06] each one of those tokens in the embedding layer is represented by a
[00:07:08] embedding layer is represented by a vector of dimension 768
[00:07:10] vector of dimension 768 and that remains consistent through all
[00:07:12] and that remains consistent through all the layers in the model so if i went to
[00:07:13] the layers in the model so if i went to the final output states i again just
[00:07:15] the final output states i again just index into dot hidden states here the
[00:07:18] index into dot hidden states here the shape is the same and that will be
[00:07:20] shape is the same and that will be consistent for all the layers
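The interface just walked through can be sketched roughly as follows. This is a hedged sketch assuming the Hugging Face transformers library, with "bert-base-uncased" standing in for whichever pretrained weights the course notebook actually loads; the `describe_grid` helper is mine, added just to make the shape bookkeeping explicit.

```python
import numpy as np

def describe_grid(hidden_states):
    """Summarize the grid of representations as
    (n_layers, batch_size, n_tokens, dim)."""
    return (len(hidden_states),) + tuple(hidden_states[0].shape)

def demo():
    # Requires `transformers` and a network connection to download the
    # pretrained weights, so it is defined but not run here.
    import torch
    from transformers import BertModel, BertTokenizer
    weights = "bert-base-uncased"  # tokenizer and model must share weights
    tokenizer = BertTokenizer.from_pretrained(weights)
    model = BertModel.from_pretrained(weights)
    ids = tokenizer.encode("encode me", return_tensors="pt")
    with torch.no_grad():
        outputs = model(ids, output_hidden_states=True)
    # 13 layers = 1 embedding layer + 12 transformer layers:
    print(describe_grid(outputs.hidden_states))

# Shape bookkeeping with stand-in arrays: 13 layers, a batch of 1,
# 5 tokens ([CLS] + 3 visible tokens + [SEP]), dimension 768.
fake_grid = tuple(np.zeros((1, 5, 768)) for _ in range(13))
print(describe_grid(fake_grid))   # (13, 1, 5, 768)
```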
[00:07:22] consistent for all the layers those are the preliminaries and let's
[00:07:24] those are the preliminaries and let's think about how we could derive some
[00:07:25] think about how we could derive some static representations the first
[00:07:27] static representations the first approach that bom asani and all consider
[00:07:30] approach that bom asani and all consider is what they call the decontextualized
[00:07:32] is what they call the decontextualized approach and this is like the simplest
[00:07:34] approach and this is like the simplest thing possible we are just going to
[00:07:36] thing possible we are just going to process individual words as though they
[00:07:38] process individual words as though they were sequences and see if burke can make
[00:07:40] were sequences and see if burke can make any sense of them so we would start by
[00:07:42] any sense of them so we would start by feeding in a word like kitten and we
[00:07:44] feeding in a word like kitten and we would allow the model to break it apart
[00:07:46] would allow the model to break it apart into its sub-word pieces and then we
[00:07:48] into its sub-word pieces and then we simply process that with the model we
[00:07:50] simply process that with the model we get a full grid of representations
[00:07:53] get a full grid of representations now because we potentially have sub-word
[00:07:55] now because we potentially have sub-word tokens here we need some cooling
[00:07:56] tokens here we need some cooling function so what we could do is just
[00:07:58] function so what we could do is just pool using something like mean to get a
[00:08:00] pool using something like mean to get a fixed static representation of dimension
[00:08:03] fixed static representation of dimension 768
[00:08:05] 768 for this individual word
[00:08:07] for this individual word and of course we don't have to use the
[00:08:08] and of course we don't have to use the final layer we could use lower down
[00:08:10] final layer we could use lower down layers and we don't have to use mean as
[00:08:12] layers and we don't have to use mean as the pouring function you could consider
[00:08:14] the pouring function you could consider something like max or min or even last
[00:08:16] something like max or min or even last which would just disregard all of the
[00:08:18] which would just disregard all of the vector representations except for the
[00:08:20] vector representations except for the one corresponding to the final sub word
[00:08:23] one corresponding to the final sub word token
[00:08:25] This is really simple. It's potentially unnatural, though: BERT is a contextual model, it was trained on full sequences, and especially if we leave off the [CLS] and [SEP] tokens, we might be feeding in sequences that BERT has really never seen before, so it might be unknown how it's going to behave with these unusual inputs. Nonetheless, we could repeat this process for all the words in our vocabulary and derive a static embedding space, and maybe it has some promise.
[00:08:52] some promise however to address this potential
[00:08:54] however to address this potential unnaturalness and potentially take more
[00:08:56] unnaturalness and potentially take more advantage of the the virtues that burt
[00:08:59] advantage of the the virtues that burt and related models have bomasani at all
[00:09:01] and related models have bomasani at all consider also the aggregated approach so
[00:09:04] consider also the aggregated approach so in this approach you process lots of
[00:09:06] in this approach you process lots of corpus examples that contain your target
[00:09:08] corpus examples that contain your target word here i've got a sort of glimpse of
[00:09:10] word here i've got a sort of glimpse of a corpus our target word is kitten of
[00:09:13] a corpus our target word is kitten of course we allow it to be broken apart
[00:09:15] course we allow it to be broken apart into sub-word tokens the full sequences
[00:09:17] into sub-word tokens the full sequences in these examples would also be broken
[00:09:19] in these examples would also be broken apart into sub-word tokens
[00:09:21] apart into sub-word tokens but the important thing is that our
[00:09:22] but the important thing is that our target word might have sub word tokens
[00:09:24] target word might have sub word tokens we pool those as we did before for the
[00:09:26] we pool those as we did before for the contextualized approach and we're also
[00:09:29] contextualized approach and we're also going to pool across all of the
[00:09:30] going to pool across all of the different context examples that we
[00:09:32] different context examples that we processed
[00:09:33] processed and the result of that should be a bunch
[00:09:36] and the result of that should be a bunch of natural inputs to the model but in
[00:09:38] of natural inputs to the model but in the end we derive a static
[00:09:39] the end we derive a static representation that is some kind of
[00:09:41] representation that is some kind of average across all of the examples that
[00:09:43] average across all of the examples that we processed
[00:09:45] It seems very natural; it's taking advantage of what BERT is best at. I will warn you, though, that this is very computationally demanding: we're going to want to process lots of examples, and BERT requires lots of resources because it develops really large representations, as we've seen. But it might be worth it.
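The aggregated approach can be sketched as two nested pooling steps, again with hypothetical low-dimensional vectors standing in for BERT's: pool the subword vectors within each corpus occurrence of the target word, then pool across occurrences.

```python
import numpy as np

_POOLS = {
    "mean": lambda m: np.mean(m, axis=0),
    "max":  lambda m: np.max(m, axis=0),
    "min":  lambda m: np.min(m, axis=0),
    "last": lambda m: np.asarray(m)[-1],
}

def aggregated_embedding(occurrences, f="mean", g="mean"):
    """f pools subword vectors within one occurrence of the target word;
    g pools the per-occurrence vectors across all corpus occurrences."""
    per_occurrence = [_POOLS[f](np.asarray(occ, dtype=float))
                      for occ in occurrences]
    return _POOLS[g](np.vstack(per_occurrence))

# Two corpus sentences containing "kitten", each yielding two hypothetical
# subword vectors (dimension 2 here, 768 in real BERT):
occ1 = [[1.0, 3.0], [3.0, 1.0]]   # mean-pooled -> [2., 2.]
occ2 = [[4.0, 0.0], [0.0, 4.0]]   # mean-pooled -> [2., 2.]
print(aggregated_embedding([occ1, occ2]))   # [2. 2.]
```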
[00:10:02] Now, Bommasani et al. offer lots of results that help us understand these approaches and how they perform. Let me give you a glimpse of them as a kind of summary. What we've got here is results for the SimVerb-3500 dataset, a word similarity dataset that's very similar to the ones that you'll be working with on the homework and bakeoff. Our metric is Spearman correlation, and higher is better; that's along the y-axis. Along the x-axis I have the layer in the model that we're keying into. And then of course what we should watch is that we have two pooling functions, f and g: f is subword pooling, and g is context pooling for the models that have it; for the decontextualized approach there is no context pooling.
[00:10:41] Now, we have a very clear result across these results, and I think across all the results in the paper: lower layers are better. Lower layers are giving us good, high-fidelity representations of individual words; as we travel higher in the model, we seem to lose a lot of that word-level discrimination.
[00:10:58] In addition, your best choice is to do mean pooling for the context, and subword pooling seems to matter less, right? All of these lines here are for the context pooling model with mean as your context pooling function. The very best choice, though, I think consistently, is mean for both of these pooling functions. You can see that in this result, and I think that's consistent across all the results in the paper. But the overall takeaway here is that, as expected, the aggregated approach is better than the decontextualized approach.
[00:11:30] However, if you don't have the computational budget for that, then mean pooling in the decontextualized approach looks really competitive. That's not so evident in this plot, but if you look across all the results in the paper, I think that's a pretty clear finding, so that would be a good choice. And one thing is clear: that simple approach is better than some kinds of context pooling where you choose the wrong context pooling function, like min or max. Despite all the effort that went into that set of results, those are all kind of down here, entangled with the decontextualized approach. But mean as the pooling function there is really an outstanding choice, as you can see from these results.
Lecture 011
Homework 2: Sentiment Analysis | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=e5zRhwc-SqI
---
Transcript
[00:00:04] Hello everyone. This video is an overview of homework 2, which is on supervised sentiment analysis, and I would actually think of it as an experiment in cross-domain sentiment analysis. Let's just walk through this notebook, and I'll try to give you a feel for the problem and our thinking behind it.
[00:00:19] So the plot is the usual one: we're going to introduce a task and associated data, and help you with setting up some baselines and doing error analysis. That's all a lead-in to these homework questions, which are meant to help you explore the data in meaningful ways and also set up some additional baselines that might inform, ultimately, your original system, which you then enter into the bakeoff.
[00:00:44] you then enter into the bank off fast overview we're doing ternary that
[00:00:46] fast overview we're doing ternary that is positive negative neutral sentiment
[00:00:47] is positive negative neutral sentiment analysis we're going to be dealing with
[00:00:49] analysis we're going to be dealing with two data sets the stanford sentiment
[00:00:51] two data sets the stanford sentiment tree bank and a brand new assessment
[00:00:54] tree bank and a brand new assessment data set that is a dev test split of
[00:00:56] data set that is a dev test split of sentences drawn from restaurant reviews
[00:00:59] sentences drawn from restaurant reviews we're giving you for training the sst
[00:01:01] we're giving you for training the sst train set and asking you to evaluate on
[00:01:03] train set and asking you to evaluate on the sst dev and test uh and also on this
[00:01:06] the sst dev and test uh and also on this new dev test split of restaurant reviews
[00:01:08] new dev test split of restaurant reviews and that's the cross domain aspect of
[00:01:10] and that's the cross domain aspect of this
[00:01:10] You are completely unconstrained about what you do in terms of bringing in new data for training and doing things in development. The one constraint that we really need to firmly impose here is that, of course, the SST-3 test set is a public test set. It's actually included in your data distribution so that other notebooks can run some baseline systems and compare against the literature, but that test set is completely off limits during development. It's really important that you do all your development just on the dev splits and completely ignore the fact that you have a labeled version of the SST-3 test set.
[00:01:46] And as I say here, much of the scientific integrity of our field depends on people adhering to this honor code, that is, doing no development on what is test data, because test data is our only chance to get a really clear look at how our systems are generalizing to new examples and new experiences. So please keep that in mind.
[00:02:06] The rationale behind this assignment, of course, is to help you get familiar or re-familiarize yourself with core concepts in supervised sentiment analysis and the associated life cycle of developing systems in this space, which involves writing feature functions, trying out model architectures, hyperparameter tuning, and also possibly doing some comparisons of models using statistical tests, to try to get a sense for how much meaningful progress you're making as you iterate on your system design. We're also trying to push, here in this notebook, that error analysis can be a powerful way to help you find problems in your system and then address them.
[00:02:45] One more methodological note: as you'll see from this notebook, I'm encouraging you to use functionality in this sst.py module, which is part of our course code distribution. You're not required to use it. Really, the only contract we need to have with you is that your original system have a predict_one method that maps strings to predictions very directly. But other than that, you're unconstrained. I do want to say, though, that I think sst.experiment is a flexible framework for doing lots of experiments without writing a lot of boilerplate code, so it should, if you get used to it, be a powerful basis for doing a lot of experiments, which I think is crucial to success here.
[00:03:26] We do some setup by loading a bunch of libraries, and we get a pointer to the data, and that brings us to the training set here. This is going to load in a pandas DataFrame; you can see that we've got about 8,500 examples. Do review the notebook covering this dataset; there are a bunch of other options for this train reader. In particular, you can decide whether to keep or remove duplicates, and you can also decide whether you want to train on the labeled subtrees that the SST contains, which vastly increases the amount of training data you have. That will be very compute-intensive, but it could be very productive. This is also a point to say
[00:04:02] again, that you are free to bring in other training sets. In fact, it might be very productive to bring in the DynaSent dataset, which is covered in a screencast for this unit. That dataset has a lot of sentences from restaurant reviews, and it was also labeled in exactly the same way, using the same protocols as were used for creating the development set of restaurant reviews for this unit, which is importantly different, I think, from the protocols that were used for the SST. So bringing in more training data could help you not only with the cross-domain problem but also with the kind of label shift that has probably happened between SST and these new development datasets that we're introducing.
[00:04:45] And that does bring me to the dev sets here. We have SST dev, which is also a pandas DataFrame, as well as this new bakeoff data of restaurant reviews, also a pandas DataFrame. Here you can see just three randomly chosen examples: the example id, the text of the sentence, and the label, which is either positive, negative, or neutral; and is_subtree is always zero, because these assessment datasets have only full examples, no labeled subtrees the way the SST train set does.
[00:05:13] And we can get a look at the label distribution; I'll just mention that the label distribution for the test set is very similar. It has one noteworthy property, which is that it's highly skewed: a lot of neutral examples, which I think is realistic for actual data, even review data, and then there is a skew toward positivity, with negative the smallest. This kind of label imbalance, I think, is severe enough that it might impact optimization choices that you make.
[00:05:37] This next section here just sets up a softmax baseline. We use a unigrams feature function; this couldn't be simpler: we're just splitting on whitespace and counting the resulting tokens. Then we have this very thin wrapper around logistic regression, and those are the two pieces that come together to run, here, an sst.experiment.
[00:05:55] here an sst.experiment a lot of information about your
[00:05:56] a lot of information about your experiment is stored in this variable
[00:05:58] experiment is stored in this variable and what's being printed out is just a
[00:06:00] and what's being printed out is just a summary classification report we have
[00:06:02] summary classification report we have sst dev and bake off dev as our two
[00:06:05] sst dev and bake off dev as our two assessment data frames the results for
[00:06:07] assessment data frames the results for each one of those are printed separately
[00:06:08] each one of those are printed separately here and then our bake off metric is
[00:06:11] here and then our bake off metric is this mean of the macro average f1 scores
[00:06:14] this mean of the macro average f1 scores across the two data sets exactly these
[00:06:16] across the two data sets exactly these two but of course at the bake off time
[00:06:18] two but of course at the bake off time we'll be using the test sets
[00:06:21] we'll be using the test sets so you might be guided in sort of hill
[00:06:22] so you might be guided in sort of hill climb on this number here while also
[00:06:24] climb on this number here while also attending to these two numbers which are
[00:06:26] attending to these two numbers which are contributing to it so for example you
[00:06:28] contributing to it so for example you can see here that as expected since we
[00:06:31] can see here that as expected since we trained on the sst we're doing better on
[00:06:33] trained on the sst we're doing better on the sst dev by far than we are on the
[00:06:37] the sst dev by far than we are on the new bake off data
[00:06:41] the next section here just shows you
[00:06:42] the next section here just shows you another kind of baseline and this is a
[00:06:44] another kind of baseline and this is a deep learning baseline and rnn
[00:06:46] deep learning baseline and rnn classifier our feature function is very
[00:06:49] classifier our feature function is very simple here because we just split on
[00:06:50] simple here, because we just split on whitespace and rely on the RNN itself to do all the featurization, which is an embedding lookup and then processing of the example. So that's very simple, and the wrapper is also very simple: here we're going to set the vocabulary for the model, with a min count of two, which seems productive, and then finally run the experiment.
[00:07:08] The one thing that's important here, the one change, is that you set vectorize=False. Unlike in the previous baseline, we are not using scikit-learn DictVectorizers to process count dictionaries to get us from features to feature matrices. Here we are feeding our examples directly through into the model: our model expects token streams with no messing about, and vectorize=False will give them a pass-through all the way to the model. So remember that, otherwise this will all fall apart; but other than that, it's exactly the same setup.
[00:07:37] Let's run it. Here I've got some timing information. We're going to fast-forward through this because it takes a little bit of time, but you'll see a report. I'm currently on just a very old CPU-based Mac, so this will give you a sense for the cost of development for deep learning in this space.
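The vectorize flag controls whether examples are count-vectorized or handed to the model as raw token streams. A minimal sketch of that pass-through idea (this toy `run_experiment` is illustrative only, not the actual `sst.experiment` implementation from the course code):

```python
from collections import Counter

def phi(text):
    # Whitespace tokenization only; the RNN does the featurization itself.
    return text.split()

def run_experiment(examples, featurize, vectorize=True):
    feats = [featurize(t) for t in examples]
    if vectorize:
        # Count-dict path, as used by the linear baselines (a DictVectorizer
        # would then turn these dicts into a feature matrix).
        return [Counter(f) for f in feats]
    # vectorize=False: token streams pass through untouched to the model.
    return feats

tokens = run_experiment(["a great movie"], phi, vectorize=False)
# tokens == [["a", "great", "movie"]]
```

The point is just that, with vectorize=False, the model sees exactly what the featurizer returned.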
[00:08:03] All right, our model's early-stopping criterion was met after 49 epochs, and here's our look at the results, which are roughly comparable to what we saw with the softmax baseline.
[00:08:15] with the softmax baseline all right and that brings us to error
[00:08:16] all right and that brings us to error analysis which can be an important step
[00:08:18] analysis which can be an important step in improving your system i've written a
[00:08:20] in improving your system i've written a few functions that make use of all the
[00:08:22] few functions that make use of all the information that is encoded in the
[00:08:23] information that is encoded in the return values for ssd experiment which i
[00:08:25] return values for ssd experiment which i hope packaged together everything you
[00:08:27] hope packaged together everything you need to do rich error analysis reproduce
[00:08:29] need to do rich error analysis reproduce your results and make use of your model
[00:08:31] your results and make use of your model in downstream experiments
[00:08:33] in downstream experiments here we're going to use it for use this
[00:08:35] here we're going to use it for use this function find errors
[00:08:37] function find errors i've done a little bit of pre-processing
[00:08:39] i've done a little bit of pre-processing of the errors that were found and
[00:08:40] of the errors that were found and package them together and then this cell
[00:08:43] package them together and then this cell here is just an example of the kind of
[00:08:44] here is just an example of the kind of things that you might do
[00:08:46] things that you might do here we're looking at cases where the
[00:08:47] here we're looking at cases where the softmax model was correct the rnn was
[00:08:49] softmax model was correct the rnn was incorrect and the correct label is
[00:08:51] incorrect and the correct label is positive you could of course fiddle with
[00:08:53] positive you could of course fiddle with those parameters here
[00:08:55] those parameters here we've got 168 examples falling into that
[00:08:57] we've got 168 examples falling into that class and then we could look at a sample
[00:08:59] class and then we could look at a sample of the actual text and fall into that
[00:09:00] of the actual text and fall into that group as a way of figuring out how these
[00:09:02] group as a way of figuring out how these models differ and maybe improving one or
[00:09:04] models differ and maybe improving one or both of them
[00:09:06] All right, and that brings us to the homework questions. Again, these are meant to help you explore the data and set up some additional baselines that inform original-system development. We're going to start with one that's data-oriented; I've called this "token-level differences." What I'm trying to do is raise to your awareness the fact that the SST data and the new restaurant-review data are just encoded in different ways at the level of tokenization. This is mainly because the SST is the outcome of a historical process, beginning with Pang and Lee 2005 and going on through the SST project itself, so there are some funny things about it that I think could certainly affect any kind of transfer from one domain to the other. Since you are training on SST data, it's important to be aware of how it might be idiosyncratic.
[00:09:51] So that happens here: you write this function, get_token_counts, and as usual you have a test. You pass the test, you're in good shape.
[00:10:01] The next question relates to the cross-domain nature of our problem: training on some of the bakeoff data. In the standard paradigm, you are training on SST and evaluating on SST, and also on this new bakeoff dataset of restaurant-review sentences. What would happen if you augmented your training set with a little bit of data from the development set of restaurant-review sentences? You might have a hunch that's going to improve system performance, and this question simply asks you to run such an experiment. As usual, you have a test. I think you will find that this is very productive in helping your system get traction on the new data, and that should be a clue as to how to do a really good job in the bakeoff with your original system.
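The augmentation experiment can be sketched as follows, with toy stand-ins for the datasets; the split size and shuffling policy here are arbitrary illustrative choices, not the homework's specification:

```python
import random

# Hypothetical datasets: lists of (sentence, label) pairs.
sst_train = [("a charming film", "positive"), ("a tedious mess", "negative")]
restaurant_dev = [("great tacos", "positive"), ("rude staff", "negative"),
                  ("decent coffee", "positive"), ("cold soup", "negative")]

# Fold part of the restaurant dev set into training, keeping the rest
# held out so you can still evaluate on in-domain data you never trained on.
random.seed(0)
random.shuffle(restaurant_dev)
augment, still_dev = restaurant_dev[:2], restaurant_dev[2:]
train = sst_train + augment
```

Training on `train` and evaluating on `still_dev` then tests whether the small in-domain sample helps.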
[00:10:42] This next question is about feature representation: a more powerful vector-averaging baseline. This is a step toward deep learning. It builds on this section of the notebook, where essentially we average together vector representations of words to represent each example, and those averages are the input to a simple logistic regression classifier. Those are nice low-dimensional models that tend to be quite powerful. This question asks you to replace the logistic regression with a shallow neural classifier, which is maybe the "more powerful" part, and also to explore a wide range of hyperparameters for that model, to get a sense for which settings are best for our problem.
[00:11:20] And that brings us to BERT encoding, which is one step further down the line toward deep learning and fine-tuning. This question simply asks you to encode your examples using BERT, in particular taking the representation above the [CLS] token, the final output there, as your summary representation of the entire example. Those presumably become the inputs to some downstream classifier, or potentially a fine-tuning process. The idea is that this is one step better than the vector averaging we just looked at. You do not need to conduct an experiment with the SST; you're simply implementing this feature function here. But since sst.experiment does make it really easy to run experiments once you've implemented the feature function, I would encourage you to choose some classifier model and see how well this does. As usual, you have a test; the test is just about the feature function, and it will make sure you're using all these values correctly.
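With a Hugging Face-style model, the summary representation is the final-layer vector at sequence position 0, where the [CLS] token sits. A mock tensor makes the indexing concrete (NumPy stands in here for the actual model output, so nothing needs to be downloaded):

```python
import numpy as np

# Mock of BERT's final-layer output: shape (batch, sequence_length, hidden).
# With Hugging Face transformers this would be outputs.last_hidden_state.
batch, seq_len, hidden = 2, 8, 768
last_hidden_state = np.random.randn(batch, seq_len, hidden)

# The [CLS] token is at position 0; its final-layer vector serves as the
# summary representation of the whole example.
cls_reps = last_hidden_state[:, 0, :]
print(cls_reps.shape)  # (2, 768)
```

Those per-example vectors can then feed a downstream classifier, just as the averaged word vectors did.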
[00:12:12] And that brings us to the original system. I just want to remind you that you are unconstrained, except for the fact that you cannot make any use of the SST test set during development; the labels for that are off-limits. But everything else is fair game: bring in new training data, try new model architectures, and so forth. We've given a few ideas here, but this is by no means meant to be restrictive; it's just meant to get the creative juices flowing. Other than that, this is the same procedure as homework one: we want a description of your system, to inform the teaching team about what worked and what didn't, and it would be great if you reported your peak score, which is the average of the two macro-F1 scores for our two datasets, computed on the development sets.
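The peak score described here can be computed with scikit-learn's f1_score; the labels and predictions below are invented purely to show the arithmetic:

```python
from sklearn.metrics import f1_score

# Hypothetical dev-set gold labels and predictions for the two datasets.
sst_gold = ["positive", "negative", "neutral", "positive"]
sst_pred = ["positive", "negative", "negative", "positive"]
rest_gold = ["positive", "negative", "positive", "negative"]
rest_pred = ["positive", "negative", "negative", "negative"]

# Macro-F1 per dataset, then the mean of the two as the reported peak score.
sst_f1 = f1_score(sst_gold, sst_pred, average="macro")
rest_f1 = f1_score(rest_gold, rest_pred, average="macro")
score = (sst_f1 + rest_f1) / 2
```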
[00:12:56] And that brings us to the bakeoff. Again, the bakeoff procedure is familiar. The one crucial piece is that you write a predict_one function that maps a text directly to a prediction using your original system. I've given two examples here; yours might be simpler, depending on whether or not you use the sst.experiment framework. That all comes together with create_bakeoff_submission, where you input that function. You won't need to change the output file, and you can see that this function loads in our two test sets, which are unlabeled, uses your predict_one function on all of those examples, and then writes a file which you upload to the autograder on Gradescope. So I just want to reiterate: in all senses, the test-data labels are completely off-limits to us; all the development, conceptually and otherwise, should happen on the development data.
Lecture 012
Sentiment Analysis | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=sRw3Dtjhlk0
---
Transcript
[00:00:05] Hello everyone. This video kicks off our series of screencasts on supervised sentiment analysis. I just want to provide you with an overview of the problem and of the kind of work we'll be doing, and also a rationale for why we'll be doing it.
[00:00:19] So here's an overview of the entire unit. In this screencast, I want to motivate for you the idea that sentiment analysis is a deep problem and an important problem for NLU, not only scientifically but also for industrial applications. In the next screencast, I'll give you some practical, general tips for doing sentiment analysis. Following that, we'll have two short screencasts that introduce our core datasets: the Stanford Sentiment Treebank and a new dataset called DynaSent. After that, I'll introduce the code base we'll be working with on the assignment and the bakeoff: that's sst.py, which is included in the course code distribution. I'm going to use that code to illustrate some important methodological issues surrounding supervised analysis in general, namely hyperparameter tuning and the comparison of different classifiers to see whether they differ in some statistically significant sense. Then we'll talk about feature representation, both for large sparse linear models with hand-built features and for more deep-learning-oriented distributional representations, and that will be a nice segue into the final unit, which is on using recurrent neural networks as classifiers for supervised sentiment analysis.
[00:01:29] My hope is that this unit can provide a refresher on core concepts in supervised learning, introduce you to the problem of sentiment analysis, which I think is, as I said, a central problem for natural language understanding, and also set you on your way toward doing the assignment and the bakeoff, and possibly building projects in this space.
[00:01:52] For the associated materials: as I said, we've got a bunch of code. sst.py is the core module, and then we have a notebook introducing the Stanford Sentiment Treebank as a dataset. We have a second notebook on what I've called hand-built features and mostly linear models, and then a third notebook on using neural networks, which more or less pushes you to using distributional representations instead of hand-built features, although as you'll see, the notebooks explore various combinations of these ideas. The homework and the bakeoff are in the homework sentiment notebook, and I'm going to introduce that problem in a separate screencast.
[00:02:30] The core readings are the two papers oriented around our datasets, the Stanford Sentiment Treebank and DynaSent. As supplementary readings, you might enjoy this compendium from Pang and Lee: it's a kind of overview of the whole field of sentiment analysis, and it poses challenges and questions that are still relevant to this day. And then Goldberg 2015 is an excellent overview of using neural networks in NLP very generally, but with lots of helpful notation and so forth that we're aligned with, and that might help you get a feel for the landscape of modeling choices you might make in this space and in subsequent units of this course.
[00:03:07] So I want to start by just motivating the idea that sentiment analysis is an interesting problem, because you often hear people say things like "sentiment analysis is solved," or "it's overly simplistic," or "just too easy," and I think none of those things are true. To motivate that, I just want to do a little data-driven exercise with you. For these examples, you should ask yourself: which of these sentences expresses sentiment at all? And for the ones that you think do express sentiment, what is that sentiment? Is it positive or negative, or maybe neutral, or something else?
[00:03:37] You might think those are straightforward questions, but this is going to get difficult really fast. Consider the first example: "There was an earthquake in California." This is probably going to sound like bad news to you, and many sentiment analysis systems will assign this negative sentiment. But we should ask ourselves: is this actually a sentiment-laden sentence? It is, on the face of it, merely stating a fact, and we might hold that for sentiment to be expressed, we need some kind of subjective, evaluative perspective to be included, as in "It was bad that there was an earthquake in California." Absent the "it was bad" clause, this might just be a neutral statement of something that happened. The important point here is that unless we settle these questions, we'll have continued indeterminacy about what we're actually doing.
[00:04:24] "The team failed to complete the challenge." Is that positive or negative? We might agree that it's more than just a statement of fact, although it's a borderline case even for that question. But if we did decide it was sentiment-laden, we would need to figure out the perspective of the speaker: is the speaker advocating for this team, or advocating for a different team? We win, we lose: it's really going to depend on how the speaker is involved, and that of course will have to become part of our definition of what we're doing when we assign sentiment labels.
[00:04:54] "They said it would be great." On the face of it, this expresses no speaker perspective at all; it is merely reporting what somebody else said, and we need to decide, for those obviously different perspectives, what we're going to do in terms of sentiment analysis. After all, this could continue "they said it would be great, and they were right," which is straightforwardly positive, but it could also continue "they said it would be great, and they were wrong." I think that reveals that sentence three is not so obviously encoding a particular speaker perspective, whereas these continuation clauses are what really tell the story for us as sentiment analysts.
[00:05:29] sentiment analysts and then we get into things that you
[00:05:31] and then we get into things that you might call non-literal use of language
[00:05:33] might call non-literal use of language the party fat cats are sipping their
[00:05:35] the party fat cats are sipping their expensive imported wines this has a lot
[00:05:38] expensive imported wines this has a lot of positive language in it maybe only
[00:05:40] of positive language in it maybe only fat cats is the thing that sounds like a
[00:05:42] fat cats is the thing that sounds like a direct sneer
[00:05:43] direct sneer but i think we could agree that overall
[00:05:45] but i think we could agree that overall this is probably negative in its valence
[00:05:48] this is probably negative in its valence and that will be a challenge for our
[00:05:50] and that will be a challenge for our systems and also a challenge for us in
[00:05:52] systems and also a challenge for us in just characterizing precisely what was
[00:05:54] just characterizing precisely what was done here in terms of sentiment
[00:05:56] done here in terms of sentiment here's a similar example oh you're
[00:05:57] here's a similar example oh you're terrible this might be a criticism and
[00:06:00] terrible this might be a criticism and it might therefore be straightforwardly
[00:06:02] it might therefore be straightforwardly negative on the other hand it could be a
[00:06:04] negative on the other hand it could be a kind of teasing form of social bonding
[00:06:06] kind of teasing form of social bonding that overall has a positive effect on
[00:06:08] that overall has a positive effect on the discourse how are we going to
[00:06:09] the discourse how are we going to resolve that kind of context dependence
[00:06:12] resolve that kind of context dependence here's another one here's julia bastard
[00:06:14] here's another one here's julia bastard it's got some negative language even
[00:06:16] it's got some negative language even something that's kind of like a swear
[00:06:17] something that's kind of like a swear but this could be a friendly jocular
[00:06:20] but this could be a friendly jocular phrase of some kind um and we'll have to
[00:06:23] phrase of some kind um and we'll have to sort out whether it's friendly and fun
[00:06:25] sort out whether it's friendly and fun because of its negativity
[00:06:27] because of its negativity or whether this is straightforwardly
[00:06:29] or whether this is straightforwardly just a positive sentence
[00:06:32] And then here's a case that's just going to be a challenge for our systems. This is of the movie 2001, from an actual review: "Many consider the masterpiece bewildering, boring, slow-moving, or annoying." There is a lot of negative language there; in fact, there's very little that's positive except "masterpiece." But I think we can all anticipate that overall this is probably going to be a positive review of that movie.
[00:06:55] So that just shows you that even if we're clear about what we're doing in terms of sentiment, the linguistic challenge here is significant. And we could also extend that to sentiment like "long-suffering fans," "bittersweet memories," and "hilariously embarrassing moments": these are things that blend positivity and negativity and all sorts of other emotional dimensions in ways that make sentiment analysis very difficult to do reliably.
[00:07:18] And that's a nice segue into the topic of sentiment analysis in industry, because of course sentiment analysis is one of the first tasks that was really transformed by data-driven approaches, and it was the first task to really make an impact in industry. There are lots of startups and companies that offer sentiment analysis tools, and it has obvious import for things like marketing and customer experience and so forth.
[00:07:41] The first thing I would say is that, to this day, the sentiment from industry, so to speak, is that sentiment analysis tools still fall short. This is from an article from 2013, and the gist of it is: "Anyone who says they're getting better than 70% today is lying." Generally speaking, for whatever notion of 70% we have here, I think we can agree that that's too low, and that we as a field ought to be offering tools that are better.
[00:08:04] This is another kind of equivocal headline: emotion AI technology "has great promise when used responsibly," and "Affective computing knows how you feel, sorta." The "sorta" is kind of like the equivalent of 70% here, and I think it shows that there's a lot of work to be done if we're going to have the kind of impact we want to have in the technological sphere.
[00:08:25] And there's another dimension to this which we're not really going to get to capture, but it is worth planting in your minds, because it could become projects. We're going to do classification of sentiment into positive, negative, and neutral, and that's often the starting point for these industry tools. Many business leaders think they want these pie charts that point out, say, 30% negative, 70% positive, and then in Q2 the negativity is slightly up. And that's truly a leading indicator of something: it looks like negativity is on the rise. But the issue is, what do you do? How does this help with decision making? Merely classifying these texts and showing change over time is not enough for any business leader to take action. What we need to know is why this is happening and what the underlying factors are: basically, what are the customers saying, beyond these gross classifications into positive, negative, and neutral? We should be pushing ourselves to design tools that can offer that next layer of insight.
[00:09:21] Affective computing. This is a kind of transition into the wider world. We're going to focus on just sentiment analysis, but you could think about emotion analysis and all other kinds of context-dependent expression in language; put that under the heading of affective computing. This is a diagram from a paper I did a few years ago with Moritz Sudhof. It's a diagram of emotions and other kinds of moods that people feel. The arcs give you transitions: they show that people tend to transition systematically from one emotional state to another. So what we're seeing here is basically that this is a very high-dimensional space. It's not just positive, negative, neutral: we have a wide range of feelings and moods and emotions and states that we go into, and there's a lot of structure to how we experience those moods in our lives. It would be great to break out of the simple positive/negative mode and tackle all of these dimensions.
[00:10:15] And in that spirit, what I've done on this slide is just list out a whole bunch of other tasks that you might consider adjacent to sentiment analysis but that are meaningfully different from it: things like subjectivity, bias, stance-taking, hate speech, microaggressions, condescension, sarcasm, deception and betrayal, online trolls, polarization, politeness, and linguistic alignment. These are all deeply social things that are influenced by, and shape, our language. And I've selected these papers in particular because all of them have really nice, crisp statements of the problem and/or really great public data sets that you could use for experiments in this wide world. I think that's a very exciting space to explore as a kind of next step from what we're doing in this unit.
[00:11:02] But back down to earth: our primary data sets, as I said, are going to be the ternary formulation of the Stanford Sentiment Treebank, which is just positive/negative/neutral, and also the DynaSent data set, which has that same ternary formulation. The SST is movie reviews; DynaSent is mostly reviews of products and services, I think heavily biased toward restaurants, because the underlying data is from Yelp. And then for the bake-off we're going to have a new dev/test split: we'll use SST-3 as well as this new corpus of sentences from restaurant reviews, and you can see that DynaSent might be an asset here. They all have this ternary formulation, and I'm hoping that the combination of these data sets gives us a really interesting perspective not only on sentiment analysis but also on how we design systems that effectively transfer across domains and maybe learn simultaneously in multiple domains.
Lecture 013
General Practical Tips | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=qt-TU_f0HDw
---
Transcript
[00:00:05] Hello everyone. Welcome to part two in our series on supervised sentiment analysis. This screencast is going to focus on some general practical tips for doing work in this space, especially setting up a project and doing pre-processing of your data.
[00:00:19] pre-processing of your data so first i just wanted to give you links
[00:00:21] so first i just wanted to give you links to a whole bunch of benchmark data sets
[00:00:23] to a whole bunch of benchmark data sets in this space we're going to concentrate
[00:00:25] in this space we're going to concentrate on the sst and dynacent but there are a
[00:00:28] on the sst and dynacent but there are a lot of other choices you could make both
[00:00:30] lot of other choices you could make both for developing original systems and also
[00:00:32] for developing original systems and also supplementing training data that you've
[00:00:34] supplementing training data that you've got for a particular application some of
[00:00:36] got for a particular application some of these data sets are really really large
[00:00:38] these data sets are really really large and they cover a diversity of domains so
[00:00:40] and they cover a diversity of domains so these could be important assets for you
[00:00:42] these could be important assets for you in a similar spirit there are lots of
[00:00:44] in a similar spirit there are lots of sentiment lexicons out there they cover
[00:00:46] sentiment lexicons out there they cover different emotional dimensions and
[00:00:48] different emotional dimensions and different aspects of the problem and
[00:00:50] different aspects of the problem and they too could be used to help you with
[00:00:51] they too could be used to help you with powerful featurization they could
[00:00:53] powerful featurization they could supplement features that you've created
[00:00:55] supplement features that you've created or help you group your vocabulary into
[00:00:57] or help you group your vocabulary into interesting subcategories that would be
[00:00:59] interesting subcategories that would be powerful for making sentiment
[00:01:01] powerful for making sentiment predictions and these range from simple
[00:01:03] predictions and these range from simple wordless up to highly structured
[00:01:05] wordless up to highly structured multi-dimensional lexicons
[00:01:08] Now, for our first pre-processing step, I thought we would just talk a little bit about tokenization, because I think this can be a definitional choice that really affects downstream success. Just as a running example, let's imagine we start with this raw text, which is a kind of imagined tweet. We have an @-mention here, and then you can see that some of the markup has gotten a little bit garbled: we have an emoticon that looks sort of obscured, and a link at the end.
[00:01:33] I think, as a very preliminary step even before we tokenize, we might want to isolate some of that markup and replace the HTML entities. It's a pretty easy thing you can do that could really make a difference. Now we've got our apostrophe, we've got our emoticon intact, and then we still have the link and other things in here. So even before you do that, you might check to see whether a simple replacement of the HTML entities would make a difference in your data.
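In Python this entity replacement is one call to the standard library's `html.unescape`; the raw string here is an invented example:

```python
import html

# Garbled markup: a numeric entity for the apostrophe and a named entity
# hiding the ">" of the emoticon.
raw = "I can&#39;t wait &gt;:-D &amp; more"
clean = html.unescape(raw)  # numeric and named entities both handled
print(clean)  # I can't wait >:-D & more
```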
[00:01:58] Now we begin the tokenization question, and I think a good baseline choice here would be simply whitespace tokenizing: we split on whitespace and treat all the resulting strings as tokens. That would take our raw text up here and split it up, as you can see, onto these independent lines. This looks okay to me, though we're going to have a problem with our @-mention, because it has this colon on the end, so we might miss the fact that this is an actual @-mention. The unigrams look okay, although the date has been split apart. We've preserved our hashtag. We've got this token that might appear only once even though there's a clear, consistent signal there. We do have our emoticon, and our link is mostly intact, although this period could be disruptive if we actually want to follow the link, because it's still glommed onto the end of the URL.
[00:02:44] Treebank tokenizing is another very common scheme in NLP, I would say at this point largely for historical reasons. The way treebank tokenizing works is that it takes this raw text and splits it up into a whole lot of tokens; in comparison with whitespace, we have a lot of distinct pieces here, and this really looks kind of problematic. We have destroyed our @-mention: we don't have that username anymore. It does this interesting thing with words like "can't," which get split apart into two tokens. We've lost our date, and we have lost our hashtag. This is possibly good: "yay" has been split up according to its punctuation, so we now have four exclamation marks separated out from this word here. But our emoticon is completely lost, and our link has been really destroyed. So this looks problematic from the point of view of accurate featurization, and also for doing things with social media.
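The contrast between the two baselines is easy to see with NLTK's implementation; the tweet text here is an invented example along the lines of the one on the slide:

```python
from nltk.tokenize import TreebankWordTokenizer

text = "@NLUers: can't wait for Jun 9 #sentiment talks! :-D"

# Whitespace tokenizing: just split on whitespace, everything stays glued.
print(text.split())

# Treebank tokenizing: splits clitics ("can't" -> "ca", "n't") and most
# punctuation, which destroys the @-mention, hashtag, and emoticon.
print(TreebankWordTokenizer().tokenize(text))
```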
[00:03:35] So that kind of brings me to what we might want from what I've called a sentiment-aware tokenizer. We would like to isolate emoticons, because they can be really sentiment-laden. We probably want to respect Twitter and other domain-specific markup, because that's often the space our data come from and the kind of place we want to make predictions in. In a similar spirit, you might take advantage of underlying markup: maybe don't filter off the HTML, because there could be an important signal there. You should be aware that the website or data producer might have done some pre-processing of their own that might disrupt things like curses, which can of course carry a lot of important sentiment information. You might want to preserve capitalization, because of course that can be used for emphasis.
[00:04:17] In a similar spirit, you might want to regularize emotive lengthening, like the lengthened "yay" here, down to just three characters, to capture that it is an emotive lengthening but also to regularize all those distinct tokens.
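That regularization step only needs a regular expression; a minimal sketch:

```python
import re

def reduce_lengthening(text):
    """Collapse runs of 3+ repeated characters down to exactly 3."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

print(reduce_lengthening("yaaaaaay!!!!!!"))  # yaaay!!!
```

Keeping three characters (rather than one) preserves the signal that lengthening occurred, while mapping all the distinct lengthened variants to a single token.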
[00:04:29] And then, as a stretch goal, although this might be less important in the era of contextual models, you might think about capturing multi-word expressions that carry sentiment. Just think of an example like "out of this world," which is positive even though none of its component pieces is positive, so many models will miss that it is conveying clear sentiment, whereas with a clever tokenization scheme you might capture it as one single token.
[00:04:54] So here's a simple example that meets a lot of those goals for a sentiment-aware tokenizer. We begin from our usual raw text. We normalize and preserve the @-mention; we keep most of these words intact; and we capture that the June 9 thing was a date. We preserve the hashtag, of course. We're treating all of these potentially emotion-laden punctuation marks as separate unigrams, which I think could be good. And of course we capture the emoticon and capture the link.
[00:05:21] And if you want something that meets more or less all of these criteria, except I think the date normalization, you could just use the NLTK TweetTokenizer. It's a good, simple choice that I think will be useful for sentiment analysis.
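A quick look at it in action (the input string is invented); its `reduce_len=True` option performs the three-character regularization of emotive lengthening described above:

```python
from nltk.tokenize import TweetTokenizer

# Preserves case, @-mentions, hashtags, and emoticons; collapses char runs to 3.
tok = TweetTokenizer(reduce_len=True)
print(tok.tokenize("@NLUers: can't wait! #sentiment YAAAAAY!!!!! :-D"))
```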
[00:05:36] sentiment analysis and to quantify that a little bit here's a some experimental
[00:05:38] a little bit here's a some experimental evidence that i think is going to be
[00:05:39] evidence that i think is going to be relevant to the kind of work that you
[00:05:41] relevant to the kind of work that you all are doing
[00:05:42] all are doing so my data is open table that's
[00:05:44] so my data is open table that's restaurant reviews short ones
[00:05:46] restaurant reviews short ones i've got 6 000 reviews in my test set
[00:05:49] i've got 6 000 reviews in my test set and what i'm doing along the x axis here
[00:05:51] and what i'm doing along the x axis here is varying the amount of training data
[00:05:53] is varying the amount of training data that these systems can see
[00:05:55] that these systems can see it's simply a softmax classifier and my
[00:05:57] it's simply a softmax classifier and my primary manipulation is i have the
[00:05:59] primary manipulation is i have the sentiment aware tokenizer in orange
[00:06:01] sentiment aware tokenizer in orange treebank in green and whitespace in gray
[00:06:05] treebank in green and whitespace in gray and the picture is pretty clear right
[00:06:07] and the picture is pretty clear right along the x-axis we have accuracy it's a
[00:06:09] along the x-axis we have accuracy it's a balanced problem
[00:06:11] balanced problem and what you can see is that the
[00:06:13] and what you can see is that the sentiment aware tokenizer is the clear
[00:06:15] sentiment aware tokenizer is the clear winner here especially where training
[00:06:17] winner here especially where training data are sparse uh in the limit of
[00:06:19] data are sparse uh in the limit of adding lots of training data i think we
[00:06:21] adding lots of training data i think we can make up for a lot of shortcomings of
[00:06:23] can make up for a lot of shortcomings of tokenizers because we see a lot of
[00:06:25] tokenizers because we see a lot of redundancy in the training data
[00:06:27] redundancy in the training data but where data are sparse the center
[00:06:29] but where data are sparse the center manual tokenizer is clearly a good
[00:06:30] manual tokenizer is clearly a good choice and another thing i would add is
[00:06:32] choice and another thing i would add is that because it produces more intuitive
[00:06:34] that because it produces more intuitive tokens the sentiment aware models might
[00:06:36] tokens the sentiment aware models might be more interpretable in some sense
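The paradigm described here can be sketched as below. This is my own reconstruction, not the actual experiment code: a softmax classifier (multinomial logistic regression) over unigram counts, with the tokenizer as the manipulated variable and the training-set size varied:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def learning_curve(tokenizer, train_texts, train_y, test_texts, test_y, sizes):
    """Test-set accuracy at each training-set size for a given tokenizer."""
    scores = {}
    for n in sizes:
        # Unigram counts under the chosen tokenization scheme.
        vec = CountVectorizer(tokenizer=tokenizer, token_pattern=None)
        X = vec.fit_transform(train_texts[:n])
        clf = LogisticRegression(max_iter=1000).fit(X, train_y[:n])
        scores[n] = accuracy_score(test_y, clf.predict(vec.transform(test_texts)))
    return scores
```

Running this once with `str.split` and once with a sentiment-aware tokenizer (for example `TweetTokenizer().tokenize`) at several sizes gives curves like the ones plotted on the slide.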
[00:06:40] And to really connect with the homework you all are doing and the bake-off, this is what happens when we go across domains. Here I'm training on OpenTable restaurant reviews, but I'm going to test on movie-review sentences; otherwise this is the same experimental paradigm. Because of the cross-domain setting, the results are a little bit more chaotic, but I think again the sentiment-aware tokenizer is a clear winner, with the largest gains where training data are a little bit sparse, and that's the expected picture.
[00:07:09] So be thoughtful about tokenizing. As a counterpoint to that, I've called this section on stemming "the dangers of stemming," because what I want to try to do is convince you not to stem your data. But first, what is stemming? Stemming is a kind of pre-processing technique that collapses distinct word forms. There are three common, easy-to-use algorithms for this: the Porter stemmer, the Lancaster stemmer, and the WordNet stemmer. My criticisms are largely leveled at Porter and Lancaster.
[00:07:36] leveled at porter and lancaster here is the bottom line
[00:07:38] here is the bottom line in doing this kind of stemming you are
[00:07:40] in doing this kind of stemming you are apt to destroy many important sentiment
[00:07:42] apt to destroy many important sentiment distinctions making this a
[00:07:43] distinctions making this a counterproductive pre-processing step
[00:07:46] counterproductive pre-processing step on the other hand the word netstemmer
[00:07:48] on the other hand the word netstemmer does not have this problem
[00:07:50] does not have this problem it's much more conservative but it also
[00:07:52] it's much more conservative but it also doesn't really do enough to make it
[00:07:54] doesn't really do enough to make it worthwhile it's kind of costly to run it
[00:07:57] worthwhile it's kind of costly to run it has some requirements that might make it
[00:07:59] has some requirements that might make it simply not worth it and i would say that
[00:08:01] simply not worth it and i would say that the bottom line here for stemming is
[00:08:02] the bottom line here for stemming is that in an era where we have very large
[00:08:05] that in an era where we have very large sentiment data sets the function of
[00:08:06] sentiment data sets the function of stemming would be to collapse the size
[00:08:08] stemming would be to collapse the size of your vocabulary and make learning
[00:08:11] of your vocabulary and make learning more easier in small domains but we
[00:08:13] more easier in small domains but we mostly don't confront that problem
[00:08:15] mostly don't confront that problem anymore
[00:08:16] But just to drive this point home, here are some examples, focused on the Porter stemmer, of cases where running the Porter stemmer actually collapses clear sentiment distinctions according to the Harvard Inquirer, which is one of those lexicons I mentioned before. I've got "defense" and "defensive": they get collapsed down into the funny non-word "defens". "Extravagance" and "extravagant", different in sentiment, collapse down into a word fragment, and so forth for these other examples. I think this is showing that in preprocessing your data you might be removing some important sentiment signals.
[00:08:50] The Lancaster stemmer uses a very similar strategy and has arguably even more problems in this space. Here we've got the positive word "complement" and "complicate"; according to the Harvard Inquirer, they both get collapsed down into what is a completely distinct word, "comply". That should be concerning for many reasons, and the other examples make a very similar point.
[00:09:12] The WordNet stemmer, as I mentioned before, I think actually has something going for it; there might be cases where you'd want to use it. It's high precision, and it requires word–part-of-speech pairs. The general issue is just that it removes some comparative morphology, which is the only thing you might worry about for sentiment. Otherwise, it's going to take "exclaims", "exclaimed", and "exclaiming" and collapse them down, and that could be a useful compression of your feature space. It will leave "exclamation" alone, which I think is good. Similarly for these other items: they all get preserved across the two verb forms, but we preserve the adjective as different, and I think that could be good. As I said, the only concern is that "happy", "happier", and "happiest" all go down to their base form, whereas I think these could encode different gradations of sentiment that you might want to preserve.
[00:09:58] That's worth some thought, but overall I think you probably want to avoid doing stemming. To bring that home, let's return to my experimental paradigm: a softmax classifier, OpenTable reviews, 6,000 of them in my test set, and along the x-axis I'm varying the amount of training data I have. I think what you see is that the Porter and Lancaster stemmers, in purple and black respectively, are kind of forever behind; just simple sentiment-aware tokenizing gives you a lead. The lead is especially clear once you get out of this very sparse regime here with very few training instances.
[00:10:34] very few training instances to close just a few other preprocessing
[00:10:37] to close just a few other preprocessing techniques that you might think about so
[00:10:38] techniques that you might think about so you could part a speech tied your data
[00:10:41] you could part a speech tied your data in the spirit of trying to capture more
[00:10:43] in the spirit of trying to capture more sentiment distinctions that you might
[00:10:44] sentiment distinctions that you might capture otherwise so just for example a
[00:10:47] capture otherwise so just for example a rest or like a resting as an adjective
[00:10:49] rest or like a resting as an adjective is positive but a rest as a verb is
[00:10:50] is positive but a rest as a verb is typically negative
[00:10:52] typically negative um fine as an adjective is positive but
[00:10:55] um fine as an adjective is positive but to incur a fine as a noun is negative
[00:10:57] to incur a fine as a noun is negative and so forth you can see that some
[00:10:59] and so forth you can see that some sentiment distinctions actually do turn
[00:11:01] sentiment distinctions actually do turn on the part of speech of the word so
[00:11:04] on the part of speech of the word so treating all of your unigram features as
[00:11:07] treating all of your unigram features as based in word part of speech tag pairs
[00:11:10] based in word part of speech tag pairs could be useful for preserving some of
[00:11:11] could be useful for preserving some of these distinctions again as a
[00:11:13] these distinctions again as a pre-processing step to help your model
[00:11:15] pre-processing step to help your model be more attuned to these points of
[00:11:17] be more attuned to these points of variation in comparison
[00:11:19] But there are limits even to this, right? These are just some cases on the slide where, even within the same part of speech, we have an adjective that in one sense is positive and in another negative. For example, the adjective "mean" can mean hateful, but it can also mean excellent, as in "they make a mean apple pie". "Smart" as an adjective could be both painful and also bright and brilliant, and so forth; similarly for "serious", "fantastic", and "sneer". Depending on the context and the intention of the speaker, they can cut in different directions. So even part-of-speech tagging is going to be limited when it comes to really recovering the underlying word sense, even for something as low-dimensional as a sentiment distinction.
[00:12:05] Finally, now, here is another powerful technique that you might use and think about as you select and evaluate different models. This is what I've called simple negation marking. The phenomenon is just that if I have a verb like "enjoy", which sounds positive in isolation, of course its contribution to the overall sentiment will change depending on whether it's in the scope of a negation: "I didn't enjoy it" is negative. And negation can be expressed in many ways: as a modifier of auxiliaries, like "not"; as an adverb, like "never"; it could be in the subject, like "no one"; and it could even be encoded in things like "I have yet to enjoy it", which is a kind of negation. And then of course the negation can be very far away: "in five years I don't think I will enjoy it" is probably negative, but the negation is far away from the verb whose sentiment we want it to modulate.
[00:13:00] So here's a very simple method that I think was first explored by Das and Chen; it's also used in Pang et al. These are classic early sentiment analysis papers. The idea is simply to append a NEG suffix to every word in the sequence that appears between the negation and some clause-level mark of punctuation, to roughly indicate the semantic scope of the negation. This is a simple preprocessing step, highly heuristic. It would take a sentence like "no one enjoys it" and literally turn the unigrams "one", "enjoys", and "it" into variant forms where "one" has a NEG appended to it, and so does "enjoys", and so does "it". The idea is that in doing this we're giving our model the opportunity to discover that "enjoys" in this context is actually a different token, in some sense, than "enjoys" when it's not in the scope of negation. And for many of the linear models with hand-built features that we explore, simply making that initial distinction might create some space for your model to learn the interaction of negation with these other features.
[00:14:03] features and just to quantify it a little bit by way of rounding this out i
[00:14:04] little bit by way of rounding this out i think this slide shows the impact that
[00:14:06] think this slide shows the impact that this can have despite its simplicity so
[00:14:09] this can have despite its simplicity so similar we have open table as our test
[00:14:12] similar we have open table as our test set we're using a softmax classifier and
[00:14:14] set we're using a softmax classifier and the x-axis is again varying the amount
[00:14:16] the x-axis is again varying the amount of training data that we have
[00:14:19] of training data that we have the white space tokenizer isn't gray
[00:14:20] the white space tokenizer isn't gray it's the worst followed by tree bank and
[00:14:22] it's the worst followed by tree bank and green
[00:14:23] green then we have that sentiment aware
[00:14:25] then we have that sentiment aware tokenizer in orange
[00:14:27] tokenizer in orange and then way above them consistently for
[00:14:29] and then way above them consistently for all parts of the data here are sentiment
[00:14:31] all parts of the data here are sentiment aware plus that negation marking
[00:14:34] aware plus that negation marking that is obviously the superior model for
[00:14:36] that is obviously the superior model for all kinds of amounts of training data
[00:14:38] all kinds of amounts of training data and i think what that's showing is that
[00:14:40] and i think what that's showing is that the influence of negation is actually
[00:14:42] the influence of negation is actually really real and severe in a lot of
[00:14:44] really real and severe in a lot of sentiment data sets it's just very
[00:14:46] sentiment data sets it's just very common to combine
[00:14:48] common to combine sentiment words positive or negative
[00:14:50] sentiment words positive or negative with negation and it has this
[00:14:51] with negation and it has this predictable effect of kind of flipping
[00:14:53] predictable effect of kind of flipping the value
[00:14:54] the value so in doing this sentiment and doing
[00:14:56] so in doing this sentiment and doing this negation marking we're giving our
[00:14:58] this negation marking we're giving our model a better chance at discovering
[00:15:00] model a better chance at discovering exactly those distinctions
[00:15:02] exactly those distinctions and here's a similar set of results for
[00:15:04] and here's a similar set of results for cross domain where i'm starting on open
[00:15:06] cross domain where i'm starting on open table and testing on imdb again the
[00:15:09] table and testing on imdb again the results are a little bit more chaotic
[00:15:10] results are a little bit more chaotic but i think it's a clear win for
[00:15:13] but i think it's a clear win for the sentiment aware plus negation
[00:15:14] the sentiment aware plus negation marking model
Lecture 014
Stanford Sentiment Treebank | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=DxnXVbHGeBg
---
Transcript
[00:00:04] Welcome back, everyone. This is part three in our series on supervised sentiment analysis. This screencast is going to focus on the Stanford Sentiment Treebank. Let me start with a quick project overview. The associated paper is Socher et al. 2013. I think this paper is a kind of model of open science: at the project website you can see the full code and all the data, of course, as well as an API that will let you try out new examples and interact with the core models that are motivated in the paper.
[00:00:30] It's a sentence-level corpus; it's got about 11,000 sentences in total, and all of those sentences are originally from Rotten Tomatoes, so they are sentences from movie reviews. The sentences themselves were originally released by Pang and Lee in 2005; it's a kind of classic dataset. What the SST did was expand the dataset by labeling not only the full sentences but all of the sub-constituents, according to a kind of traditional syntactic parse of each of the examples, and those are all crowdsourced labels. What this means is that we have vastly more supervision signals all throughout the structure of these examples than we would get if we just had a single sentiment label for the entire sentence.
[00:01:13] The labels themselves in the underlying corpus are five-way labels that are extracted from workers' slider responses, so there's an initial layer of aggregation: workers made a slider choice, and those choices were all grouped together into five labels. And then we are going to work with a formulation that is even more collapsed, down to ternary sentiment; I'll return to that a bit later.
[00:01:33] that a bit later the fully labeled tree thing is one of
[00:01:35] the fully labeled tree thing is one of the really exciting aspects of this
[00:01:37] the really exciting aspects of this corpus that we will be able to take
[00:01:39] corpus that we will be able to take advantage of especially during training
[00:01:40] advantage of especially during training so the way that worked is there were
[00:01:42] so the way that worked is there were parses this is a simple
[00:01:44] parses this is a simple constituent parse of a sent the sentence
[00:01:46] constituent parse of a sent the sentence nlu is enlightening and as i've
[00:01:48] nlu is enlightening and as i've indicated here we have labels in that
[00:01:50] indicated here we have labels in that space zero through four
[00:01:53] space zero through four on all of the lexical items nlu is an
[00:01:55] on all of the lexical items nlu is an enlightening as well as the altis sub
[00:01:57] enlightening as well as the altis sub constituents in this phrase so you can
[00:01:59] constituents in this phrase so you can see that is is neutral but since
[00:02:01] see that is is neutral but since enlightening is positive the whole verb
[00:02:03] enlightening is positive the whole verb phrase is enlightening is positive
[00:02:05] phrase is enlightening is positive we can say that nlu is neutral but in
[00:02:08] we can say that nlu is neutral but in the context of this sentence the overall
[00:02:09] the context of this sentence the overall contribution is a highly positive one so
[00:02:12] contribution is a highly positive one so label four on the root
[00:02:14] label four on the root um
[00:02:15] In the first screencast for this unit I motivated sentiment analysis with some cases that I thought were kind of difficult from a syntactic point of view. This is one of them: "they said it would be great". I love how this is being handled. We can see that down here "be great" is pretty clearly positive, but by the time we have filtered that through this report, "they said", which kind of displaces the sentiment onto another agent, the speaker is not necessarily endorsing the claim of greatness; what we get in the end is more like a neutral sentiment. I think that's interesting, and we can extend it even further, right? These are actual predictions from the model that's motivated in the underlying paper. If we take that constituent that I just showed you and conjoin it with "they were wrong", which is clearly negative, strikingly the model is able to figure out that the overall sentiment is determined by this second clause and assigns negative to the entire thing, despite the fact that there are obviously sub-constituents in here that are positive. That's exactly the kind of mixing that I think is correct for how language works in the domain of sentiment, and it's encouraging to see that this model is able to capture at least some aspects of it.
[00:03:18] Here's a similar case that I think is pretty good as well, although maybe not as striking in the end. Here I've just changed the previous example from "they were wrong" to "they were right". It knows the positive sense of "right", and it seems to get that this is kind of middle-of-the-scale. I might hope this was a three or a four, but I think we're still seeing some interesting interactions between what's happening in the sub-constituents of these examples and the prediction that's made at the root level, so it's very encouraging.
[00:03:47] There are a bunch of ways that you can formulate the SST task. The raw one that comes from the paper would be a five-way classification problem, where we'd have these numerical labels with the meaning that zero is very negative, one is negative, two is neutral, three is positive, and four is very positive. I think this is fine, but there are two gotchas underlying this kind of scheme. First, it's not really a fully ordered scale, in the sense that four is stronger than three but zero is stronger than one, because we have this kind of polarity split with neutral in the center. So that's a conceptual difficulty. And then the other part is that, by and large, the classifier models you pick will not give you partial credit for being close. We might hope that a model that predicted a one, negative, was kind of right, or certainly more right, if the true label is zero, than a model that had predicted four. But of course, if these are all treated as independent classification bins, then you're just equally wrong no matter which prediction you made relative to the gold label, and that seems kind of unfair to our models.
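One way to see the partial-credit point in code: under 0/1 accuracy, predicting 1 against a gold 0 scores exactly the same as predicting 4, while an ordinal metric such as mean absolute error does distinguish them. This is just an illustrative comparison on made-up labels, not a metric from the paper.

```python
# Why plain classification accuracy is "unfair" on the 5-way task:
# it gives no partial credit for near misses, while MAE does.
# Labels follow the lecture's 0-4 scale; the data is invented.

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def mae(gold, pred):
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)

gold = [0, 0, 4, 3]
near = [1, 0, 3, 3]   # the misses are off by one
far  = [4, 0, 0, 3]   # the misses are maximally wrong

print(accuracy(gold, near), accuracy(gold, far))  # 0.5 0.5 -- identical
print(mae(gold, near), mae(gold, far))            # 0.5 2.0 -- distinguished
```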
[00:04:55] We are going to work with what I've called the ternary problem. I think this is the minimal problem that really makes sense conceptually. For this one, we group zero and one into a negative category, three and four into a positive category, and reserve two, as before, for what we're calling neutral. This avoids the false presupposition that every sentence is either negative or positive, because it allows us to make predictions into this neutral, non-sentiment-laden space.
[00:05:24] It's very common, and you see this in the paper as well as in a lot of work on the SST, to formulate this as a binary problem. For the binary problem, we simply remove the middle of the scale and treat zero and one as negative and three and four as positive, as before. I think that has two drawbacks: first, we have to throw away some data, and second, we're making the false presupposition that every sentence can be classified as either negative or positive, when for a wide range of cases in the world that might be inappropriate.
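The two collapsings just described can be written down as small label maps. This is a minimal sketch under the SST's 0-4 labeling, not the course's distributed code:

```python
# Collapse the SST's five-way labels {0..4} into the ternary and binary
# formulations described above. (Illustrative sketch only.)
def ternary(label):
    """0/1 -> negative, 2 -> neutral, 3/4 -> positive."""
    return {0: "negative", 1: "negative", 2: "neutral",
            3: "positive", 4: "positive"}[label]

def binary(label):
    """Drop the middle of the scale; label 2 is discarded entirely."""
    if label == 2:
        return None  # the "throw away some data" drawback
    return "negative" if label < 2 else "positive"
```

Dropping label 2 in the binary case is exactly the data loss mentioned above; the ternary map keeps those examples as neutral.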
[00:05:51] Now, I have focused here on the root-level problem. You can see that the numbers here for train and dev are small, and the test-set numbers are a little bit larger than for dev, so they're comparable.

[00:06:00] But we can also think of this as the all-nodes task, because recall that a hallmark feature of the SST is that every single sub-constituent in these examples has been labeled by crowd workers. So we could treat each one of those as a kind of independent classification problem. They have the same range of values that they can take on, so we can do a similar kind of collapsing down into the ternary problem or the binary problem, and of course here we have a much larger dataset.
[00:06:29] For us, we're going to by and large work with the data in one particular way, which I think is common in the literature. As I said, we're going to use the ternary formulation, so our labels will be positive, negative, and neutral.

[00:06:42] When we do the dev and test steps, we're going to test only on full examples, so for them we will not make predictions into the sub-constituent space.

[00:06:53] And then, as a default for the code, as you'll see, it is set up to train only on full examples. So for these two cases, "NLU is enlightening" and "NLU is not enlightening": if those were two independent sentences in the corpus, we would train just on those two independent examples, one labeled positive and the other labeled negative.
[00:07:11] other labeled negative however you might imagine that you'll
[00:07:13] however you might imagine that you'll get a lot more strength in training if
[00:07:16] get a lot more strength in training if you also train on all the
[00:07:17] you also train on all the sub-constituents which would mean
[00:07:19] sub-constituents which would mean essentially expanding this example into
[00:07:21] essentially expanding this example into its full root version and all you is
[00:07:23] its full root version and all you is enlightening but also all of the
[00:07:25] enlightening but also all of the sub-pieces that are captured and labeled
[00:07:27] sub-pieces that are captured and labeled in corpus so that would give you many
[00:07:28] in corpus so that would give you many more examples and much more diversity uh
[00:07:31] more examples and much more diversity uh and then of course not enlightening
[00:07:33] and then of course not enlightening would be split apart as well and then
[00:07:35] would be split apart as well and then you could decide for yourself in
[00:07:36] you could decide for yourself in addition whether you want to treat this
[00:07:38] addition whether you want to treat this as two instances of enlightening or one
[00:07:41] as two instances of enlightening or one and the code facilitates all this you
[00:07:42] and the code facilitates all this you can formulate it as a route only trading
[00:07:44] can formulate it as a route only trading scenario or as a sub-constituent
[00:07:46] scenario or as a sub-constituent training scenario and you can keep or
[00:07:48] training scenario and you can keep or remove duplicates
[00:07:50] This is going to impact the amount of computational resources that you need for training models, but of course bigger could be better in this space, because you're just seeing much more gold-labeled information.

[00:08:02] So that's the overview. For much more on this, and on how to work with our distribution of the corpus and so forth, I would encourage you to work through the notebook that I have linked at the bottom here.
Lecture 015
DynaSent | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=o-UFgavFlQg
---
Transcript
[00:00:05] Hello everyone, welcome to part four in our series on supervised sentiment analysis. This is the second screencast in the series that is focused on a dataset for sentiment, and that dataset is DynaSent. This video could be considered an optional element in the series; I'm offering it for two reasons, really. First, this is a new dataset that I helped produce, and I would love it if people worked on it. It would be great to see some new models and new insights; all of that would help push this project forward in interesting ways. The second reason is more practical: I think that this dataset could be useful to you as you work on the assignment and the associated bake-off. You could use the dataset itself for supplementary training data, you could use it to evaluate your system, and, as you'll see, there are a few points of conceptual connection between this dataset and the brand-new dev and test sets of restaurant sentences that are part of the bake-off this year.
[00:00:57] So let's dive in. Here's a project overview. First, all the data, code, and models are available on GitHub at this link. The dataset itself consists of about 122,000 sentences across two rounds, and I'm going to cover what each round means. Each of the sentences has five gold labels, in addition to an inferred majority label where there is one, and I'll return to that as well; I think that's an interesting aspect of this kind of data collection. The associated paper is Potts et al. 2020, which I encourage you to read if you want to learn even more about this dataset and how, in particular, it relates to the Stanford Sentiment Treebank, our other core dataset. And another ingredient here, as you'll see when we get to round two, is that this is partly an effort in model-in-the-loop adversarial dataset creation: for round two, crowd workers interacted with a model, attempting to fool it, and thereby created sentences that are really difficult and are going to challenge our models in what we hope are exciting and productive ways.
[00:01:57] So here's the complete project overview; let me walk through it quickly and then we'll dive into the details. We begin with what we've called model zero, which is a RoBERTa model that's fine-tuned on a bunch of very large sentiment benchmark datasets. The primary utility of model zero is that we're going to use it as a device to find challenging naturally occurring sentences in a large corpus, and then we human-validate those to get actual labels for them. The result of that process is, we hope, a really challenging round one dataset of naturally occurring sentences that are hard for a very good sentiment model like model zero.

[00:02:35] On that basis, we then train a model one, which is similar to model zero but now extended with that round one training data. We hope that in bringing in that new data and combining it with the sentiment benchmarks, we get an even stronger model. That is the model that crowd workers interacted with on the Dynabench platform to try to create examples that are adversarial with respect to model one, so they ought to be really difficult. We feed those through exactly the same human validation pipeline, and that gives us our second round of data.

[00:03:07] So: two rounds of data that can be thought of as separate problems or merged together into a larger dataset. I think we're still deciding how best to conceptualize these various data assets.
[00:03:18] So let's look at round one in a little more detail. This is where we begin with model 0 and try to harvest interesting naturally occurring sentences. Model 0 is a RoBERTa-based classifier, and its training data are from Customer Reviews, which is small; the IMDB dataset, which I linked to in an earlier screencast; SST-3, which you saw in the previous screencast; and then these two very large external benchmarks of product and service reviews from Yelp and Amazon, and you can see that they're very big indeed. The performance of model zero on these datasets, our three external datasets, is pretty good: it ranges from the low 70s for SST-3 to the high 70s for Yelp and Amazon. So this is a solid model, and I will say, impressionistically, that if you download model 0 and play around with it, you will find that it is a very good sentiment model indeed.
[00:04:12] model indeed so we use model 0 to harvest what we
[00:04:14] so we use model 0 to harvest what we hope are challenging sentences and for
[00:04:16] hope are challenging sentences and for this we use the yelp academic data set
[00:04:18] this we use the yelp academic data set which is a very large collection about 8
[00:04:20] which is a very large collection about 8 million reviews and our heuristic is
[00:04:22] million reviews and our heuristic is that we're going to favor in our
[00:04:24] that we're going to favor in our sampling process harvesting sentences
[00:04:26] sampling process harvesting sentences where the review was one star
[00:04:28] where the review was one star so it's very low and model 0 predicted
[00:04:31] so it's very low and model 0 predicted positive for a given sentence and
[00:04:33] positive for a given sentence and conversely where the review is 5 stars
[00:04:36] conversely where the review is 5 stars and model 0 predicted negative
[00:04:38] and model 0 predicted negative we are hoping that that at least creates
[00:04:40] we are hoping that that at least creates a bias for sentences that are very
[00:04:41] a bias for sentences that are very challenging for model 0 where it's
[00:04:43] challenging for model 0 where it's actually making a wrong prediction we're
[00:04:45] actually making a wrong prediction we're not going to depend on that assumption
[00:04:46] not going to depend on that assumption because we'll have a validation step but
[00:04:48] because we'll have a validation step but we're hoping that this is a kind of as
[00:04:50] we're hoping that this is a kind of as adversarial as we can be without
[00:04:52] adversarial as we can be without actually having labels to begin
[00:04:55] actually having labels to begin this is a picture of the validation
[00:04:57] this is a picture of the validation interface you can see that um there were
[00:04:59] interface you can see that um there were some examples given in a little bit of
[00:05:01] some examples given in a little bit of training about how to use the labels and
[00:05:03] training about how to use the labels and then fundamentally what crowd workers
[00:05:05] then fundamentally what crowd workers did is they were prompted for a sentence
[00:05:06] did is they were prompted for a sentence and they made one of four choices
[00:05:08] and they made one of four choices positive negative no sentiment which is
[00:05:10] positive negative no sentiment which is our notion of neutral and mixed
[00:05:13] our notion of neutral and mixed sentiment which is indicating a sentence
[00:05:15] sentiment which is indicating a sentence that has a balance of positive and
[00:05:16] that has a balance of positive and negative sentiments expressed in it i
[00:05:18] negative sentiments expressed in it i think that's an important category to
[00:05:20] think that's an important category to single out we're not going to try to
[00:05:21] single out we're not going to try to model those sentences but we certainly
[00:05:23] model those sentences but we certainly want crowd workers to register that kind
[00:05:25] want crowd workers to register that kind of mixing of emotions
[00:05:28] of mixing of emotions where it appears
[00:05:30] where it appears so here's the resulting data set and
[00:05:32] so here's the resulting data set and because we got five gold labels for
[00:05:34] because we got five gold labels for every sentence there are two
[00:05:36] every sentence there are two perspectives that you can take the first
[00:05:38] perspectives that you can take the first one i've called distributional train and
[00:05:40] one i've called distributional train and this is where essentially we take each
[00:05:41] this is where essentially we take each one of the examples and reproduce it
[00:05:44] one of the examples and reproduce it five times for each of the labels that
[00:05:46] five times for each of the labels that it got so if an individual sentence got
[00:05:49] it got so if an individual sentence got three positive labels
[00:05:50] three positive labels two negative then we would have five
[00:05:53] two negative then we would have five examples three labeled positive and
[00:05:55] examples three labeled positive and three labeled negative with the actual
[00:05:56] three labeled negative with the actual text of the example repeated five times
[00:05:59] text of the example repeated five times what that is doing is essentially
[00:06:01] what that is doing is essentially simulating having a distribution over
[00:06:04] simulating having a distribution over the labels and for many classifier
[00:06:06] the labels and for many classifier models that is literally the same as
[00:06:08] models that is literally the same as training on the distribution of the
[00:06:09] training on the distribution of the labels as given by our crowd workers i
[00:06:11] labels as given by our crowd workers i think this is an exciting way to bring
[00:06:13] think this is an exciting way to bring in uncertainty
[00:06:15] in uncertainty and capture the fact that there might be
[00:06:17] and capture the fact that there might be kind of inherent disagreement among the
[00:06:19] kind of inherent disagreement among the crowd workers that we want our model to
[00:06:20] crowd workers that we want our model to at least grapple with
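Here is a toy sketch of the distributional-train expansion and of the majority-label rule it sits alongside (the sentence and the votes are invented):

```python
from collections import Counter

# Expand one annotated sentence into one (text, label) row per worker vote,
# simulating a distribution over labels.
def distributional_rows(text, labels):
    return [(text, lab) for lab in labels]

votes = ["positive", "positive", "positive", "negative", "negative"]
rows = distributional_rows("The fries were cold.", votes)

# A majority label exists when some label gets at least 3 of the 5 votes.
majority, count = Counter(votes).most_common(1)[0]
```

Training on `rows` weights each label by how many workers chose it, while the majority view would keep a single (text, majority) pair.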
[00:06:22] at least grapple with and in the paper as we discuss these
[00:06:24] and in the paper as we discuss these this gives better models than training
[00:06:27] this gives better models than training on just the majority labels but you can
[00:06:29] on just the majority labels but you can take a more traditional view so majority
[00:06:31] take a more traditional view so majority label here means that at least three of
[00:06:33] label here means that at least three of the five workers chose that label uh
[00:06:36] the five workers chose that label uh that gives you 94 or 95 000 sentences
[00:06:39] that gives you 94 or 95 000 sentences per train and then these devon test sets
[00:06:41] per train and then these devon test sets have 3 600 examples each and presumably
[00:06:43] have 3 600 examples each and presumably we would predict just the majority label
[00:06:45] we would predict just the majority label for them
[00:06:46] for them what's more open is how we train these
[00:06:48] what's more open is how we train these systems
[00:06:50] systems and in the end what we found is that 47
[00:06:52] and in the end what we found is that 47 of these examples are adversarial with
[00:06:54] of these examples are adversarial with respect to model 0.
[00:06:56] respect to model 0. and as you'll see the dev and test set
[00:06:58] and as you'll see the dev and test set are designed so that model 0 performs a
[00:07:00] are designed so that model 0 performs a chance on them
[00:07:02] chance on them yeah that's some model zero versus the
[00:07:03] yeah that's some model zero versus the human so here's a summary of the
[00:07:05] human so here's a summary of the performance i showed you these
[00:07:07] performance i showed you these categories before and i'm just signaling
[00:07:09] categories before and i'm just signaling that we have by design ensure that model
[00:07:11] that we have by design ensure that model zero performs chance on round zero
[00:07:14] zero performs chance on round zero we could compare that to our human
[00:07:16] we could compare that to our human baseline for this we kind of synthesized
[00:07:19] baseline for this we kind of synthesized five annotators and did pairwise f1
[00:07:21] five annotators and did pairwise f1 scoring for them to get an estimate of
[00:07:23] scoring for them to get an estimate of human performance that is on the same
[00:07:25] human performance that is on the same scale as what we've got from model 0 up
[00:07:27] scale as what we've got from model 0 up here and we put that estimate at 88
[00:07:30] here and we put that estimate at 88 for the dev and test sets i think that's
[00:07:32] for the dev and test sets i think that's a good conservative number i think if
[00:07:34] a good conservative number i think if you got close to it that would be a
[00:07:36] you got close to it that would be a signal that we had kind of saturated
[00:07:37] signal that we had kind of saturated this round and would like to think about
[00:07:39] this round and would like to think about additional data set creation i do want
[00:07:41] additional data set creation i do want to signal though that i think this is a
[00:07:43] to signal though that i think this is a conservative estimate of how humans do
[00:07:45] conservative estimate of how humans do and one indicator of that is that
[00:07:47] and one indicator of that is that actually 614
[00:07:49] actually 614 of the roughly 1200 people who worked on
[00:07:51] of the roughly 1200 people who worked on this task for validation never disagreed
[00:07:54] this task for validation never disagreed with the majority label which sort of
[00:07:56] with the majority label which sort of starts to suggest that there are humans
[00:07:58] starts to suggest that there are humans who are performing perfectly at this
[00:08:00] who are performing perfectly at this task putting this at a pretty low bound
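The pairwise-F1 idea can be sketched as follows: treat each annotator in turn as gold, score every other annotator against them with macro-F1, and average over all ordered pairs. This is a from-scratch illustration with toy labels, not the paper's evaluation code:

```python
from itertools import permutations

def macro_f1(gold, pred):
    """Macro-averaged F1 over the label set, computed from scratch."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def pairwise_f1(annotations):
    """Average macro-F1 over all ordered annotator pairs."""
    pairs = list(permutations(annotations, 2))
    return sum(macro_f1(g, p) for g, p in pairs) / len(pairs)
```

Perfect agreement among annotators yields 1.0, and any disagreement pulls the average down, which is why this estimate is conservative.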
[00:08:03] task putting this at a pretty low bound and here are some example sentences
[00:08:04] and here are some example sentences these are fully randomly sampled with
[00:08:06] these are fully randomly sampled with the only bias being that i set a length
[00:08:08] the only bias being that i set a length restriction so that the slide would be
[00:08:10] restriction so that the slide would be manageable these are the same examples
[00:08:11] manageable these are the same examples that appear in the paper where we needed
[00:08:14] that appear in the paper where we needed to fit them all into a pretty small
[00:08:15] to fit them all into a pretty small table i think this is illuminating
[00:08:17] table i think this is illuminating though so it's showing all the different
[00:08:18] though so it's showing all the different ways that model 0 could get confused
[00:08:20] ways that model 0 could get confused with respect to the majority response
[00:08:23] with respect to the majority response and i would like to highlight for you
[00:08:24] and i would like to highlight for you that there is a real discrepancy here on
[00:08:26] that there is a real discrepancy here on the neutral category what we find is
[00:08:29] the neutral category what we find is that because model zero was trained on
[00:08:31] that because model zero was trained on large external benchmarks its notion of
[00:08:33] large external benchmarks its notion of neutral actually mixes together things
[00:08:36] neutral actually mixes together things that are mixed sentiment
[00:08:38] that are mixed sentiment and things that are highly uncertain
[00:08:39] and things that are highly uncertain about the sentiment that it's expressed
[00:08:41] about the sentiment that it's expressed for whatever reason so you get a lot of
[00:08:42] for whatever reason so you get a lot of borderline cases and a lot of cases
[00:08:45] borderline cases and a lot of cases where humans are kind of inherently
[00:08:47] where humans are kind of inherently having a hard time agreeing about what
[00:08:48] having a hard time agreeing about what the fixed sentiment label would be
[00:08:52] the fixed sentiment label would be i think that dynasty is doing a better
[00:08:54] i think that dynasty is doing a better job of capturing some notion of neutral
[00:08:56] job of capturing some notion of neutral in these labels over here then we should
[00:08:58] in these labels over here then we should be a little wary of treating three-star
[00:09:00] be a little wary of treating three-star reviews and things like that as a true
[00:09:02] reviews and things like that as a true proxy for neutrality
[00:09:05] proxy for neutrality um
[00:09:06] um this is a good point to signal that the
[00:09:08] this is a good point to signal that the validation and test sets
[00:09:10] validation and test sets for the bake off of the restaurant
[00:09:12] for the bake off of the restaurant sentences were validated in the same way
[00:09:15] sentences were validated in the same way as dinosaurs so those sentences will
[00:09:17] as dinosaurs so those sentences will have the same kind of neutrality
[00:09:19] have the same kind of neutrality that dynacent has which could be opposed
[00:09:22] that dynacent has which could be opposed to the sense of neutrality that you get
[00:09:24] to the sense of neutrality that you get from the stanford sentiment tree bank
[00:09:26] from the stanford sentiment tree bank which was of course underlyingly kind of
[00:09:28] which was of course underlyingly kind of gathered in the setting of having a
[00:09:30] gathered in the setting of having a fixed five-star rating scale
[00:09:34] So that's round one; that's all naturally occurring sentences. Let's turn to round two. Recall that we benefit from round one at this point by training a brand-new model on all those external datasets plus the round one dataset. Then we have workers on Dynabench interact with this model to try to fool it, and we validate the resulting sentences to get our round two dataset. Model one is again a RoBERTa-based classifier. What we've done for our training here is more or less carry over what we did for the first round, except we have up-sampled the SST to give it more weight, and we have dramatically up-sampled the distributional labels from our round one dataset, effectively trying to give it equal weight with all of these other datasets combined in the training procedure. So we're trying to get a model that, as a priority, does really well on our round one dataset.
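The up-sampling move can be sketched as repeating the smaller dataset until it roughly matches a target size. The counts here are invented; the real weighting details are in the paper:

```python
# Repeat a small dataset (cycling through it) until it reaches roughly
# the size of the combined benchmarks, giving it comparable weight.
def upsample(rows, target_size):
    reps, rem = divmod(target_size, len(rows))
    return rows * reps + rows[:rem]

round1 = [("r1 sentence %d" % i, "neutral") for i in range(3)]
benchmarks_size = 10  # invented stand-in for the combined benchmark count
combined = upsample(round1, benchmarks_size)
```

After up-sampling, each round-one example appears several times per epoch, which is what gives the small dataset extra pull during training.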
[00:10:23] round one data set here's a look at the performance of this
[00:10:25] here's a look at the performance of this model
[00:10:26] model and first i would just note that it's
[00:10:28] and first i would just note that it's doing well on round one right about 81%
[00:10:31] doing well on round one right about 81% which is well below humans but certainly
[00:10:33] which is well below humans but certainly much better than the chance performance
[00:10:35] much better than the chance performance by design that we set up for model zero
[00:10:38] by design that we set up for model zero i do want to signal though that we have
[00:10:39] i do want to signal though that we have a kind of drop in performance for a few
[00:10:41] a kind of drop in performance for a few of these categories you can see that
[00:10:43] of these categories you can see that especially for yelp and amazon where
[00:10:45] especially for yelp and amazon where model 0 was for example at about 80% here
[00:10:48] model 0 was for example at about 80% here model one dropped down to 73% and it's a
[00:10:51] model one dropped down to 73% and it's a similar picture for dev and more or less
[00:10:53] similar picture for dev and more or less that's repeated for amazon with a drop
[00:10:55] that's repeated for amazon with a drop from about 76% to 73%
[00:10:58] from about 76% to 73% and 77% to 73% similarly so we have a
[00:11:01] and 77 to 73 similarly so we have a trade-off in performance that i believe
[00:11:03] trade-off in performance that i believe traces to the fact that we are
[00:11:04] traces to the fact that we are performing some changes to the
[00:11:07] performing some changes to the underlying semantics of the labels
[00:11:09] underlying semantics of the labels but that's something to keep in mind and
[00:11:10] but that's something to keep in mind and you can see that there's a tension here
[00:11:12] you can see that there's a tension here as we try to do well at our data set
[00:11:15] as we try to do well at our data set versus continuing to do well on these
[00:11:17] versus continuing to do well on these fixed external benchmarks
[00:11:21] fixed external benchmarks here's the dynabench interface and
[00:11:22] here's the dynabench interface and there's one thing that i want to note
[00:11:24] there's one thing that i want to note about it this is the stock interface but
[00:11:25] about it this is the stock interface but we've actually concentrated on a
[00:11:27] we've actually concentrated on a condition that we called the prompt
[00:11:29] condition that we called the prompt condition where workers instead of
[00:11:31] condition where workers instead of having to just write a sentence as a
[00:11:32] having to just write a sentence as a blank slate you know sit down to an
[00:11:34] blank slate you know sit down to an empty buffer and try to fool the model
[00:11:36] empty buffer and try to fool the model they were given an inspirational prompt
[00:11:38] they were given an inspirational prompt which was an attested sentence from the
[00:11:40] which was an attested sentence from the yelp academic data set and invited to
[00:11:43] yelp academic data set and invited to modify that sentence if they chose in
[00:11:45] modify that sentence if they chose in order to achieve their goal of fooling
[00:11:47] order to achieve their goal of fooling the model in a particular way and this
[00:11:49] the model in a particular way and this proved to be vastly more productive it
[00:11:51] proved to be vastly more productive it led to more diverse and realistic
[00:11:53] led to more diverse and realistic sentences i think we'd essentially freed
[00:11:55] sentences i think we'd essentially freed the crowd workers from the creative
[00:11:57] the crowd workers from the creative burden of having each time to come up
[00:11:59] burden of having each time to come up with a completely new sentence and we're
[00:12:01] with a completely new sentence and we're hoping that this procedure leads to
[00:12:03] hoping that this procedure leads to fewer artifacts more diversity and more
[00:12:06] fewer artifacts more diversity and more realism for this adversarial data set
[00:12:09] realism for this adversarial data set collection procedure
[00:12:12] our validation pipeline was exactly the
[00:12:14] our validation pipeline was exactly the same as round one and here is the
[00:12:16] same as round one and here is the resulting data set it's a little bit
[00:12:17] resulting data set it's a little bit smaller because this kind of adversarial
[00:12:19] smaller because this kind of adversarial data set collection is hard and you can
[00:12:21] data set collection is hard and you can see how good model one is it was
[00:12:24] see how good model one is it was actually pretty hard for crowd workers
[00:12:25] actually pretty hard for crowd workers to fool this model they did so only
[00:12:27] to fool this model they did so only about 19% of the time
[00:12:29] about 19% of the time uh here's the data set for
[00:12:31] uh here's the data set for distributional training you have about
[00:12:32] distributional training you have about 93 000 sentences and if you go for the
[00:12:35] 93 000 sentences and if you go for the majority label training you have about
[00:12:37] majority label training you have about 19 000 and the devon test sets are
[00:12:39] 19 000 and the devon test sets are smaller but again the reason they're
[00:12:41] smaller but again the reason they're smaller is that they are designed to set
[00:12:43] smaller is that they are designed to set model one as having chance performance
[00:12:45] model one as having chance performance on this data set
[00:12:47] on this data set and so that's what i'll flesh out here
[00:12:49] and so that's what i'll flesh out here you can see that this model has chance
[00:12:50] you can see that this model has chance performance i showed you before that
[00:12:52] performance i showed you before that it's doing pretty well on round one and
[00:12:54] it's doing pretty well on round one and we had that kind of tension with the
[00:12:56] we had that kind of tension with the external benchmarks in terms of human
[00:12:58] external benchmarks in terms of human performance we're at about 90% using that
[00:13:01] performance we're at about 90% using that procedure of synthesized kind of
[00:13:03] procedure of synthesized kind of averaged f1 values and i would just note
[00:13:06] averaged f1 values and i would just note again that that's certainly conservative
[00:13:08] again that that's certainly conservative in that you know almost half of the
[00:13:10] in that you know almost half of the workers never disagreed with the
[00:13:12] workers never disagreed with the majority label so it is certainly within
[00:13:14] majority label so it is certainly within the capacity of individual humans to
[00:13:16] the capacity of individual humans to perform essentially perfectly on this
[00:13:18] perform essentially perfectly on this data set
[00:13:19] data set but 90% is nonetheless a good signpost
[00:13:21] but 90% is nonetheless a good signpost for us as we think about hill climbing
[00:13:23] for us as we think about hill climbing and launching subsequent rounds of
[00:13:25] and launching subsequent rounds of dynacent
[00:13:26] dynacent and here are some short examples and i
[00:13:28] and here are some short examples and i think they make the same point that our
[00:13:29] think they make the same point that our neutral category is more aligned with
[00:13:31] neutral category is more aligned with the semantics of what we mean when we
[00:13:32] the semantics of what we mean when we identify neutral sentences and less
[00:13:34] identify neutral sentences and less heterogeneous than you get from
[00:13:36] heterogeneous than you get from naturally occurring
[00:13:38] naturally occurring neutral sentences derived from star
[00:13:40] neutral sentences derived from star rating metadata and so forth so i'm
[00:13:42] rating metadata and so forth so i'm hopeful that this is a kind of positive
[00:13:44] hopeful that this is a kind of positive step toward getting true ternary
[00:13:45] step toward getting true ternary sentiment but we should be aware that
[00:13:47] sentiment but we should be aware that this label shift has happened in these
[00:13:49] this label shift has happened in these data sets
[00:13:51] data sets and the final thing i want to say is
[00:13:52] and the final thing i want to say is just to reiterate that if people do
[00:13:54] just to reiterate that if people do exciting work with this data set and
[00:13:56] exciting work with this data set and start to make real progress on the
[00:13:57] start to make real progress on the existing rounds that would be our cue to
[00:14:00] existing rounds that would be our cue to launch new rounds the dyna in dynacent
[00:14:03] launch new rounds the dyna in dynacent is that we would like to have an
[00:14:04] is that we would like to have an evolving benchmark not one that's static
[00:14:06] evolving benchmark not one that's static but rather responsive to progress that's
[00:14:08] but rather responsive to progress that's made in the field and the evolving needs
[00:14:10] made in the field and the evolving needs of people who are trying to develop
[00:14:12] of people who are trying to develop practical sentiment analysis systems so
[00:14:15] practical sentiment analysis systems so do let us know what kind of progress you
[00:14:17] do let us know what kind of progress you make and what you discover
Lecture 016
sst.py | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=_T5q5fIfzww
---
Transcript
[00:00:05] hello everyone welcome to part 5 in our
[00:00:07] hello everyone welcome to part 5 in our series on supervised sentiment analysis
[00:00:09] series on supervised sentiment analysis the focus of this screencast is on the
[00:00:11] the focus of this screencast is on the module sst.py which is included in the
[00:00:13] module sst.py which is included in the course code distribution
[00:00:15] course code distribution it contains a bunch of tools that will
[00:00:17] it contains a bunch of tools that will let you work fluidly i hope with the
[00:00:19] let you work fluidly i hope with the stanford sentiment tree bank and conduct
[00:00:21] stanford sentiment tree bank and conduct a lot of experiments
[00:00:23] a lot of experiments in service of completing the homework
[00:00:25] in service of completing the homework and also doing an original system entry
[00:00:27] and also doing an original system entry for the bake off i'd say that my goals
[00:00:29] for the bake off i'd say that my goals for the screencast are twofold first i
[00:00:31] for the screencast are twofold first i do just want to get you acquainted with
[00:00:32] do just want to get you acquainted with this code so that you can work with it
[00:00:34] this code so that you can work with it on the assignment in the bake off uh and
[00:00:36] on the assignment in the bake off uh and in addition i guess i'd like to convey
[00:00:38] in addition i guess i'd like to convey to you some best practices around
[00:00:40] to you some best practices around setting up a code infrastructure for a
[00:00:43] setting up a code infrastructure for a project say that will let you run a lot
[00:00:45] project say that will let you run a lot of experiments and really explore the
[00:00:47] of experiments and really explore the space of ideas that you have
[00:00:49] space of ideas that you have without introducing a lot of bugs or
[00:00:50] without introducing a lot of bugs or writing a lot of extra code
[00:00:53] writing a lot of extra code so let's begin we'll start with these
[00:00:54] so let's begin we'll start with these reader functions at the top and the
[00:00:56] reader functions at the top and the first cell here i just load in not only
[00:00:58] first cell here i just load in not only os that we can find our files but also
[00:01:00] os that we can find our files but also sst which is the module of interest
[00:01:03] sst which is the module of interest we set up this variable here that's a
[00:01:05] we set up this variable here that's a pointer to where the data set itself
[00:01:07] pointer to where the data set itself lives
[00:01:08] lives and then this function
[00:01:09] and then this function sst.train_reader will let you load in a
[00:01:12] sst.train_reader will let you load in a pandas data frame that contains the
[00:01:14] pandas data frame that contains the train set for the sst you'll notice that
[00:01:16] train set for the sst you'll notice that there are two optional keywords include
[00:01:18] there are two optional keywords include subtrees and ddo
[00:01:20] subtrees and ddo dupe will remove repeated examples and
[00:01:22] dupe will remove repeated examples and include subtrees as a flag that will let
[00:01:24] include subtrees as a flag that will let you include or exclude all of the sub
[00:01:27] you include or exclude all of the sub trees that the sst contains by default
[00:01:29] trees that the sst contains by default we'll include just the full examples but
[00:01:32] we'll include just the full examples but if you set include subtrees equals true
[00:01:35] if you set include subtrees equals true you get a much larger data set as we
[00:01:37] you get a much larger data set as we discussed in the screencast on the sst
[00:01:39] discussed in the screencast on the sst itself
[00:01:40] itself in cell 4 here i'm just giving you a
[00:01:42] in cell 4 here i'm just giving you a look at one random record from this so
[00:01:45] look at one random record from this so remember it is a pandas data frame but
[00:01:47] remember it is a pandas data frame but we can get it as a dictionary for a
[00:01:48] we can get it as a dictionary for a little bit of an easier look we've got
[00:01:50] little bit of an easier look we've got an example id we have the text of the
[00:01:52] an example id we have the text of the sentence the label which is either
[00:01:54] sentence the label which is either negative positive or neutral and then is
[00:01:57] negative positive or neutral and then is subtree is a flag on whether or not it's
[00:01:59] subtree is a flag on whether or not it's a full root level example or a
[00:02:01] a full root level example or a subconstituent of such an example
[00:02:04] subconstituent of such an example since we have loaded this in with
[00:02:06] since we have loaded this in with include subtrees equals false we get
[00:02:08] include subtrees equals false we get this distribution of labels here this is
[00:02:10] this distribution of labels here this is just a distribution of labels on the
[00:02:12] just a distribution of labels on the full examples
[00:02:13] full examples but of course as we change these flags
[00:02:15] but of course as we change these flags we would get very different counts down
[00:02:16] we would get very different counts down here
[00:02:18] here and then something comparable happens
[00:02:19] and then something comparable happens with the dev reader dev df from
[00:02:22] with the dev reader dev df from sst.dev_reader with a pointer to the home
[00:02:24] sst.dev_reader with a pointer to the home directory for the data as before
[00:02:26] directory for the data as before and here the subtree distinction and the
[00:02:28] and here the subtree distinction and the dedupe distinction those are much less
[00:02:30] dedupe distinction those are much less important because these data sets
[00:02:32] important because these data sets consist just of root level examples and
[00:02:34] consist just of root level examples and there are very few if any duplicate
[00:02:36] there are very few if any duplicate examples in those data sets
[00:02:40] now let's turn to feature functions
[00:02:42] now let's turn to feature functions we'll begin to build up a framework for
[00:02:44] we'll begin to build up a framework for doing supervised uh sentiment analysis
[00:02:46] doing supervised uh sentiment analysis and the starting point here is what i
[00:02:48] and the starting point here is what i call the feature function it's given in
[00:02:50] call the feature function it's given in unigrams_phi it takes in a text that
[00:02:53] unigrams_phi it takes in a text that is a string and what it does is return a
[00:02:56] is a string and what it does is return a dictionary that is essentially a count
[00:02:58] dictionary that is essentially a count dictionary over the unigrams in that
[00:03:01] dictionary over the unigrams in that string
[00:03:01] string as given by this very simple
[00:03:03] as given by this very simple tokenization scheme which just down
[00:03:05] tokenization scheme which just down cases all of the tokens and then splits
[00:03:08] cases all of the tokens and then splits on white space
[00:03:09] on white space so as an example text if i have nlu is
[00:03:12] so as an example text if i have nlu is enlightening space and then an
[00:03:13] enlightening space and then an exclamation mark and i call the
[00:03:16] exclamation mark and i call the the feature function on that string i
[00:03:18] the feature function on that string i get this count dictionary here which is
[00:03:20] get this count dictionary here which is just giving the number of times each
[00:03:22] just giving the number of times each token appears in that string according
[00:03:24] token appears in that string according to the feature function
[00:03:26] to the feature function i'd say it's really important when
[00:03:27] i'd say it's really important when you're working with the standard version
[00:03:29] you're working with the standard version of this framework doing hand-built
[00:03:30] of this framework doing hand-built feature functions that you just abide by
[00:03:33] feature functions that you just abide by the contract that all of these feature
[00:03:35] the contract that all of these feature functions take in strings and return
[00:03:38] functions take in strings and return dictionaries mapping strings to their
[00:03:40] dictionaries mapping strings to their counts or if you want to bools
[00:03:42] counts or if you want to bools or floats or something that we can make
[00:03:44] or floats or something that we can make use of when we're doing featurization
[00:03:48] the next up here is what i've called a
[00:03:50] the next up here is what i've called a model wrapper and this is going to look
[00:03:51] model wrapper and this is going to look a little bit trivial here but as you'll
[00:03:53] a little bit trivial here but as you'll see as we move through more advanced
[00:03:55] see as we move through more advanced methods in this unit especially the next
[00:03:56] methods in this unit especially the next screencast it's really nice to have
[00:03:58] screencast it's really nice to have these wrappers around the normal
[00:04:01] these wrappers around the normal essentially the fit function down here
[00:04:04] essentially the fit function down here so i'm going to make use of a scikit
[00:04:05] so i'm going to make use of a scikit linear model called logistic regression
[00:04:07] linear model called logistic regression very standard softmax cross-entropy
[00:04:09] very standard softmax cross-entropy classifier i've called my function fit
[00:04:12] classifier i've called my function fit_softmax_classifier and it takes in a
[00:04:14] fit_softmax_classifier and it takes in a supervised data set so a feature matrix
[00:04:17] supervised data set so a feature matrix and a list of labels
[00:04:19] and a list of labels and i set up my model down here and i've
[00:04:21] and i set up my model down here and i've used some of the keyword parameters
[00:04:22] used some of the keyword parameters there are many more for the scikit model
[00:04:25] there are many more for the scikit model and then the crucial thing is that i
[00:04:26] and then the crucial thing is that i call fit and return the model which is
[00:04:28] call fit and return the model which is now a trained model trained on this data
[00:04:31] now a trained model trained on this data set x y
[00:04:32] set x y it might look like all i've done is call
[00:04:34] it might look like all i've done is call fit on a model that i set up but as
[00:04:36] fit on a model that i set up but as you'll see it's nice to have a wrapper
[00:04:38] you'll see it's nice to have a wrapper function so that we can potentially do a
[00:04:40] function so that we can potentially do a lot more as part of this particular step
[00:04:43] lot more as part of this particular step in our experimental workflow
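In this spirit, a minimal model wrapper might look like the sketch below, using scikit-learn's LogisticRegression; the keyword choices are illustrative, not necessarily the ones in the course code:

```python
from sklearn.linear_model import LogisticRegression

def fit_softmax_classifier(X, y):
    # Take a feature matrix X and a list of labels y, fit a softmax
    # (multinomial logistic regression) classifier, and return the
    # trained model.
    mod = LogisticRegression(fit_intercept=True, max_iter=1000)
    mod.fit(X, y)
    return mod

# Tiny usage example with a toy feature matrix: the label tracks the
# first feature, so the problem is linearly separable.
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
y = ["pos", "neg", "pos", "neg"]
model = fit_softmax_classifier(X, y)
```

The wrapper looks trivial, but keeping the fit step behind a function boundary is what lets later screencasts swap in richer training procedures without changing the surrounding experiment code.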
[00:04:45] in our experimental workflow so now let's just bring all those things
[00:04:47] so now let's just bring all those things together into what is called sst
[00:04:49] together into what is called sst experiment which is like one stop
[00:04:51] experiment which is like one stop shopping for a complete experiment in
[00:04:53] shopping for a complete experiment in supervised sentiment analysis
[00:04:55] supervised sentiment analysis so we load in these two libraries we get
[00:04:57] so we load in these two libraries we get a pointer to our
[00:04:59] a pointer to our data set and then call sst.experiment
[00:05:02] data set and then call sst.experiment the first argument is the data set that
[00:05:04] the first argument is the data set that we'll be training on so
[00:05:06] the data set that we'll be training on so that's like train df from before
[00:05:08] that's like train df from before we have a feature function and a model
[00:05:10] we have a feature function and a model wrapper and then these other things are
[00:05:12] wrapper and then these other things are optional so if i leave assess data
[00:05:14] optional so if i leave assess data frames as none
[00:05:16] frames as none then it will do a random split on this
[00:05:18] then it will do a random split on this train reader according to train size if
[00:05:20] train reader according to train size if you do specify some data frames here a
[00:05:22] you do specify some data frames here a list of them and each one will be used
[00:05:24] list of them and each one will be used as a separate evaluation against the
[00:05:26] as a separate evaluation against the model that you train on this original
[00:05:28] model that you train on this original data
[00:05:29] data you can set the score function if you
[00:05:31] you can set the score function if you want our default is macro f1
[00:05:34] want our default is macro f1 and then we'll return to these two
[00:05:35] and then we'll return to these two options later verbose is just whether
[00:05:36] options later verbose is just whether you want to print some information and
[00:05:38] you want to print some information and vectorize is an option that you can turn
[00:05:40] vectorize is an option that you can turn on and off and you'll probably turn it
[00:05:42] on and off and you'll probably turn it off when you do deep learning
[00:05:44] off when you do deep learning experiments which we'll talk about later
[00:05:46] experiments which we'll talk about later in the unit
[00:05:47] in the unit the result of all that is a bunch of
[00:05:49] the result of all that is a bunch of information about your experiment stored
[00:05:51] information about your experiment stored in this variable and as because we had
[00:05:53] in this variable and as because we had verbose equals true you get a report
[00:05:55] verbose equals true you get a report here
[00:05:56] here and this is just a first chance to call
[00:05:58] and this is just a first chance to call out that throughout this course
[00:06:00] out that throughout this course essentially when we do classifier
[00:06:02] essentially when we do classifier experiments our primary metric is going
[00:06:04] experiments our primary metric is going to be the macro average f1 score this is
[00:06:07] to be the macro average f1 score this is useful for us because it gives equal
[00:06:09] useful for us because it gives equal weight to all the classes in our data
[00:06:11] weight to all the classes in our data regardless of their size which is
[00:06:14] regardless of their size which is typically reflecting our value that we
[00:06:15] typically reflecting our value that we care even about small classes we want to
[00:06:17] care even about small classes we want to do well even on the rare events in our
[00:06:19] do well even on the rare events in our space
[00:06:20] space and it's also perfectly balancing
[00:06:22] and it's also perfectly balancing precision and recall which is like a
[00:06:23] precision and recall which is like a good null hypothesis if we're not told
[00:06:26] good null hypothesis if we're not told ahead of time based on some other goal
[00:06:28] ahead of time based on some other goal whether we should favor precision or
[00:06:30] whether we should favor precision or recall
[00:06:31] recall so that all leads us to kind of favor as
[00:06:33] so that all leads us to kind of favor as a default this macro average f1 score is
[00:06:35] a default this macro average f1 score is an assessment of how the model did
[00:06:37] an assessment of how the model did here we've got 51.3
[00:06:41] the return value of sst.experiment as i
[00:06:44] the return value of sst.experiment as i said is a dictionary and it should
[00:06:45] said is a dictionary and it should package up for you all the objects and
[00:06:48] package up for you all the objects and information you would need to test the
[00:06:50] information you would need to test the model assess the model and do all kinds
[00:06:52] model assess the model and do all kinds of deep error analysis that is the
[00:06:54] of deep error analysis that is the philosophy here that you should if
[00:06:55] philosophy here that you should if possible capture as much information as
[00:06:58] possible capture as much information as you can about the experiment that you
[00:06:59] you can about the experiment that you ran
[00:07:00] ran in the service of being able to do
[00:07:01] in the service of being able to do subsequent downstream analysis of what
[00:07:04] subsequent downstream analysis of what happened so here i'm just giving an
[00:07:05] happened so here i'm just giving an example that we've got the model the
[00:07:07] example that we've got the model the feature function the trained data set
[00:07:09] feature function the trained data set whenever assessed data sets were used
[00:07:11] whenever assessed data sets were used and if that was a random split of the
[00:07:13] and if that was a random split of the train data that will be reflected in
[00:07:14] train data that will be reflected in these two variables the set of
[00:07:16] these two variables the set of predictions that you made about each one
[00:07:18] predictions that you made about each one of the assessed data sets the metrics
[00:07:20] of the assessed data sets the metrics you chose and the scores that you got
[00:07:23] you chose and the scores that you got and then if you do dive in like if you
[00:07:24] and then if you do dive in like if you look at train set it's a standard data
[00:07:27] look at train set it's a standard data set x is your feature space y is your
[00:07:30] set x is your feature space y is your labels vectorizer is something that i'll
[00:07:32] labels vectorizer is something that i'll return to that's an important part about
[00:07:34] return to that's an important part about how the internal workings of sst
[00:07:36] how the internal workings of sst experiment function and then you have
[00:07:38] experiment function and then you have the raw examples in case you need to do
[00:07:40] the raw examples in case you need to do some really serious human level error
[00:07:42] some really serious human level error analysis of the examples
[00:07:44] analysis of the examples as distinct from how they're represented
[00:07:46] as distinct from how they're represented in this feature space
[00:07:49] so here is just a slide that brings all
[00:07:51] so here is just a slide that brings all of those pieces together this is
[00:07:53] of those pieces together this is one-stop shopping for an entire
[00:07:55] one-stop shopping for an entire experiment we load in all our libraries
[00:07:57] experiment we load in all our libraries we have our pointer to the data and then
[00:07:59] we have our pointer to the data and then the ingredients are really a feature
[00:08:01] the ingredients are really a feature function and a model wrapper and that's
[00:08:04] function and a model wrapper and that's all you need in our default setting
[00:08:06] all you need in our default setting point it to the train data and it will
[00:08:08] point it to the train data and it will do its job and record all you would want
[00:08:10] do its job and record all you would want for this experiment i hope in this
[00:08:12] for this experiment i hope in this experiment variable here
[00:08:15] experiment variable here there's a final piece i want to return
[00:08:17] there's a final piece i want to return to that vectorizer variable that you saw
[00:08:19] to that vectorizer variable that you saw in the return values for sst experiment
[00:08:22] in the return values for sst experiment and that is making use of what inside
[00:08:24] and that is making use of what inside scikit-learn is called a DictVectorizer
[00:08:27] scikit-learn is called a DictVectorizer and this is a really nice convenience
[00:08:28] and this is really nice convenience function for translating from human
[00:08:31] function for translating from human representations of your data into
[00:08:33] representations of your data into representations that machine learning
[00:08:35] representations that machine learning models like to consume
[00:08:37] models like to consume so let me just walk through this example
[00:08:38] so let me just walk through this example here i've loaded the
[00:08:40] here i've loaded the vectorizer and i've got my train
[00:08:42] vectorizer and i've got my train features here in the mode that i just
[00:08:44] features here in the mode that i just showed you where here we have two
[00:08:46] showed you where here we have two examples and each one is represented by
[00:08:48] examples and each one is represented by our feature function as a dictionary
[00:08:51] our feature function as a dictionary that maps like words into their counts
[00:08:54] that maps like words into their counts you can be more flexible than that but
[00:08:56] you can be more flexible than that but that's like the most basic case that we
[00:08:57] that's like the most basic case that we consider
[00:08:58] consider when i set up my vectorizer in cell 3
[00:09:02] when i set up my vectorizer in cell 3 and then i call fit transform on this
[00:09:05] and then i call fit transform on this list of dictionaries
[00:09:06] list of dictionaries and the result here x train is a matrix
[00:09:10] and the result here x train is a matrix where each of the columns corresponds to
[00:09:12] where each of the columns corresponds to the keys in the dictionary representing
[00:09:15] the keys in the dictionary representing a unique feature
[00:09:16] a unique feature and the values are of course stored in
[00:09:18] and the values are of course stored in that in that column so this feature
[00:09:21] that in that column so this feature space here has been turned into um a
[00:09:24] space here has been turned into um a matrix that has two examples zero and
[00:09:26] matrix that has two examples zero and one there are a total of three features
[00:09:28] one there are a total of three features represented across our two examples a b
[00:09:31] represented across our two examples a b and c
[00:09:32] and c uh and you can see that the counts are
[00:09:34] uh and you can see that the counts are stored here so example zero has one for
[00:09:36] stored here so example zero has one for a
[00:09:37] a one for b and zero for c
[00:09:41] one for b and zero for c and example one has zero for a
[00:09:44] and example one has zero for a one for b and two for c
[00:09:47] one for b and two for c so that's recorded in the columns here
[00:09:49] so that's recorded in the columns here you can of course
[00:09:50] you can of course undertake this step by hand but it's a
[00:09:52] undertake this step by hand but it's a kind of error prone step and i'm just
[00:09:54] kind of error prone step and i'm just encouraging you to use DictVectorizer
[00:09:56] encouraging you to use DictVectorizer to handle it all and essentially map you
[00:09:59] to handle it all and essentially map you from this which is pretty human
[00:10:00] from this which is pretty human interpretable into this which is
[00:10:02] interpretable into this which is something your models like to consume
[00:10:06] something your models like to consume there's a next there's a second
[00:10:07] there's a next there's a second advantage here which is that if you use
[00:10:09] a DictVectorizer and you need to now do
[00:10:12] a DictVectorizer and you need to now do something at test time
[00:10:14] something at test time you can easily use your vectorizer to
[00:10:16] you can easily use your vectorizer to create feature spaces that are
[00:10:17] create feature spaces that are harmonized with what you saw in training
[00:10:19] harmonized with what you saw in training so as an example if my test features are
[00:10:22] so as an example if my test features are another pair of examples with a
[00:10:23] another pair of examples with a different character
[00:10:25] different character then i can call transform on the
[00:10:28] then i can call transform on the original train vectorizing from up here
[00:10:31] original train vectorizing from up here and it will translate that list of
[00:10:32] and it will translate that list of features
[00:10:33] features into a matrix now the important thing
[00:10:36] into a matrix now the important thing about what's happening here is that it's
[00:10:37] about what's happening here is that it's going to package the test features into
[00:10:40] going to package the test features into the original training space because of
[00:10:41] the original training space because of course those are the features that your
[00:10:43] course those are the features that your model recognizes those are the features
[00:10:45] model recognizes those are the features that you have weights for if you've
[00:10:47] that you have weights for if you've trained a model so it's important to
[00:10:48] trained a model so it's important to call transform at this stage and as an
[00:10:51] call transform at this stage and as an indication of one of the things that's
[00:10:52] indication of one of the things that's going to happen here is notice that in
[00:10:54] going to happen here is notice that in the test features my second example has
[00:10:56] the test features my second example has a brand new feature d
[00:10:58] a brand new feature d but d is not represented in the training
[00:11:01] but d is not represented in the training space we have no weights for it it's
[00:11:03] space we have no weights for it it's simply not part of that original
[00:11:05] simply not part of that original training data set
[00:11:06] training data set and so the result is that when we call
[00:11:08] and so the result is that when we call transform that feature is simply
[00:11:10] transform that feature is simply alighted
[00:11:11] alighted which is the desired behavior as we're
[00:11:13] which is the desired behavior as we're translating from training into testing
[00:11:16] translating from training into testing and it noticed that the dick vectorizer
[00:11:18] and it noticed that the dick vectorizer has simply handled that seamlessly for
[00:11:20] has simply handled that seamlessly for you
[00:11:21] you provided that you remember at the second
[00:11:23] provided that you remember at the second stage not to call fit transform that's
[00:11:25] stage not to call fit transform that's the number one gotcha for this interface
[00:11:28] the number one gotcha for this interface is that if you call fit transform a
[00:11:30] is that if you call fit transform a second time it will simply change the
[00:11:32] second time it will simply change the feature space into the one that is
[00:11:34] feature space into the one that is represented in your test features and
[00:11:36] represented in your test features and then everything will fall apart and your
[00:11:38] then everything will fall apart and your model
[00:11:39] model as trained from before will be unable to
[00:11:41] as trained from before will be unable to consume these new matrices that you've
[00:11:43] consume these new matrices that you've created but provided you remember that
[00:11:45] created but provided you remember that the rhythm is fit transform and then
[00:11:47] the rhythm is fit transform and then transform
[00:11:49] transform this should be really a nice set of
[00:11:50] this should be really a nice set of interfaces and of course this is what
[00:11:52] sst.experiment is doing by default under
[00:11:55] sst experiment is doing by default under the hood for you
Lecture 017
Hyperparameter Search | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=sO3gWU7y9Ws
---
Transcript
[00:00:04] hello everyone welcome to part six in
[00:00:06] hello everyone welcome to part six in our series on supervised sentiment
[00:00:07] our series on supervised sentiment analysis this screencast is going to
[00:00:09] analysis this screencast is going to cover two important methods in this
[00:00:11] cover two important methods in this space hyperparameter search and
[00:00:13] space hyperparameter search and classifier comparisons
[00:00:15] classifier comparisons so let's begin with hyper parameter
[00:00:17] so let's begin with hyper parameter search and first i'll just offer the
[00:00:19] search and first i'll just offer the rationale
[00:00:20] rationale let's say that the parameters of a model
[00:00:22] let's say that the parameters of a model are those whose values are learned as
[00:00:24] are those whose values are learned as part of optimizing the model itself so
[00:00:26] part of optimizing the model itself so for the classifiers we've been studying
[00:00:28] for the classifiers we've been studying the parameters are really just the
[00:00:30] the parameters are really just the weights that you learn on each of the
[00:00:32] weights that you learn on each of the individual features
[00:00:34] individual features and those are the things that are
[00:00:35] and those are the things that are directly targeted by the optimization
[00:00:37] directly targeted by the optimization process
[00:00:38] process the parameters of a model are typically
[00:00:40] the parameters of a model are typically pretty crisply defined because they kind
[00:00:42] pretty crisply defined because they kind of follow from the structure
[00:00:43] of follow from the structure mathematically of the model under
[00:00:45] mathematically of the model under investigation
[00:00:47] investigation much more diffuse are the hyper
[00:00:49] much more diffuse are the hyper parameters we can say the hyper
[00:00:50] parameters we can say the hyper parameters of a model are any settings
[00:00:52] parameters of a model are any settings that are outside of the optimization
[00:00:54] that are outside of the optimization process mentioned in one so examples
[00:00:57] process mentioned in one so examples from models we've seen are glove and lsa
[00:00:59] from models we've seen our glove and lsa have that dimensionality setting
[00:01:01] have that dimensionality setting the model itself gives you no guidance
[00:01:03] the model itself gives you no guidance about what to choose for the
[00:01:04] about what to choose for the dimensionality
[00:01:05] dimensionality and the dimensionality is not selected
[00:01:07] and the dimensionality is not selected as part of the optimization of the model
[00:01:09] as part of the optimization of the model itself you have to choose it via some
[00:01:12] itself you have to choose it via some external mechanism making it a hyper
[00:01:14] external mechanism making it a hyper parameter
[00:01:15] parameter and glove actually has two other
[00:01:17] and glove actually has two other additional prominent hyper parameters
[00:01:19] additional prominent hyper parameters xmax and alpha again those are not
[00:01:22] xmax and alpha again those are not optimized by the model you have to
[00:01:23] optimized by the model you have to select them via some external mechanism
[00:01:26] select them via some external mechanism and for the classifiers that we've been
[00:01:28] and for the classifiers that we've been studying you know we have regularization
[00:01:29] studying you know we have regularization terms those are classic hyper parameters
[00:01:32] terms those are classic hyper parameters if you have a deep classifier then the
[00:01:33] if you have a deep classifier then the hidden dimensionalities in the model
[00:01:35] hidden dimensionalities in the model could also be considered hyper
[00:01:37] could also be considered hyper parameters learning rates um you know
[00:01:39] parameters learning rates um you know any core feature of the optimization
[00:01:41] any core feature of the optimization method itself could be considered hyper
[00:01:43] method itself could be considered hyper parameters and even things that might be
[00:01:45] parameters and even things that might be considered kind of architectural like
[00:01:48] considered kind of architectural like the activation function in the deep
[00:01:49] the activation function in the deep classifier you might think of it as kind
[00:01:52] classifier you might think of it as kind of an intrinsic part of the model that
[00:01:54] of an intrinsic part of the model that you're evaluating but since it's an easy
[00:01:56] you're evaluating but since it's an easy choice point for us at this point you'll
[00:01:58] choice point for us at this point you'll be tempted to explore a few different
[00:02:00] be tempted to explore a few different options for that particular
[00:02:01] options for that particular architectural choice and in that way it
[00:02:04] architectural choice and in that way it could become a hyper parameter and at
[00:02:06] could become a hyper parameter and at this point even the optimization methods
[00:02:08] this point even the optimization methods could also emerge as a hyper parameter
[00:02:11] could also emerge as a hyper parameter that you would like to do search over
[00:02:14] that you would like to do search over and so forth and so on you should
[00:02:15] and so forth and so on you should probably take a fairly expansive view of
[00:02:18] probably take a fairly expansive view of what the hyper parameters of your model
[00:02:19] what the hyper parameters of your model are if you can now here's the crux of
[00:02:23] are if you can now here's the crux of the argument hyper parameter
[00:02:25] the argument hyper parameter optimization is crucial to building a
[00:02:27] optimization is crucial to building a persuasive argument fundamentally for
[00:02:29] persuasive argument fundamentally for any kind of comparison we make we want
[00:02:31] any kind of comparison we make we want to put every model in its very best
[00:02:34] to put every model in its very best light
[00:02:35] light we could take it for granted that for
[00:02:36] we could take it for granted that for any sufficiently complicated model
[00:02:38] any sufficiently complicated model there's some setting of its hyper
[00:02:40] there's some setting of its hyper parameters that's kind of degenerate and
[00:02:42] parameters that's kind of degenerate and would make the model look very bad
[00:02:44] would make the model look very bad and so you certainly wouldn't want to do
[00:02:46] and so you certainly wouldn't want to do any comparisons against that really
[00:02:48] any comparisons against that really problematic set of choices rather what
[00:02:50] problematic set of choices rather what we want to do is say let's put all the
[00:02:52] we want to do is say let's put all the models in their best light by choosing
[00:02:54] models in their best light by choosing optimal hyper parameters for them to the
[00:02:56] optimal hyper parameters for them to the best of our ability and then we can say
[00:02:58] best of our ability and then we can say that one model is better than the other
[00:03:00] that one model is better than the other if it emerges victorious in that very
[00:03:02] if it emerges victorious in that very rigorous
[00:03:04] rigorous setting
[00:03:05] setting and the final thing i'll say about this
[00:03:06] and the final thing i'll say about this methodologically is that of course all
[00:03:09] methodologically is that of course all hyperparameter tuning must be done only
[00:03:11] hyperparameter tuning must be done only on train and development data you
[00:03:13] on train and development data you consider you can consider that all fair
[00:03:16] consider you can consider that all fair game in terms of using it however you
[00:03:18] game in terms of using it however you want to choose the optimal hyper
[00:03:19] want to choose the optimal hyper parameters but once that choice is set
[00:03:22] parameters but once that choice is set it is fixed and those are the parameters
[00:03:24] it is fixed and those are the parameters that you use at test time and that is
[00:03:26] that you use at test time and that is the fundamental evaluation that you
[00:03:28] the fundamental evaluation that you would use for any kind of model
[00:03:29] would use for any kind of model comparison and at no point should you be
[00:03:32] comparison and at no point should you be tuning these hyper parameters on the
[00:03:34] tuning these hyper parameters on the test data itself that would be
[00:03:36] test data itself that would be completely illegitimate
[00:03:39] completely illegitimate i hope we've made it really easy to do
[00:03:41] i hope we've made it really easy to do this kind of hyper parameter search in
[00:03:42] this kind of hyper parameter search in the context of the work you're doing for
[00:03:44] the context of the work you're doing for supervised sentiment analysis here are
[00:03:46] supervised sentiment analysis here are some code snippets that show how that
[00:03:48] some code snippets that show how that can happen
[00:03:49] can happen i load in my libraries i have a pointer
[00:03:51] i load in my libraries i have a pointer to our sentiment data and here i have a
[00:03:53] to our sentiment data and here i have a fixed feature function which is just a
[00:03:55] fixed feature function which is just a unigram stemmer feature function
[00:03:57] unigram stemmer feature function the change happens inside the model
[00:04:00] the change happens inside the model wrapper
[00:04:01] wrapper whereas before essentially all we did is
[00:04:03] whereas before essentially all we did is set up a logistic regression model and
[00:04:05] set up a logistic regression model and then call its fit method
[00:04:07] then call its fit method here we set up that model but also
[00:04:09] here we set up that model but also established a grid of hyper parameters
[00:04:11] established a grid of hyper parameters these are different choice points for
[00:04:13] these are different choice points for this logistic regression model like
[00:04:15] this logistic regression model like whether or not i have a bias term
[00:04:17] whether or not i have a bias term the value of the regularization
[00:04:19] the value of the regularization parameter and even the algorithm used
[00:04:21] parameter and even the algorithm used for regularization itself l1 or l2
[00:04:25] for regularization itself l1 or l2 the model will explore the full grid of
[00:04:27] these options it's going to do five-fold cross-validation so test each one five
[00:04:29] cross-validation so test each one five times on different splits of the data
[00:04:31] cross-validation so test one each five times on different splits of the data
[00:04:34] times on different splits of the data and in that very long search process it
[00:04:36] and in that very long search process it will find what it takes to be the best
[00:04:39] will find what it takes to be the best setting of all of these hyper parameters
[00:04:40] setting of all of these hyper parameters of all the combinations that can
[00:04:42] of all the combinations that can logically be
[00:04:44] logically be set
[00:04:45] set and that is the model that we finally
[00:04:47] and that is the model that we finally return here right so now you can see the
[00:04:49] return here right so now you can see the value of having a wrapper around these
[00:04:50] value of having a wrapper around these fit methods because then i could do all
[00:04:52] fit methods because then i could do all of this extra work without changing the
[00:04:55] of this extra work without changing the interface to sst experiment at all the
[00:04:57] interface to sst experiment at all the experiments look just as they did in the
[00:05:00] experiments look just as they did in the previous mode it's just that they will
[00:05:01] previous mode it's just that they will take a lot longer because you are
[00:05:03] take a lot longer because you are running dozens and dozens of experiments
[00:05:05] running dozens and dozens of experiments as part of this exhaustive search of all
[00:05:08] as part of this exhaustive search of all the possible settings
[00:05:10] okay part two is classifier comparison let's
[00:05:13] okay part two is classifier comparison let's again begin with the rationale suppose
[00:05:15] again begin with the rationale suppose you've assessed a baseline model b
[00:05:17] you've assessed a baseline model b and your favorite model m and your
[00:05:19] and your favorite model m and your chosen assessment metric favors m right
[00:05:22] chosen assessment metric favors m right and this seems like a little victory for
[00:05:24] and this seems like a little victory for you but you should still ask yourself is
[00:05:26] you but you should still ask yourself is m really better right now if the
[00:05:29] m really better right now if the difference between b and m is clearly of
[00:05:31] difference between b and m is clearly of practical significance then you might
[00:05:33] practical significance then you might not need to do anything beyond
[00:05:34] not need to do anything beyond presenting the numbers right if each one
[00:05:37] presenting the numbers right if each one of your classification decisions
[00:05:38] of your classification decisions corresponds to something really
[00:05:40] corresponds to something really important in the world and your
[00:05:41] important in the world and your classifier makes thousands more good
[00:05:43] classifier makes thousands more good predictions than the other model that
[00:05:45] predictions than the other model that might be enough for the argument but
[00:05:48] might be enough for the argument but even in that situation you might ask
[00:05:50] even in that situation you might ask whether there's variation in how these
[00:05:51] whether there's variation in how these two models b and m perform did you just
[00:05:53] two models b and m perform did you just get lucky when you saw what looked like
[00:05:55] get lucky when you saw what looked like a practical difference and with minor
[00:05:57] a practical difference and with minor changes to the initialization of
[00:05:59] changes to the initialization of something you would see very different
[00:06:00] something you would see very different outcomes
[00:06:01] outcomes if the answer is possibly yes then you
[00:06:03] if the answer is possibly yes then you might still want to do some kind of
[00:06:05] might still want to do some kind of classifier comparison
[00:06:05] classifier comparison now there's this nice paper by demšar
[00:06:08] now there's this nice paper by demšar 2006 that advises using the wilcoxon
[00:06:12] 2006 that advises using the wilcoxon signed rank test for situations in which
[00:06:15] signed rank test for situations in which you can afford to repeatedly assess your
[00:06:17] you can afford to repeatedly assess your two models b m on different train test
[00:06:20] two models b m on different train test splits right and we'll talk and later in
[00:06:23] splits right and we'll talk and later in the term about the precise rationale for
[00:06:25] the term about the precise rationale for this but the idea is just that you would
[00:06:27] this but the idea is just that you would do a lot of experiments on slightly
[00:06:28] do a lot of experiments on slightly different views of your data and kind of
[00:06:31] different views of your data and kind of average across them to get a sense for
[00:06:33] average across them to get a sense for how the two models
[00:06:35] how the two models compare with each other
[00:06:38] compare with each other in situations where you can't repeatedly
[00:06:40] in situations where you can't repeatedly assess b m
[00:06:41] assess b m mars test is a reasonable alternative it
[00:06:45] mars test is a reasonable alternative it operates on the confusion matrices
[00:06:46] operates on the confusion matrices produced by the two models testing the
[00:06:48] produced by the two models testing the null hypothesis that the two models have
[00:06:50] null hypothesis that the two models have the same error rate
[00:06:53] the same error rate the reason you might opt for mcnemar's
[00:06:54] the reason you might opt for mcneemar's test is for example if you're doing a
[00:06:56] test is for example if you're doing a deep learning experiment where all the
[00:06:58] deep learning experiment where all the models take a few weeks to optimize then
[00:07:01] models take a few weeks to optimize then of course you can't probably afford to
[00:07:02] of course you can't probably afford to do dozens and dozens of experiments with
[00:07:05] do dozens and dozens of experiments with each one so you might be compelled to
[00:07:05] each one so you might be compelled to use mcnemar's based on one single run
[00:07:07] use mcnemar's based on one single run of the two models it's a much weaker
[00:07:11] of the two models it's a much weaker argument because of course precisely the
[00:07:13] argument because of course precisely the point is that we might see variation
[00:07:15] point is that we might see variation across different runs and mcnemar's is
[00:07:17] across different runs and mcnemar's is not really going to grapple with that in
[00:07:19] not really going to grapple with that in the way that the wilcoxon signed-rank
[00:07:21] the way that the wilcoxon signed-rank test will
[00:07:22] test will but this is arguably better than nothing
[00:07:24] but this is arguably better than nothing in most situations so you might default
[00:07:26] in most situations so you might default to mcnee mars
[00:07:27] to mcnee mars if the wilcoxon is too expensive
[00:07:30] if the wilcoxon is too expensive and let me just show you how easy this
[00:07:32] and let me just show you how easy this can be in the context of our code base
[00:07:34] can be in the context of our code base so by way of illustration what we're
[00:07:36] so by way of illustration what we're essentially going to do is compare
[00:07:38] essentially going to do is compare logistic regression and naive bayes i
[00:07:41] logistic regression and naive bayes i encourage you when you're doing these
[00:07:43] encourage you when you're doing these comparisons to have only one point of
[00:07:46] comparisons to have only one point of variation so we're going to fix the data
[00:07:48] variation so we're going to fix the data and we're going to fix the feature
[00:07:49] and we're going to fix the feature function and compare only the model
[00:07:52] function and compare only the model architectures
[00:07:53] architectures you could separately say i'm going to
[00:07:55] you could separately say i'm going to have a single fixed model like logistic
[00:07:57] have a single fixed model like logistic regression and explore a few different
[00:07:59] regression and explore a few different feature functions
[00:08:00] feature functions but i would advise against exploring two
[00:08:03] but i would advise against exploring two different feature functions as combined
[00:08:04] different feature functions as combined with two different models because when
[00:08:06] with two different models because when you observe differences in the end you
[00:08:08] you observe differences in the end you won't be sure whether that was caused by
[00:08:10] won't be sure whether that was caused by the model choice or by the feature
[00:08:12] the model choice or by the feature functions we want to kind of isolate
[00:08:14] functions we want to kind of isolate these things and do systematic
[00:08:16] these things and do systematic comparisons
[00:08:18] comparisons so here i'm going to do a systematic
[00:08:19] so here i'm going to do a systematic comparison of logistic regression and
[00:08:21] comparison of logistic regression and naive bayes on the sst using the
[00:08:24] naive bayes on the sst using the wilcoxon test and here's the setup the
[00:08:26] wilcoxon test and here's the setup the function is sst.compare_models
[00:08:29] function is sst compare models i point it to my training data
[00:08:31] i point it to my training data you can have two feature functions but
[00:08:33] you can have two feature functions but in that case you should have just one
[00:08:35] in that case you should have just one model wrapper here i've got one feature
[00:08:37] model wrapper here i've got one feature function used for both models and i'll
[00:08:39] function used for both models and i'll have these two different wrappers
[00:08:40] have these two different wrappers corresponding to the evaluation that i
[00:08:42] corresponding to the evaluation that i want to do of those two model classes
[00:08:45] want to do of those two model classes i'm going to use the wilcoxon as advised
[00:08:47] i'm going to use the wilcoxon as advised i'll do 10 trials of each
[00:08:50] on a train size of 70% of the data and
[00:08:53] on a train size of 70% of the data and as always in this setting i'll use the
[00:08:54] as always in this setting i'll use the macro f1 as my score
[00:08:57] macro f1 as my score so what this will do internally is run
[00:08:59] so what this will do internally is run 10
[00:09:00] 10 10 experiments on different train test
[00:09:02] 10 experiments on different train test splits for each one of these models that
[00:09:04] splits for each one of these models that gives us a score vector
[00:09:06] gives us a score vector uh 10 you know 10 numbers for each model
[00:09:09] uh 10 you know 10 numbers for each model and then what the wilcoxon is doing is
[00:09:10] and then what the wilcoxon is doing is comparing whether the or make assessing
[00:09:12] comparing whether the or make assessing whether the means of those two score
[00:09:14] whether the means of those two score vectors are statistically significantly
[00:09:17] vectors are statistically significantly different and here it looks like we have
[00:09:19] different and here it looks like we have some evidence that we can reject the
[00:09:20] some evidence that we can reject the null hypothesis that these models are
[00:09:23] null hypothesis that these models are identical
[00:09:24] identical which is presumably the argument that we
[00:09:26] which is presumably the argument that we were trying to build
[00:09:28] were trying to build now of course that's very expensive
[00:09:29] now of course that's very expensive because we had to run 20 experiments in
[00:09:31] because we had to run 20 experiments in this situation and of course you could
[00:09:33] this situation and of course you could run many more if you were also doing
[00:09:35] run many more if you were also doing hyper parameter tuning as part of your
[00:09:37] hyper parameter tuning as part of your experimental workflow
[00:09:39] experimental workflow so in situations where you can't afford
[00:09:41] so in situations where you can't afford to do something that involves so many
[00:09:43] to do something that involves so many experiments as i said you could default
[00:09:45] experiments as i said you could default to make new mars
[00:09:47] to make new mars that is included in utils.mcneimar
[00:09:50] that is included in utils.mcneimar and the return values of ssd experiment
[00:09:53] and the return values of ssd experiment will give you all the information you
[00:09:54] will give you all the information you need essentially for mcneemars you need
[00:09:57] need essentially for mcneemars you need the actual gold vector of labels and
[00:09:59] the actual gold vector of labels and then the two vectors of predictions for
[00:10:01] then the two vectors of predictions for each one of your experiments so that's a
[00:10:03] each one of your experiments so that's a simple alternative in the situation in
[00:10:04] simple alternative in the situation in which wilcoxon was just too expensive
Lecture 018
Feature Representation | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=L9ajfq6PJBI
---
Transcript
[00:00:04] welcome everyone this is part seven in
[00:00:06] welcome everyone this is part seven in our series on supervised sentiment
[00:00:08] our series on supervised sentiment analysis the focus of this screencast is
[00:00:10] analysis the focus of this screencast is on feature representation of data there
[00:00:12] on feature representation of data there are really two things i'd like to do first
[00:00:14] really two things i'd like to do first just explore some ideas for effective
[00:00:16] just explore some ideas for effective feature representation in the context of
[00:00:18] feature representation in the context of sentiment analysis and second cover some
[00:00:20] sentiment analysis and second cover some of the core technical concepts that
[00:00:22] of the core technical concepts that surround feature representation that you
[00:00:24] surround feature representation that you do well to have in mind as you write new
[00:00:26] do well to have in mind as you write new feature functions and optimize
[00:00:28] feature functions and optimize models let's begin in a familiar place
[00:00:31] models let's begin in a familiar place which is n-gram feature functions to
[00:00:34] which is n-gram feature functions to this point in the series of screencasts
[00:00:35] this point in the series of screencasts i've been just focusing on unigram
[00:00:38] i've been just focusing on unigram feature functions that's also called the
[00:00:39] feature functions that's also called the bag of words model and we can easily
[00:00:42] bag of words model and we can easily generalize that idea to bigrams and
[00:00:45] generalize that idea to bigrams and trigrams and so forth
[00:00:47] trigrams and so forth all of these schemes will be heavily
[00:00:48] all of these schemes will be heavily dependent on the tokenizer that you've
[00:00:50] dependent on the tokenizer that you've chosen because of course in the end for
[00:00:51] chosen because of course in the end for every example we represent we are simply
[00:00:54] every example we represent we are simply tokenizing that example and then
[00:00:55] tokenizing that example and then counting the tokens in that example
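As a sketch of such a tokenize-then-count scheme (the function name and toy sentence here are illustrative, not the course's code):

```python
from collections import Counter

def ngrams_phi(text, n=2):
    # Tokenize naively on whitespace, then count all n-grams of length n.
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

feats = ngrams_phi("this movie was not good", n=2)
print(feats[("not", "good")])  # the bigram ("not", "good") occurs once
```

Setting `n=1` recovers the bag-of-words counts; everything downstream depends on how `text.lower().split()` tokenizes.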
[00:00:59] counting the tokens in that example this can be combined of course with
[00:01:01] this can be combined of course with pre-processing steps in part two in this
[00:01:03] pre-processing steps in part two in this series i covered the pre-processing idea
[00:01:05] series i covered the pre-processing idea of NEG-marking which is essentially to
[00:01:08] of NEG-marking which is essentially to mark words as they appear in a heuristic
[00:01:11] mark words as they appear in a heuristic way in the scope of negation morphemes
[00:01:13] way in the scope of negation morphemes as a way of indicating that for example
[00:01:16] as a way of indicating that for example good is positive in normal context but
[00:01:19] good is positive in normal context but might become negative when it is in the
[00:01:21] might become negative when it is in the scope of a negation like not or never
[00:01:24] scope of a negation like not or never we would handle that as a pre-processing
[00:01:26] we would handle that as a pre-processing step and that would just create more
[00:01:27] step and that would just create more unigrams that our tokenizer would turn
[00:01:30] unigrams that our tokenizer would turn into tokens and then would be counted by
[00:01:33] into tokens and then would be counted by these feature representation schemes
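A minimal version of that heuristic marks every token after a negation trigger until the next clause-level punctuation mark. The trigger list and scope rule below are simplifying assumptions (real schemes use longer lists):

```python
import re

# Hypothetical trigger list; real implementations use a longer one.
NEGATIONS = {"not", "no", "never", "n't"}

def neg_mark(tokens):
    """Append _NEG to each token in the scope of a negation, where scope
    heuristically runs until the next clause-level punctuation mark."""
    out, in_scope = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,;:!?]", tok):
            in_scope = False
            out.append(tok)
        elif tok in NEGATIONS:
            in_scope = True
            out.append(tok)
        else:
            out.append(tok + "_NEG" if in_scope else tok)
    return out

print(neg_mark(["this", "was", "not", "good", ",", "but", "fine"]))
```

The marked tokens like `good_NEG` then flow through the counting scheme as ordinary unigrams, exactly as described above.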
[00:01:37] these feature representation schemes a hallmark of these feature approaches
[00:01:39] a hallmark of these feature approaches is that they create very large very
[00:01:41] is that they create very large very sparse feature representations you are
[00:01:43] sparse feature representations you are going to have a column in your feature
[00:01:45] going to have a column in your feature representation for every single word
[00:01:47] representation for every single word that appears anywhere in your training
[00:01:49] that appears anywhere in your training data
[00:01:50] data and another important thing to keep in
[00:01:51] and another important thing to keep in mind about this approach is that by and
[00:01:53] mind about this approach is that by and large they will fail to directly model
[00:01:55] large they will fail to directly model relationships between features unless
[00:01:57] relationships between features unless you make some special effort to
[00:01:59] you make some special effort to effectively interact these features
[00:02:02] effectively interact these features all you'll be doing is studying their
[00:02:03] all you'll be doing is studying their distribution with respect to the class
[00:02:05] distribution with respect to the class labels that you have and it's very
[00:02:07] labels that you have and it's very unlikely that you'll recover in any deep
[00:02:10] unlikely that you'll recover in any deep way the kind of underlying synonymy of
[00:02:12] way the kind of underlying synonymy of words like couch and sofa for example
[00:02:15] words like couch and sofa for example and this is a shortcoming that we might
[00:02:17] and this is a shortcoming that we might want to address as we move into
[00:02:19] want to address as we move into distributed representations of examples
[00:02:21] distributed representations of examples and deep learning
[00:02:24] so for a first technical concept i would
[00:02:27] so for a first technical concept i would like to just distinguish between feature
[00:02:29] like to just distinguish between feature functions and features and to do this
[00:02:31] functions and features and to do this i've just got a fully worked out example
[00:02:33] i've just got a fully worked out example here using tools from scikit-learn that
[00:02:35] here using tools from scikit-learn that i think will make the importance of this
[00:02:37] i think will make the importance of this distinction really clear and
[00:02:39] distinction really clear and concrete so in cell 1 i've just loaded a
[00:02:41] concrete so in cell 1 i've just loaded a bunch of libraries in cell 2 i've got my
[00:02:44] bunch of libraries in cell 2 i've got my standard kind of lazy unigrams feature
[00:02:47] standard kind of lazy unigrams feature function which is taking in a string
[00:02:49] function which is taking in a string text down casing it and then simply
[00:02:51] text down casing it and then simply splitting on white space and then the
[00:02:53] splitting on white space and then the counter here is just turning that into
[00:02:55] counter here is just turning that into a count dictionary mapping each token to
[00:02:57] a count dictionary mapping each token to the number of times it appears in this
[00:02:59] the number of times it appears in this example according to our tokenizer
[00:03:02] example according to our tokenizer that'll be fine for now in cell 3 i have
[00:03:04] that'll be fine for now in cell 3 i have a tiny little corpus that has just two
[00:03:07] a tiny little corpus that has just two words a and b
[00:03:09] words a and b in cell 4 i create a list of
[00:03:11] in cell 4 i create a list of dictionaries by calling unigrams_phi on
[00:03:14] dictionaries by calling unigrams_phi on each of the texts in my corpus here so
[00:03:16] each of the texts in my corpus here so that gives me a list of count
[00:03:18] that gives me a list of count dictionaries
[00:03:20] dictionaries in five i use a dict vectorizer as
[00:03:22] in five i use a dict vectorizer as covered in a previous screencast and
[00:03:24] covered in a previous screencast and what that's going to do is when i call
[00:03:25] what that's going to do is when i call fit transform on my list of feature
[00:03:28] fit transform on my list of feature dictionaries it will turn it into a
[00:03:30] dictionaries it will turn it into a matrix which is the input that all of
[00:03:32] matrix which is the input that all of these scikit machine learning models
[00:03:34] these scikit machine learning models expect for their training data and in
[00:03:36] expect for their training data and in cell 7 i've just given you what i hope
[00:03:38] cell 7 i've just given you what i hope is a pretty intuitive view of that
[00:03:40] is a pretty intuitive view of that design matrix underlyingly it's just a
[00:03:43] design matrix underlyingly it's just a numpy array but if we use pandas we can
[00:03:45] numpy array but if we use pandas we can see that the columns here correspond to
[00:03:48] see that the columns here correspond to the names of each one of the features
[00:03:50] the names of each one of the features because we have just two word types in
[00:03:52] because we have just two word types in our corpus there are two columns a and b
[00:03:54] our corpus there are two columns a and b and each of the rows corresponds to an
[00:03:56] and each of the rows corresponds to an example from our corpus and so you can
[00:03:58] example from our corpus and so you can see that our first example has been
[00:04:00] see that our first example has been reduced to a representation that has
[00:04:02] reduced to a representation that has three in its first dimension and zero in
[00:04:04] three in its first dimension and zero in its second corresponding to the fact
[00:04:06] its second corresponding to the fact that it has three a's and no b's
[00:04:08] that it has three a's and no b's example two a a b is represented as a
[00:04:11] example two a a b is represented as a two in the first column and a one in the
[00:04:13] two in the first column and a one in the second column
[00:04:14] second column and so forth so that's a first
[00:04:16] and so forth so that's a first distinction we have this feature
[00:04:18] distinction we have this feature function here which is like a factory
[00:04:20] function here which is like a factory and depending on the data that come in
[00:04:22] and depending on the data that come in for our corpus we're going to get very
[00:04:24] for our corpus we're going to get very different features which correspond to
[00:04:26] different features which correspond to each one of the columns in this feature
[00:04:29] each one of the columns in this feature representation matrix
[00:04:31] representation matrix let's continue this a little bit and
[00:04:33] let's continue this a little bit and think about how this actually interacts
[00:04:34] think about how this actually interacts with the optimization process so in cell
[00:04:37] with the optimization process so in cell 7 here i've just repeated that previous
[00:04:39] 7 here i've just repeated that previous matrix for reference
[00:04:41] matrix for reference in cell 8 i have the class labels for
[00:04:43] in cell 8 i have the class labels for our four examples and you can see there
[00:04:45] our four examples and you can see there are three distinct classes c1 c2 and c3
[00:04:49] are three distinct classes c1 c2 and c3 i set up a logistic regression model
[00:04:50] i set up a logistic regression model although that's not especially important
[00:04:52] although that's not especially important it's just a useful illustration
[00:04:54] it's just a useful illustration and i call fit on my pair xy that is my
[00:04:57] and i call fit on my pair xy that is my feature representations and my labels
[00:04:59] feature representations and my labels and that's the optimization process as
[00:05:02] and that's the optimization process as part of that and as a convention for
[00:05:04] part of that and as a convention for scikit the optimization process creates
[00:05:06] scikit the optimization process creates this new attribute coef_ and
[00:05:08] this new attribute coef_ and this new attribute classes_
[00:05:11] this new attribute classes_ this coef_ here these are the weights
[00:05:13] this coef_ here these are the weights that we learned as part of the
[00:05:15] that we learned as part of the optimization process and of course
[00:05:17] optimization process and of course classes_ corresponds to the classes that
[00:05:19] classes_ corresponds to the classes that it inferred from the labels y that we input
[00:05:22] it inferred from the labels y that we input and here i'm just using a pandas data
[00:05:23] and here i'm just using a pandas data frame again to try to make this
[00:05:25] frame again to try to make this intuitive it's really just a numpy array
[00:05:27] intuitive it's really just a numpy array this coef_ object here but you can see
[00:05:29] this coef_ object here but you can see that the resulting matrix has a row for
[00:05:32] that the resulting matrix has a row for each one of our classes and a column for
[00:05:35] each one of our classes and a column for each one of our features
[00:05:37] each one of our features and that's a useful reminder that what
[00:05:39] and that's a useful reminder that what the optimization process for models like
[00:05:41] the optimization process for models like this is actually doing
[00:05:42] this is actually doing is learning a weight that associates
[00:05:45] is learning a weight that associates class feature name pairs
[00:05:47] class feature name pairs with a weight right so it's not just
[00:05:49] with a weight right so it's not just that we learn individual weights for
[00:05:50] that we learn individual weights for features but rather we learn them with
[00:05:52] features but rather we learn them with respect to each one of the classes and
[00:05:54] respect to each one of the classes and that's a hallmark of optimization for
[00:05:57] that's a hallmark of optimization for multi-class models like this one
[00:05:59] multi-class models like this one and then in cell 12 i've just shown you
[00:06:01] and then in cell 12 i've just shown you that you can actually use the coef_ and
[00:06:03] that you can actually use the coef_ and this other bias term
[00:06:05] this other bias term intercept_
[00:06:07] intercept_ to recreate the predictions of the model
[00:06:09] to recreate the predictions of the model all you're doing is multiplying examples
[00:06:11] all you're doing is multiplying examples by those coefficients and adding in the
[00:06:14] by those coefficients and adding in the bias term and this matrix here is
[00:06:16] bias term and this matrix here is identical to what you get if you
[00:06:18] identical to what you get if you simply directly call predict_proba for
[00:06:21] simply directly call predict_proba for predicted probabilities on your examples
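That recreation can be sketched as follows. The softmax step at the end is left implicit in the spoken description, and the toy data here is an assumption standing in for the screencast's cells:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy design matrix and labels with three distinct classes.
X = np.array([[3.0, 0.0], [2.0, 1.0], [1.0, 2.0], [0.0, 3.0]])
y = ["c1", "c1", "c2", "c3"]

mod = LogisticRegression().fit(X, y)

# Multiply examples by the coefficients, add in the bias term, then
# softmax to turn the per-class scores into probabilities:
scores = X @ mod.coef_.T + mod.intercept_
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Identical to calling predict_proba directly:
assert np.allclose(probs, mod.predict_proba(X))
```

Note that `coef_` has one row per class and one column per feature, so `X @ coef_.T` yields one score per (example, class) pair, just as the lecture's matrix view suggests.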
[00:06:28] let's turn back to what we're trying to
[00:06:29] let's turn back to what we're trying to do to create good models having those
[00:06:31] do to create good models having those ideas in mind so let's just cover a few
[00:06:33] ideas in mind so let's just cover a few other ideas for hand-built feature
[00:06:35] other ideas for hand-built feature functions that i think could be
[00:06:36] functions that i think could be effective for sentiment so of course
[00:06:38] effective for sentiment so of course we'd have lexicon derived features i
[00:06:40] we'd have lexicon derived features i earlier showed you a bunch of different
[00:06:41] earlier showed you a bunch of different lexicons and that could be used to group
[00:06:44] lexicons and that could be used to group our unigrams so we could have these
[00:06:45] our unigrams so we could have these feature functions work in conjunction
[00:06:47] feature functions work in conjunction with a bag of words or bag of n-grams
[00:06:49] with a bag of words or bag of n-grams model or we could use them to replace
[00:06:52] model or we could use them to replace that model and develop a sparser feature
[00:06:54] that model and develop a sparser feature representation space
[00:06:56] representation space we could also do the negation marking
[00:06:58] we could also do the negation marking that i mentioned before and we could
[00:07:00] that i mentioned before and we could generalize that idea so many things in
[00:07:02] generalize that idea so many things in language take scope in a way that will
[00:07:04] language take scope in a way that will affect the semantics of words that are
[00:07:06] affect the semantics of words that are in their scope so another classical
[00:07:08] in their scope so another classical example besides negation is these
[00:07:10] example besides negation is these modal adverbs like quite possibly or
[00:07:13] modal adverbs like quite possibly or totally we might have the idea that they
[00:07:15] totally we might have the idea that they are modulating the extent to which the
[00:07:17] are modulating the extent to which the speaker is committed to masterpiece or
[00:07:19] speaker is committed to masterpiece or amazing in this case and keeping track
[00:07:21] amazing in this case and keeping track of that semantic association with simple
[00:07:23] of that semantic association with simple underscore marking of some kind might be
[00:07:26] underscore marking of some kind might be useful for giving our model a chance to
[00:07:28] useful for giving our model a chance to see that these unigrams are different
[00:07:30] see that these unigrams are different depending on their environment
[00:07:33] depending on their environment we could also have length based features
[00:07:34] we could also have length based features and that's just a useful reminder that
[00:07:36] and that's just a useful reminder that these don't all have to be count
[00:07:37] these don't all have to be count features we could have real valued
[00:07:39] features we could have real valued features of various kinds and they could
[00:07:41] features of various kinds and they could signal something important about the
[00:07:43] signal something important about the class label like for example i think
[00:07:46] class label like for example i think neutral reviews three star reviews tend
[00:07:48] neutral reviews three star reviews tend to be longer than one and five star
[00:07:50] to be longer than one and five star reviews so might as well throw that in
[00:07:53] reviews so might as well throw that in and we can expand that idea of float
[00:07:55] and we can expand that idea of float value features a little bit more i like
[00:07:56] value features a little bit more i like the idea of thwarted expectations which
[00:07:58] the idea of thwarted expectations which you might keep track of as the ratio of
[00:08:01] you might keep track of as the ratio of positive to negative words in a sentence
[00:08:04] positive to negative words in a sentence uh the idea being that very often if
[00:08:06] uh the idea being that very often if that ratio is exaggerated it's telling
[00:08:09] that ratio is exaggerated it's telling you the opposite story that you might
[00:08:11] you the opposite story that you might expect about the overall sentiment many
[00:08:13] expect about the overall sentiment many many positive words stacked up together
[00:08:15] many positive words stacked up together might actually be preparing you for a
[00:08:17] might actually be preparing you for a negative assessment and the reverse
[00:08:20] negative assessment and the reverse but the important thing about this
[00:08:22] but the important thing about this feature is that it wouldn't decide for
[00:08:24] feature is that it wouldn't decide for you what these ratios mean you would
[00:08:25] you what these ratios mean you would just hope that it was a useful signal
[00:08:27] just hope that it was a useful signal that your model might pick up on as part
[00:08:29] that your model might pick up on as part of optimization to figure out how to
[00:08:31] of optimization to figure out how to make use of the information
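One hedged way to implement such a ratio feature, with made-up mini-lexicons and add-one smoothing to avoid division by zero:

```python
# Hypothetical mini-lexicons; real ones (like the sentiment lexicons
# shown earlier in the series) would be far larger.
POSITIVE = {"good", "great", "amazing", "masterpiece"}
NEGATIVE = {"bad", "boring", "awful", "dull"}

def pos_neg_ratio(tokens):
    """Float-valued feature: smoothed ratio of positive to negative
    lexicon hits. The model, not the feature, decides what it means."""
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return (pos + 1) / (neg + 1)  # add-one smoothing avoids zero division

print(pos_neg_ratio("great great amazing but dull".split()))  # 4 / 2 = 2.0
```

The feature just reports the ratio; whether an exaggerated ratio signals a thwarted expectation is left for optimization to discover.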
[00:08:34] make use of the information and then finally we could write
[00:08:35] and then finally we could write various kinds of ad hoc feature
[00:08:37] various kinds of ad hoc feature functions to try to capture the fact
[00:08:39] functions to try to capture the fact that many uses of language are
[00:08:41] that many uses of language are non-literal and might be signaling
[00:08:43] non-literal and might be signaling exactly the opposite of what they seem
[00:08:45] exactly the opposite of what they seem to do on their surface like not exactly
[00:08:47] to do on their surface like not exactly a masterpiece is probably a pretty
[00:08:50] a masterpiece is probably a pretty negative review
[00:08:51] negative review it was like 50 hours long is not saying
[00:08:53] it was like 50 hours long is not saying that it was actually 50 hours long but
[00:08:55] that it was actually 50 hours long but rather with hyperbole indicating that it
[00:08:57] rather with hyperbole indicating that it was much too long or something like that
[00:09:00] was much too long or something like that and the best movie in the history of the
[00:09:01] and the best movie in the history of the universe could be a
[00:09:03] universe could be a ringing endorsement but it could just as
[00:09:05] ringing endorsement but it could just as easily be a bit of sarcasm
[00:09:08] easily be a bit of sarcasm capturing those kind of subtle
[00:09:09] capturing those kind of subtle distinctions is of course much more
[00:09:11] distinctions is of course much more difficult but the hand-built feature
[00:09:13] difficult but the hand-built feature functions that you write are where you could
[00:09:14] functions that you write are where you could try to capture it and if they have a
[00:09:16] try to capture it and if they have a positive effect then
[00:09:18] positive effect then maybe you've made some real progress
[00:09:20] maybe you've made some real progress and that's a good transition point to
[00:09:22] and that's a good transition point to this topic of assessing individual
[00:09:23] this topic of assessing individual feature functions as you can see the
[00:09:25] feature functions as you can see the philosophy in this mode of work is that
[00:09:27] philosophy in this mode of work is that you write lots of feature functions and
[00:09:29] you write lots of feature functions and kind of see how well they can do at
[00:09:32] kind of see how well they can do at improving your model overall
[00:09:34] improving your model overall you might end up with a very large model
[00:09:36] you might end up with a very large model with many correlated features and that
[00:09:38] with many correlated features and that might lead you to want to do some
[00:09:39] might lead you to want to do some feature selection to weed out the ones
[00:09:41] feature selection to weed out the ones that are not contributing in a positive
[00:09:43] that are not contributing in a positive way
[00:09:44] way now scikit-learn has a whole library for
[00:09:46] now scikit-learn has a whole library for doing this called feature_selection and
[00:09:48] doing this called feature_selection and it offers lots of functions that will
[00:09:50] it offers lots of functions that will let you assess how much information your
[00:09:52] let you assess how much information your feature functions contain with respect
[00:09:54] feature functions contain with respect to the labels for your classification
[00:09:56] to the labels for your classification problem so this is very powerful and i
[00:09:58] problem so this is very powerful and i encourage you to use them but you should
[00:10:00] encourage you to use them but you should be a little bit cautious
[00:10:02] be a little bit cautious take care when assessing feature
[00:10:04] take care when assessing feature functions individually because
[00:10:06] functions individually because correlations between those features will
[00:10:09] correlations between those features will make the assessments very hard to
[00:10:11] make the assessments very hard to interpret
[00:10:12] interpret the problem here is that your model is
[00:10:14] the problem here is that your model is holistically thinking about how all
[00:10:16] holistically thinking about how all these features relate to your class
[00:10:17] these features relate to your class label and figuring out how to optimize
[00:10:19] label and figuring out how to optimize weights on that basis whereas the
[00:10:21] weights on that basis whereas the feature selection methods many of them
[00:10:23] feature selection methods many of them just look at individual features and how
[00:10:25] just look at individual features and how they relate to the class labels so
[00:10:26] they relate to the class labels so you're losing all that correlational
[00:10:28] you're losing all that correlational context
[00:10:29] context and to make that a little concrete i
[00:10:31] and to make that a little concrete i just cooked up an example here an
[00:10:32] just cooked up an example here an idealized one that shows how you could
[00:10:35] idealized one that shows how you could be misled so i have three features x1 x2
[00:10:38] be misled so i have three features x1 x2 and x3 and a simple binary
[00:10:40] and x3 and a simple binary classification problem and i use the
[00:10:42] classification problem and i use the chi-square test from feature selection
[00:10:44] chi-square test from feature selection to kind of assess how important each one
[00:10:46] to kind of assess how important each one of these features is with respect to
[00:10:48] of these features is with respect to this classification problem
[00:10:50] this classification problem and what i found is that intuitively it
[00:10:52] and what i found is that intuitively it looks like x1 and x2 are really powerful
[00:10:55] looks like x1 and x2 are really powerful features
[00:10:56] features and that might lead me to think well
[00:10:58] and that might lead me to think well i'll drop the third feature and include
[00:11:00] i'll drop the third feature and include just one and two in my model
[00:11:03] just one and two in my model so far so good however if we thoroughly
[00:11:06] so far so good however if we thoroughly explore this space what we find is that
[00:11:08] explore this space what we find is that in truth a simple linear model performs
[00:11:10] in truth a simple linear model performs best with just feature x1 and actually
[00:11:13] best with just feature x1 and actually including x2 hurts the model despite the
[00:11:16] including x2 hurts the model despite the fact that it has this positive feature
[00:11:18] fact that it has this positive feature importance value so what we really ought
[00:11:20] importance value so what we really ought to be doing is using just this single
[00:11:22] to be doing is using just this single feature but these methods can't tell us
[00:11:24] feature but these methods can't tell us that and even a positive
[00:11:27] that and even a positive feature selection value might actually
[00:11:29] feature selection value might actually be something that's at odds with what
[00:11:31] be something that's at odds with what we're trying to do with our model as
[00:11:33] we're trying to do with our model as this example shows
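A sketch of that kind of per-feature assessment with scikit-learn's `chi2`. The toy matrix below is constructed for illustration and is not the speaker's actual example:

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy binary classification problem with three count-valued features
# x1, x2, x3 (columns). x1 and x2 track the label closely; x3 does not.
X = np.array([
    [3, 3, 1],
    [2, 2, 0],
    [0, 0, 1],
    [1, 0, 0],
])
y = [1, 1, 0, 0]

# Per-feature chi-squared statistics against the class labels. Higher
# looks "more important", but the score ignores feature correlations,
# which is exactly the trap described above.
scores, pvals = chi2(X, y)
print(scores)
```

Because `chi2` scores each column in isolation, two heavily correlated features like x1 and x2 can both score well even when including both hurts the model.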
[00:11:35] this example shows so ideally what you would do is consider
[00:11:38] so ideally what you would do is consider more holistic assessment methods which
[00:11:39] more holistic assessment methods which scikit also offers this would be things
[00:11:42] scikit also offers this would be things like systematically removing or
[00:11:44] like systematically removing or perturbing feature values in the context
[00:11:46] perturbing feature values in the context of the full model that you're optimizing
[00:11:49] of the full model that you're optimizing and comparing performance across those
[00:11:51] and comparing performance across those models this is much more expensive
[00:11:53] models this is much more expensive because you're optimizing many many
[00:11:54] because you're optimizing many many models so it might be prohibitive for
[00:11:56] models so it might be prohibitive for some classes of models that you're
[00:11:58] some classes of models that you're exploring but if you can do it this will
[00:12:00] exploring but if you can do it this will be more reliable
[00:12:02] be more reliable however if this is impossible it might
[00:12:04] however if this is impossible it might still be productive to do some feature
[00:12:06] still be productive to do some feature selection using simpler methods you
[00:12:09] selection using simpler methods you should just be aware that you might be
[00:12:11] should just be aware that you might be doing something that's not optimal for
[00:12:13] doing something that's not optimal for the actual optimization problem that
[00:12:15] the actual optimization problem that you've posed
[00:12:18] okay the final section of this
[00:12:19] okay the final section of this screencast is a kind of transition into
[00:12:21] screencast is a kind of transition into the world of deep learning i've called
[00:12:23] the world of deep learning i've called this distributed representations as
[00:12:25] this distributed representations as features this is a very different mode
[00:12:27] features this is a very different mode for thinking about representing examples
[00:12:30] for thinking about representing examples what we do in this case is take our
[00:12:32] what we do in this case is take our token stream as before but instead of
[00:12:34] token stream as before but instead of writing a lot of hand-built feature
[00:12:36] writing a lot of hand-built feature functions we simply look up each one of
[00:12:38] functions we simply look up each one of those tokens in some embedding that we
[00:12:40] those tokens in some embedding that we have for example it could be an
[00:12:42] have for example it could be an embedding that you created in the first
[00:12:43] embedding that you created in the first unit of this course
[00:12:45] unit of this course or it could be a glove embedding or a
[00:12:47] or it could be a glove embedding or a static embedding that you derived from
[00:12:47] static embedding that you derived from bert representations and so forth and so
[00:12:49] bert representations and so forth and so on the important thing is that each
[00:12:53] on the important thing is that each token is now represented by a vector
[00:12:56] token is now represented by a vector and that could be a powerful idea
[00:12:57] and that could be a powerful idea because in representing each of these
[00:12:59] because in representing each of these words as vectors we are now
[00:13:02] words as vectors we are now capturing the relationships between
[00:13:04] capturing the relationships between those tokens we might now have a hope of
[00:13:06] those tokens we might now have a hope of seeing that sofa and couch are actually
[00:13:09] seeing that sofa and couch are actually similar features in general and not just
[00:13:11] similar features in general and not just with respect to the class labels that we
[00:13:13] with respect to the class labels that we have
[00:13:15] have so that's the idea why this might be
[00:13:16] so that's the idea why this might be powerful so we take all those vectors
[00:13:18] powerful so we take all those vectors and look them up however for all these
[00:13:19] and look them up however for all these classifier models we need a fixed
[00:13:21] classifier models we need a fixed dimensional representation to feed into
[00:13:23] dimensional representation to feed into the actual classifier unit so we're
[00:13:25] the actual classifier unit so we're going to have to combine those vectors
[00:13:27] going to have to combine those vectors in some way and the simplest thing you
[00:13:28] in some way and the simplest thing you could do is combine them via some
[00:13:30] could do is combine them via some function like sum or mean right
[00:13:33] function like sum or mean right so i would take all these things for
[00:13:34] so i would take all these things for example and take their average and that
[00:13:36] example and take their average and that would give me another fixed dimensional
[00:13:37] would give me another fixed dimensional representation no matter how many tokens
[00:13:40] representation no matter how many tokens are in each one of the examples
[00:13:42] are in each one of the examples and that average vector would be the
[00:13:44] and that average vector would be the input to the classifier
[00:13:47] input to the classifier so if each one of these vectors has
[00:13:48] so if each one of these vectors has dimension 300 then so too does the
[00:13:51] dimension 300 then so too does the feature representation of my entire
[00:13:53] feature representation of my entire example and now i have a classifier
[00:13:56] example and now i have a classifier which is processing feature
[00:13:57] which is processing feature representations that have 300 columns
[00:14:00] representations that have 300 columns each dimension in the underlying
[00:14:02] each dimension in the underlying embedding space now corresponds to a
[00:14:04] embedding space now corresponds to a feature
[00:14:05] feature and that's the basis for optimization
[00:14:07] and that's the basis for optimization and i'd say an eye-opening thing about
[00:14:08] and i'd say an eye-opening thing about this class of models is despite them
[00:14:10] this class of models is despite them being very compact you know 300
[00:14:13] being very compact you know 300 dimensions versus 20 000 that you might
[00:14:15] dimensions versus 20 000 that you might have from a bag of words model they turn
[00:14:17] have from a bag of words model they turn out to be very powerful
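To make the averaging idea concrete, here is a minimal numpy sketch; the tiny vocabulary and random vectors are made-up stand-ins for the 300-dimensional GloVe space described above, and the function name is hypothetical:

```python
import numpy as np

# Made-up "embedding space" standing in for GloVe: word -> 300-dim vector.
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=300) for w in ["the", "sofa", "couch", "rules"]}

def phi(tokens, combine=np.mean):
    # Look up each token's vector and combine them into one
    # fixed-dimensional representation (here: the mean).
    vecs = [embedding[t] for t in tokens if t in embedding]
    return combine(vecs, axis=0)

x = phi(["the", "sofa", "rules"])
assert x.shape == (300,)  # same dimensionality no matter how many tokens
```

However many tokens an example has, the combined vector always has the embedding dimensionality, which is what the downstream classifier requires.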
[00:14:20] out to be very powerful and this final slide here just shows you
[00:14:22] and this final slide here just shows you how to implement those using tools and
[00:14:24] how to implement those using tools and other utilities for our course
[00:14:26] other utilities for our course so i'm going to use glove and i'm going
[00:14:28] so i'm going to use glove and i'm going to use the 300 dimensional glove space
[00:14:30] to use the 300 dimensional glove space which is included in your data
[00:14:31] which is included in your data distribution
[00:14:32] distribution in four and five here we just write
[00:14:34] in four and five here we just write simple feature functions and the
[00:14:35] simple feature functions and the hallmark of these is that they are
[00:14:37] hallmark of these is that they are simply looking up words in the embedding
[00:14:39] simply looking up words in the embedding and then combining them via whatever
[00:14:41] and then combining them via whatever function the user specifies so the
[00:14:43] function the user specifies so the output of this is directly a vector
[00:14:45] output of this is directly a vector representation of each example
[00:14:48] representation of each example in cell six we set up a logistic
[00:14:50] in cell six we set up a logistic regression as before of course it could
[00:14:52] regression as before of course it could be a much fancier model but logistic
[00:14:54] be a much fancier model but logistic regression will do
[00:14:55] regression will do and then we use ssd experiment almost
[00:14:57] and then we use ssd experiment almost exactly as before the one change we need
[00:15:00] exactly as before the one change we need to remember to make when operating in
[00:15:02] to remember to make when operating in this mode is to set the flag vectorize
[00:15:04] this mode is to set the flag vectorize equals false
[00:15:05] equals false we already have each example represented
[00:15:08] we already have each example represented as a vector so we do not need to pass it
[00:15:10] as a vector so we do not need to pass it through that whole process of using a
[00:15:12] through that whole process of using a dict vectorizer to turn count
[00:15:14] dict vectorizer to turn count dictionaries into vectors
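As a rough stand-alone analog of that workflow (not the course's sst.experiment itself), the sketch below fits a scikit-learn logistic regression directly on dense example vectors; the data is random, standing in for GloVe-averaged features, which is why no DictVectorizer step appears:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for GloVe-averaged examples: each example is already a dense
# 300-dimensional vector, so there are no count dictionaries to convert
# (this is the situation that vectorize=False signals in the course code).
X = rng.normal(size=(40, 300))
y = rng.integers(0, 2, size=40)

mod = LogisticRegression(max_iter=1000)
mod.fit(X, y)
assert mod.predict(X).shape == (40,)
```

A fancier model could be dropped in for LogisticRegression without changing the rest of the pipeline.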
[00:15:17] dictionaries into vectors and as i said before these turn out to
[00:15:19] and as i said before these turn out to be quite good models despite their
[00:15:20] be quite good models despite their compactness and the final thing i'll say
[00:15:22] compactness and the final thing i'll say is that this model is a nice transition
[00:15:25] is that this model is a nice transition into the recurrent neural networks that
[00:15:27] into the recurrent neural networks that we'll study in the final screencast for
[00:15:29] we'll study in the final screencast for this unit which essentially generalized
[00:15:31] this unit which essentially generalized this idea by learning an interesting
[00:15:34] this idea by learning an interesting combination function for all the vectors
[00:15:36] combination function for all the vectors for each of the individual tokens
Lecture 019
RNN Classifiers | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=7n9zQ169b8Q
---
Transcript
[00:00:04] welcome back everyone this is part eight
[00:00:06] welcome back everyone this is part eight in our series on supervised sentiment
[00:00:07] in our series on supervised sentiment analysis the final screencast in the
[00:00:09] analysis the final screencast in the series we're going to be talking about
[00:00:11] series we're going to be talking about recurrent neural network or rnn
[00:00:13] recurrent neural network or rnn classifiers i suppose this is officially
[00:00:15] classifiers i suppose this is officially our first step into the world of deep
[00:00:16] our first step into the world of deep learning for sentiment analysis
[00:00:19] learning for sentiment analysis this slide gives an overview of the
[00:00:21] this slide gives an overview of the model and let's work through it in some
[00:00:22] model and let's work through it in some detail so we've got a single example
[00:00:25] detail so we've got a single example with three tokens the rock rules these
[00:00:27] with three tokens the rock rules these models are prepared for variable length
[00:00:29] models are prepared for variable length sequences but this example has to have
[00:00:31] sequences but this example has to have happens to have length three
[00:00:33] happens to have length three and the first step to get this model
[00:00:35] and the first step to get this model started is a familiar one we're going to
[00:00:36] started is a familiar one we're going to look up each one of those tokens in what
[00:00:38] look up each one of those tokens in what is presumably a fixed embedding space
[00:00:41] is presumably a fixed embedding space here so for each token we'll get a
[00:00:42] here so for each token we'll get a vector representation
[00:00:45] vector representation the next step is that we have some
[00:00:46] the next step is that we have some learned parameters a weight matrix w xh
[00:00:49] learned parameters a weight matrix w xh and the subscript indicates that we're
[00:00:51] and the subscript indicates that we're going from the inputs x into the hidden
[00:00:53] going from the inputs x into the hidden layer h so that's a first transformation
[00:00:56] layer h so that's a first transformation and that weight matrix is used at each
[00:00:57] and that weight matrix is used at each one of these time steps
[00:01:00] one of these time steps there is a second learning weight matrix
[00:01:01] there is a second learning weight matrix which i've called whhh to indicate that
[00:01:04] which i've called whhh to indicate that we are now traveling through the hidden
[00:01:06] we are now traveling through the hidden layer
[00:01:07] layer so we start at some initial state h0
[00:01:09] so we start at some initial state h0 which could be an all zero or a randomly
[00:01:11] which could be an all zero or a randomly initialized vector or a vector coming
[00:01:13] initialized vector or a vector coming from some other component in the model
[00:01:15] from some other component in the model and that representation is combined with
[00:01:17] and that representation is combined with the representation that we derive going
[00:01:19] the representation that we derive going vertically up from the embedding usually
[00:01:21] vertically up from the embedding usually in some additive fashion to create this
[00:01:24] in some additive fashion to create this hidden state here h1 and those
[00:01:26] hidden state here h1 and those parameters wh are used again at each one
[00:01:29] parameters wh are used again at each one of these time steps so that we have two
[00:01:31] of these time steps so that we have two learned weight matrices as part of the
[00:01:33] learned weight matrices as part of the core structure of this model the one
[00:01:35] core structure of this model the one that takes us from embeddings into the
[00:01:37] that takes us from embeddings into the hidden layer and the one that travels us
[00:01:39] hidden layer and the one that travels us across the hidden layer and again those
[00:01:41] across the hidden layer and again those are typically combined in some additive
[00:01:43] are typically combined in some additive fashion to create these internal hidden
[00:01:45] fashion to create these internal hidden representations
[00:01:46] representations now we can do anything we want with
[00:01:48] now we can do anything we want with those internal hidden representations
[00:01:50] those internal hidden representations when we use rnns as classifiers we do
[00:01:53] when we use rnns as classifiers we do what is arguably the simplest thing
[00:01:55] what is arguably the simplest thing which is take the final representation
[00:01:58] which is take the final representation and use that as the input to a standard
[00:02:00] and use that as the input to a standard softmax classifier so from the point of
[00:02:02] softmax classifier so from the point of view of h3 going to y here we just have
[00:02:06] view of h3 going to y here we just have a learn weight matrix for the classifier
[00:02:08] a learn weight matrix for the classifier maybe also a bias term but from this
[00:02:10] maybe also a bias term but from this point here this is really just a
[00:02:11] point here this is really just a classifier of the sort we've been
[00:02:13] classifier of the sort we've been studying up until this point in the unit
[00:02:16] studying up until this point in the unit of course we could elaborate this model
[00:02:17] of course we could elaborate this model in all sorts of ways it could run
[00:02:19] in all sorts of ways it could run bidirectionally we could make more full
[00:02:21] bidirectionally we could make more full use of the different hidden
[00:02:23] use of the different hidden representations here but in the simplest
[00:02:25] representations here but in the simplest mode our rnn classifiers will just
[00:02:27] mode our rnn classifiers will just derive hidden representations at each
[00:02:29] derive hidden representations at each time step and use the final one as the
[00:02:31] time step and use the final one as the input to a classifier a couple things i
[00:02:33] input to a classifier a couple things i would say about this first if you would
[00:02:35] would say about this first if you would like a further layer of detail on how
[00:02:37] like a further layer of detail on how these models are structured and
[00:02:38] these models are structured and optimized i would encourage you to look
[00:02:40] optimized i would encourage you to look at this pure numpy reference
[00:02:42] at this pure numpy reference implementation of an rnn classifier that
[00:02:45] implementation of an rnn classifier that is included in our course code
[00:02:46] is included in our course code distribution i think that's a great way
[00:02:48] distribution i think that's a great way to get a feel for the recursive process
[00:02:51] to get a feel for the recursive process of
[00:02:51] of you know computing through full
[00:02:53] you know computing through full sequences and then having the error
[00:02:55] sequences and then having the error signals back propagate through to update
[00:02:57] signals back propagate through to update the weight matrix
[00:02:58] the weight matrix but for now i think just understanding
[00:03:01] but for now i think just understanding that the core structure of this model is
[00:03:03] that the core structure of this model is sufficient
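For intuition only, here is a minimal numpy sketch of that core structure (this is not the course's reference implementation, and the weight matrices are randomly initialized rather than learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, n_classes = 50, 20, 2

# W_xh maps inputs into the hidden layer; W_hh travels across the hidden
# layer between time steps; W_hy is the softmax classifier at the end.
W_xh = rng.normal(scale=0.1, size=(d_emb, d_hid))
W_hh = rng.normal(scale=0.1, size=(d_hid, d_hid))
W_hy = rng.normal(scale=0.1, size=(d_hid, n_classes))
b_y = np.zeros(n_classes)

def rnn_classify(embedded_tokens):
    # Combine each input vector with the previous hidden state
    # (additively, followed by a tanh nonlinearity), and feed the
    # final hidden state into a standard softmax classifier.
    h = np.zeros(d_hid)                 # h0: an all-zero initial state
    for x in embedded_tokens:           # works for any sequence length
        h = np.tanh(x @ W_xh + h @ W_hh)
    logits = h @ W_hy + b_y
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()              # class probabilities

probs = rnn_classify(rng.normal(size=(3, d_emb)))  # e.g. three tokens
assert probs.shape == (2,) and abs(probs.sum() - 1.0) < 1e-9
```

The same two weight matrices are reused at every time step, which is what lets the model handle variable-length sequences.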
[00:03:06] sufficient i just want to remind you from the
[00:03:07] i just want to remind you from the previous screencast that we're very
[00:03:09] previous screencast that we're very close to the idea of distributed
[00:03:11] close to the idea of distributed representations of features that i
[00:03:13] representations of features that i introduced before
[00:03:14] introduced before recall that for this mode what we do is
[00:03:16] recall that for this mode what we do is look up each token in an embedding space
[00:03:18] look up each token in an embedding space just as we do for the rnn
[00:03:20] just as we do for the rnn but instead of learning some complicated
[00:03:22] but instead of learning some complicated combination function with a bunch of
[00:03:24] combination function with a bunch of learned parameters we simply combine
[00:03:26] learned parameters we simply combine them via sum or average and that's the
[00:03:28] them via sum or average and that's the basis that's the input to the classifier
[00:03:31] basis that's the input to the classifier here the rnn can be considered an
[00:03:33] here the rnn can be considered an elaboration of that because instead of
[00:03:35] elaboration of that because instead of assuming that these vectors here will be
[00:03:36] assuming that these vectors here will be combined in some simple way like summer
[00:03:38] combined in some simple way like summer mean we now have really vast capacity to
[00:03:42] mean we now have really vast capacity to learn a much more complicated way of
[00:03:43] learn a much more complicated way of combining them that is optimal with
[00:03:45] combining them that is optimal with respect to the classifier that we're
[00:03:47] respect to the classifier that we're trying to fit
[00:03:48] trying to fit but fundamentally these are very similar
[00:03:50] but fundamentally these are very similar ideas and if it happened that
[00:03:52] ideas and if it happened that some or mean is in this picture was
[00:03:54] some or mean is in this picture was exactly the right function to learn for
[00:03:56] exactly the right function to learn for your data then the rnn would certainly
[00:03:58] your data then the rnn would certainly have the capacity to do that we just
[00:04:00] have the capacity to do that we just tend to favor the rnn because it can
[00:04:02] tend to favor the rnn because it can learn of course a much wider range of
[00:04:04] learn of course a much wider range of complicated custom functions that are
[00:04:06] complicated custom functions that are particular to the problem that you've
[00:04:07] particular to the problem that you've posed
[00:04:10] now so far we've been operating in the
[00:04:12] now so far we've been operating in the mode which i've called standard rnn data
[00:04:15] mode which i've called standard rnn data set preparation let's linger over that
[00:04:17] set preparation let's linger over that in a little bit of detail suppose that
[00:04:18] in a little bit of detail suppose that we have two examples containing the
[00:04:20] we have two examples containing the tokens aba and bc those are our two raw
[00:04:24] tokens aba and bc those are our two raw inputs
[00:04:25] inputs the first step in the standard mode is
[00:04:27] the first step in the standard mode is to look at each one of those in some in
[00:04:29] to look at each one of those in some in in some list of indices
[00:04:31] in some list of indices and then those indices are keyed into an
[00:04:34] and then those indices are keyed into an embedding space and those finally give
[00:04:36] embedding space and those finally give us the vector representations of each
[00:04:38] us the vector representations of each examples so that really and truly the
[00:04:41] example so that really and truly the
[00:04:43] input to the rnn is a list of vectors
[00:04:45] it's just that we have typically obtained those vectors by looking them
[00:04:47] obtained those vectors by looking them up in a fixed embedding space and so for
[00:04:49] up in a fixed embedding space and so for example since a occurs twice in this
[00:04:51] example since a occurs twice in this first example it is literally repeated
[00:04:54] first example it is literally repeated as the first and third vectors here
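A small sketch of that preparation step, with a made-up three-word vocabulary and two-dimensional vectors in place of a real embedding space:

```python
import numpy as np

# Vocabulary indices and a tiny hypothetical embedding matrix.
vocab = {"a": 0, "b": 1, "c": 2}
embedding = np.array([[1.0, 0.0],   # vector for "a"
                      [0.0, 1.0],   # vector for "b"
                      [0.5, 0.5]])  # vector for "c"

def prepare(tokens):
    # tokens -> indices -> list of embedding vectors
    idx = [vocab[t] for t in tokens]
    return embedding[idx]

X1 = prepare(["a", "b", "a"])
# "a" occurs twice, so its vector is literally repeated:
assert (X1[0] == X1[2]).all()
X2 = prepare(["b", "c"])  # examples may have different lengths
assert X1.shape == (3, 2) and X2.shape == (2, 2)
```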
[00:04:57] as the first and third vectors here now i think you can see latent in this
[00:04:59] now i think you can see latent in this picture the possibility that we might
[00:05:01] picture the possibility that we might drop the embedding space and instead
[00:05:03] drop the embedding space and instead just directly input lists of vectors and
[00:05:05] just directly input lists of vectors and that is one way that we will explore
[00:05:07] that is one way that we will explore later on in the quarter of using
[00:05:09] later on in the quarter of using contextual models like bert we would
[00:05:11] contextual models like bert we would simply look up entire token streams and
[00:05:14] simply look up entire token streams and get back lists of vectors and use those
[00:05:16] get back lists of vectors and use those as fixed inputs to a model like an rnn
[00:05:19] as fixed inputs to a model like an rnn and that's a first step toward
[00:05:21] and that's a first step toward fine-tuning models like bert on problems
[00:05:24] fine-tuning models like bert on problems like the ones we've posed in this unit
[00:05:26] like the ones we've posed in this unit so have that idea in mind as we talk
[00:05:28] so have that idea in mind as we talk next about fine tuning strategies
[00:05:32] next about fine tuning strategies now another practical note what i've
[00:05:34] now another practical note what i've shown you so far is what you'd call a
[00:05:37] shown you so far is what you'd call a simple vanilla rnn
[00:05:39] simple vanilla rnn lstms long short-term memory networks
[00:05:42] lstms long short-term memory networks are much more powerful models and will
[00:05:44] are much more powerful models and will kind of default to them when we do
[00:05:45] kind of default to them when we do experiments the fundamental issue is
[00:05:48] experiments the fundamental issue is that plane rnns tend to perform poorly
[00:05:50] that plane rnns tend to perform poorly with very long sequences you get that
[00:05:52] with very long sequences you get that error signal from the classifier there
[00:05:54] error signal from the classifier there at the final token and now information
[00:05:57] at the final token and now information has to flow all the way back down
[00:05:58] has to flow all the way back down through the network it could be a very
[00:06:00] through the network it could be a very long sequence and the result is that the
[00:06:03] long sequence and the result is that the information coming from that error
[00:06:04] information coming from that error signal is often lost or distorted
[00:06:08] signal is often lost or distorted now lstm cells are a prominent response
[00:06:10] now lstm cells are a prominent response to this problem they introduce
[00:06:12] to this problem they introduce mechanisms that control the flow of
[00:06:14] mechanisms that control the flow of information and help you avoid the
[00:06:16] information and help you avoid the problems of optimization that arise for
[00:06:18] problems of optimization that arise for regular rnns now i'm not going to take
[00:06:20] regular rnns now i'm not going to take the time here to review this mechanism
[00:06:22] the time here to review this mechanism in detail i would instead recommend
[00:06:25] in detail i would instead recommend these two excellent blog posts they have
[00:06:27] these two excellent blog posts they have great diagrams and really detailed
[00:06:29] great diagrams and really detailed discussions they can do a much better
[00:06:31] discussions they can do a much better job than i can of really conveying
[00:06:33] job than i can of really conveying intuitions visually and also with math
[00:06:36] intuitions visually and also with math and i think you could pick one or both
[00:06:38] and i think you could pick one or both and really pretty quickly gain a deep
[00:06:40] and really pretty quickly gain a deep understanding of precisely how lstm
[00:06:42] understanding of precisely how lstm cells are functioning
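For a rough sense of the gating mechanism, here is a sketch of a single LSTM cell step with random untrained weights and biases omitted (the blog posts mentioned above give the full picture):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # input and hidden size, kept equal for brevity

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on the concatenation [x_t; h_{t-1}].
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(2 * d, d)) for _ in range(4))

def lstm_step(x, h, c):
    # The forget, input, and output gates control how information
    # flows through the cell state c, which is what helps error
    # signals survive long sequences.
    z = np.concatenate([x, h])
    f = sigmoid(z @ W_f)                 # what to erase from the cell state
    i = sigmoid(z @ W_i)                 # what new information to write
    o = sigmoid(z @ W_o)                 # what to expose as the hidden state
    c_new = f * c + i * np.tanh(z @ W_c)
    h_new = o * np.tanh(c_new)
    return h_new, c_new

h = c = np.zeros(d)
for x in rng.normal(size=(5, d)):        # run over a length-5 sequence
    h, c = lstm_step(x, h, c)
assert h.shape == (d,) and c.shape == (d,)
```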
[00:06:45] cells are functioning the final thing here is just a code
[00:06:47] the final thing here is just a code snippet to show you how easy it is to
[00:06:49] snippet to show you how easy it is to use our course code repository to fit
[00:06:51] use our course code repository to fit models like this in the context of
[00:06:53] models like this in the context of sentiment analysis you can again make
[00:06:55] sentiment analysis you can again make use of this sst library and what i've
[00:06:57] use of this sst library and what i've done here is a kind of complicated
[00:06:59] done here is a kind of complicated version showing you a bunch of different
[00:07:01] version showing you a bunch of different features so
[00:07:03] features so in cell 2 you can see that i'm going to
[00:07:05] in cell 2 you can see that i'm going to have a pointer to glove and i'm going to
[00:07:07] have a pointer to glove and i'm going to create a glove look up
[00:07:09] create a glove look up using the 50 dimensional vectors just to
[00:07:11] using the 50 dimensional vectors just to keep things simple
[00:07:12] keep things simple the feature function for this model is
[00:07:15] the feature function for this model is not one that returns count dictionaries
[00:07:17] not one that returns count dictionaries it's important for the structure of the
[00:07:19] it's important for the structure of the model we're going to use that you input
[00:07:21] model we're going to use that you input raw sequences of tokens so all we're
[00:07:23] raw sequences of tokens so all we're doing here is down casing the sequence
[00:07:25] doing here is down casing the sequence and then splitting on white space of
[00:07:27] and then splitting on white space of course you could do something more
[00:07:28] course you could do something more sophisticated
[00:07:30] sophisticated the idea though is that you want to
[00:07:31] the idea though is that you want to align with the glove vocabulary our
[00:07:34] align with the glove vocabulary our model wrapper is doing a few things it's
[00:07:36] model wrapper is doing a few things it's creating a vocabulary and loading it and
[00:07:38] creating a vocabulary and loading it and embedding using this glove space that'll
[00:07:40] embedding using this glove space that'll be the initial embedding for our model
[00:07:43] be the initial embedding for our model and if you leave this step out you'll
[00:07:44] and if you leave this step out you'll have a randomly initialized embedding
[00:07:46] have a randomly initialized embedding space which uh might be fine as well but
[00:07:48] space which uh might be fine as well but presumably glove will give us a step up
[00:07:51] presumably glove will give us a step up and then we set up the torch rnn
[00:07:53] and then we set up the torch rnn classifier and what i've done here is
[00:07:55] classifier and what i've done here is expose a lot of the different keyword
[00:07:56] expose a lot of the different keyword arguments not all of them there are lots
[00:07:58] arguments not all of them there are lots of knobs that you can fiddle with as is
[00:08:00] of knobs that you can fiddle with as is typical for deep learning models
[00:08:02] typical for deep learning models maybe the one i would call out is that
[00:08:04] maybe the one i would call out is that we are using that fixed embedding that
[00:08:05] we are using that fixed embedding that we got from glove and i have set early
[00:08:07] we got from glove and i have set early stopping equals true which might help
[00:08:09] stopping equals true which might help you efficiently optimize these models
[00:08:12] you efficiently optimize these models otherwise you'll have to figure out how
[00:08:13] otherwise you'll have to figure out how many iterations you actually want it to
[00:08:15] many iterations you actually want it to run for and you might run it for much
[00:08:17] run for and you might run it for much too long or much less time than is
[00:08:19] too long or much less time than is needed to get an optimal model the early
[00:08:22] needed to get an optimal model the early stopping options there are a few other
[00:08:23] stopping options there are a few other parameters involved in that might help
[00:08:26] parameters involved in that might help you optimize these models efficiently
[00:08:28] you optimize these models efficiently and effectively
[00:08:30] and effectively in the end though having set up all that
[00:08:31] in the end though having set up all that stuff you call fit as usual and return
[00:08:33] stuff you call fit as usual and return the train model and in that context you
[00:08:36] the train model and in that context you can simply use sst experiment with these
[00:08:38] can simply use sst experiment with these previous components to conduct
[00:08:40] previous components to conduct experiments with rnns just as you did
[00:08:42] experiments with rnns just as you did for simpler linear models as in previous
[00:08:44] for simpler linear models as in previous screencasts the one change which will be
[00:08:47] screencasts the one change which will be will be familiar from the previous
[00:08:49] will be familiar from the previous screencast is that you need to set
[00:08:50] screencast is that you need to set vectorize equals false and that is
[00:08:53] vectorize equals false and that is important because again we're going to
[00:08:55] important because again we're going to let the model process these examples we
[00:08:57] let the model process these examples we don't want to pipe everything through
[00:08:58] don't want to pipe everything through some kind of dick vectorizer that's
[00:09:00] some kind of dick vectorizer that's strictly for hand built feature
[00:09:02] strictly for hand built feature functions and sparse linear models here
[00:09:04] functions and sparse linear models here in the land of deep learning vectorize
[00:09:06] in the land of deep learning vectorize equals false and we'll use the
[00:09:08] equals false and we'll use the components of the model to represent
[00:09:10] components of the model to represent each example as i discussed before
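A minimal sketch of the kind of feature function described here, which returns a raw token sequence rather than a count dictionary (the function name is made up):

```python
def phi_rnn(text):
    # Feature function for the RNN setting: lowercase the text and
    # split on whitespace so that tokens line up with the GloVe
    # vocabulary; something more sophisticated could be substituted.
    return text.lower().split()

assert phi_rnn("The Rock RULES") == ["the", "rock", "rules"]
```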
Lecture 020
Contextual Representation Models | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=ZrmEcrmmXCg
---
Transcript
[00:00:05] welcome everyone to our first screencast
[00:00:07] welcome everyone to our first screencast on contextual word representations my
[00:00:09] on contextual word representations my goal here is to give you an overview for
[00:00:11] goal here is to give you an overview for this unit and also give you a sense for
[00:00:13] this unit and also give you a sense for the conceptual landscape
[00:00:15] the conceptual landscape let's start with the associated
[00:00:17] let's start with the associated materials you might think that the name
[00:00:18] materials you might think that the name of the game for this unit is to get you
[00:00:20] of the game for this unit is to get you to the point where you can work
[00:00:21] to the point where you can work productively with this notebook called
[00:00:24] productively with this notebook called fine tuning which shows you how to
[00:00:26] fine tuning which shows you how to fine-tune contextual word
[00:00:27] fine-tune contextual word representations for classification
[00:00:29] representations for classification problems i think that could be a very
[00:00:30] problems i think that could be a very powerful mode for you as you work on the
[00:00:32] powerful mode for you as you work on the current assignment and bake off
[00:00:35] current assignment and bake off for background and intuitions i highly
[00:00:37] for background and intuitions i highly recommend this paper by noah smith
[00:00:40] recommend this paper by noah smith the beating heart of this unit is really
[00:00:42] the beating heart of this unit is really the transformer architecture which was
[00:00:44] the transformer architecture which was introduced by veswani in all 2017 in a
[00:00:47] introduced by veswani in all 2017 in a paper called attention is all you need
[00:00:49] paper called attention is all you need is a highly readable paper but i
[00:00:51] is a highly readable paper but i recommend that if you want to read it
[00:00:53] recommend that if you want to read it you instead read sasha rush's
[00:00:54] you instead read sasha rush's outstanding contribution the annotated
[00:00:56] outstanding contribution the annotated transformer what this does is literally
[00:00:59] transformer what this does is literally reproduce the text of veswani at all
[00:01:01] reproduce the text of veswani at all 2017
[00:01:03] with pytorch code woven in
[00:01:06] with pi torch code woven in culminating in a complete implementation
[00:01:08] culminating in a complete implementation of the transformer as applied to
[00:01:09] of the transformer as applied to problems in machine translation this is
[00:01:12] problems in machine translation this is a wonderful contribution in the sense
[00:01:13] a wonderful contribution in the sense that to the extent that there are points
[00:01:15] that to the extent that there are points of unclarity or uncertainty in the
[00:01:16] of unclarity or uncertainty in the original text they are fully resolved by
[00:01:19] original text they are fully resolved by sasha's code
[00:01:21] sasha's code and of course this can give you a really
[00:01:22] and of course this can give you a really good example of how to do efficient and
[00:01:24] good example of how to do efficient and effective implementation of model
[00:01:26] effective implementation of model architectures like this using pi torch
[00:01:30] architectures like this using pi torch in practical terms we're going to make
[00:01:31] in practical terms we're going to make extensive use of the hugging face
[00:01:33] extensive use of the hugging face transformers library which has really
[00:01:34] transformers library which has really opened up access to a wide range of
[00:01:37] opened up access to a wide range of pre-trained transformer models it's very
[00:01:39] pre-trained transformer models it's very exciting and has enabled lots of new
[00:01:41] exciting and has enabled lots of new things
[00:01:43] things for us the central architecture will be
[00:01:45] for us the central architecture will be burnt we'll have a separate screencast
[00:01:46] burnt we'll have a separate screencast on that and we're also going to have a
[00:01:48] on that and we're also going to have a screencast on roberta which is robustly
[00:01:50] screencast on roberta which is robustly optimized burnt i think it's an
[00:01:52] optimized burnt i think it's an interesting perspective in the sense
[00:01:53] interesting perspective in the sense that they explored more deeply some of
[00:01:56] that they explored more deeply some of the open questions from the original
[00:01:57] the open questions from the original bert paper and they also released very
[00:02:00] bert paper and they also released very powerful pre-trained parameters that you
[00:02:02] powerful pre-trained parameters that you could again use in the context of your
[00:02:04] could again use in the context of your own fine-tuning
[00:02:06] own fine-tuning and then for a slightly different
[00:02:07] and then for a slightly different perspective on these transformers we're
[00:02:09] perspective on these transformers we're going to look at the electra
[00:02:10] going to look at the electra architecture which came from kevin clark
[00:02:12] architecture which came from kevin clark and colleagues at stanford and google i
[00:02:14] and colleagues at stanford and google i really like this as a new perspective
[00:02:16] really like this as a new perspective there are of course many different modes
[00:02:18] there are of course many different modes of using the transformer at this point
[00:02:20] of using the transformer at this point i'm going to mention a few at the end of
[00:02:22] i'm going to mention a few at the end of the screencast and just for the sake of
[00:02:23] the screencast and just for the sake of time i've decided to focus on electra
[00:02:26] time i've decided to focus on electra and you can explore the others in your
[00:02:28] and you can explore the others in your own research
[00:02:30] Let's begin with some intuitions, and I'd like to begin with a linguistic intuition, which has to do with word representations and how they should be shaped by context. Let's focus on the English verb break. We have a simple example, "The vase broke", which means it shattered to pieces. Here's a superficially similar sentence, "Dawn broke"; now the sense of break is something more like begin. "The news broke": again a simple intransitive sentence, but now the verb break means something more like publish, or appear, or become known. "Sandy broke the world record": this is a transitive use of the verb break, and it means something like surpass a previous level. "Sandy broke the law" is another transitive use, but now it means Sandy transgressed. "The burglar broke into the house" is a physical act of transgression on a space. "The newscaster broke into the movie broadcast" is a sense that is more like interrupt. And we have idioms like "break even", which means we neither gained nor lost money.
[00:03:29] And these are just a few of the many ways that the verb break can be used in English. How many senses are at work here? It's very hard to say: it could be one, it could be two, it could be ten. It's very hard to delimit word senses, but it is very clear from this data that our sense for the verb break is being shaped by the immediate linguistic context.
[00:03:48] Here are a few additional examples. We have things like flat tire, flat beer, flat note, flat surface. It's clear that there is a conceptual core running through all of these uses, but it's also true that a flat tire is a very different sense for flat than we get from flat note or flat surface. We have something similar for throw a party, throw a fight, throw a ball, throw a fit. We have a mixture of what you might call literal and metaphorical uses here, but again a kind of common core that we're drawing on. The bottom line is that the sense for throw is very different depending on what kind of linguistic context it's in.
[00:04:26] And we can extend this to things that seem to turn more on world knowledge. If you have something like "A crane caught a fish", we have a sense that the crane here is a bird, whereas if we have "A crane picked up the steel beam", we have a sense that it's a piece of equipment. This seems like something that's guided by our understanding of birds and equipment and fish and beams, and when we have relatively unbiased sentences like "I saw a crane", we're kind of left guessing about which object is involved, the bird or the machine.
[00:04:55] And we can extend this past world knowledge into things that are more like discourse understanding. If you have a sequence like "Are there typos? I didn't see any", we have a feeling that something is elided, probably localized on any, and any here means "any typos" as a result of the preceding linguistic context. "Are there any bookstores downtown? I didn't see any": same second sentence, but now the sense of any is probably going to be something more like "bookstores" as a result of the discourse context that it appears in.
[00:05:26] So all of this is just showing how much individual linguistic units can be shaped by context. Linguists know this deeply; this is a primary thing that linguists try to get a grip on, and I think it's a wonderful point of connection between what linguists do and the way we're representing examples in NLP using contextual models. This is a very exciting development for me as a linguist as well as an NLPer.
[00:05:51] Here's another set of intuitions, related more to things like model architecture and what you might call inductive biases for different model designs. Let's start up here on the left. This is a high-bias model, in the sense that it makes a lot of a priori decisions about how we will represent our examples. The idea is that we have three tokens, which we look up in a fixed embedding space, and then we have decided to summarize those embeddings by simply summing them together to get a representation for the entire example. Very few of these components are learned as part of our problem; we've made most of the decisions ahead of time.
[00:06:25] As we've seen, as we move to a recurrent neural network, we relax some of those assumptions. We're still going to look up words in a fixed embedding space, but now, instead of deciding that we know the proper way to combine them with summation, we're going to learn from our data a very complicated function for combining them, and that will presumably allow us to be more responsive, that is, less biased, about what the data are likely to look like.
[00:06:49] The tree-structured architecture down here is an interesting mixture of these ideas. It's like the recurrent neural network, except that instead of assuming that I can process the data left to right, the data get processed into constituents, like "the rock" as a constituent, as excluded from "rules" over here. Now this is probably going to be very powerful if we're correct that the language data are structured according to these constituents, because it will give us a boost in terms of learning. It could be counterproductive, though, to the extent that that constituent structure is wrong, and I think that's showing that biases that we impose at the level of our architectures can be helpful as well as a hindrance, depending on how they align with the data-driven problem that we're trying to solve.
[00:07:32] In the bottom right here, I have the least biased model, in all these senses, of all the ones depicted here. I've got a recurrent neural network like this one, except now I'm assuming that information can flow bidirectionally, so there is no longer a presumption of left to right. In addition, I've added these attention mechanisms, which we'll talk a lot about in this unit; essentially, think of them as ways of creating special connections between all of the hidden units. The idea here is that we would let the data tell us how to weight all of these various connections and, in turn, represent our examples. We're making very few decisions ahead of time about what kinds of connections could be made, and instead just listening to the data and the learning process.
[00:08:13] And we are, at this point, on the road toward the transformer, which is a kind of extreme case of connecting everything with everything else and then allowing the data to tell us how to weight all of those various connections.
[00:08:26] And that does bring us to this notion of attention, which we've not discussed before, but I think I can introduce the concepts and a bit of the math, and then we'll see them again throughout this unit. Let's start with a simple sentiment example, and imagine we're dealing with a recurrent neural network classifier. Our example is "really not so good", and we're going to fit the classifier, traditionally, on top of this final hidden state over here. But we might worry about doing that: by the time we've gotten to this final state, the contribution of these earlier words, which are clearly important linguistically, might be sort of forgotten or overly diffuse. So attention mechanisms would be a way for us to bring that information back in and infuse this representation h_C with some of those previous important connections.
[00:09:12] Here's how we do that. We're going to first have some attention scores, which are simply dot products of our target vector with all the preceding hidden states. That gives us a vector of scores, which are traditionally softmax-normalized. Then what we do is create a context vector by weighting each of the previous hidden states by its attention weight and taking the average of those, to give us the context vector k. Then we're going to have this special layer here, which concatenates k with our previous final hidden state and feeds that through a layer of learned parameters and a non-linearity, to give us this new hidden representation h-tilde. And it is h-tilde that is finally the input to our softmax classifier. Whereas before we would have simply directly input h_C up here, we now input this more refined version, which draws on all of these attention connections that we created with these mechanisms. And again, as you'll see, the transformer does this all over the place, with all of its representations at various points in its computations.
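The attention computation just described can be written out directly. This is a minimal NumPy sketch under stated assumptions: the RNN hidden states and the layer parameters W are random stand-ins for values a real model would learn.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max())  # shift for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(1)
d = 4
h_prev = rng.normal(size=(3, d))  # hidden states for the earlier words
h_c = rng.normal(size=d)          # final hidden state of the RNN

scores = h_prev @ h_c             # attention scores: dot products with h_c
alphas = softmax(scores)          # softmax-normalized attention weights
k = alphas @ h_prev               # context vector: weighted average of states

# Concatenate k with h_c and feed through learned parameters + non-linearity
# (W is a random stand-in for a learned weight matrix):
W = rng.normal(size=(2 * d, d))
h_tilde = np.tanh(np.concatenate([k, h_c]) @ W)
print(h_tilde.shape)  # (4,) -- h_tilde is the input to the softmax classifier
```

Because the weights alphas sum to 1, the weighted sum is exactly the weighted average described in the lecture.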
[00:10:15] Here's another guiding idea that really shapes how these models work. I've called this "word pieces", and we've seen it before: these models typically do not tokenize data in the way that we might expect. I've loaded in a BERT tokenizer, and you can see that for a sentence like "this isn't too surprising", the result is some pretty familiar tokens, by and large. But when I feed in something like "encode me", the intuitive word encode is split apart into two word pieces, and clearly we're kind of implicitly assuming that the model, because it's contextual, can figure out that these pieces are, in some conceptual sense, one word. You might extend that up to idioms like "out of this world", where we treat them as a bunch of distinct tokens, but we might hope the model can learn that there's an idiomatic unity to that phrase. This also has the side advantage that, for unknown tokens like "Snuffleupagus", it can break them apart into familiar pieces, and we at least have a hope of getting a sensible representation for that out-of-vocabulary item. The result of all this is that these models can get away with having very small vocabularies, precisely because we are relying on them implicitly to be truly contextual.
[00:11:24] Here's another inspiring idea that we've not encountered before. This is called positional encoding, and it's another way in which we can capture the sensitivity of words to their context. As you'll see, when you go all the way down inside the transformer architecture, you do have a traditional static embedding of the sort we discussed in the first unit of this course; those are in light gray here, fixed representations for the words. However, in the context of a model like BERT, what we traditionally think of as its embedding representation is actually a combination of that fixed embedding and a separate embedding space called the positional embedding, where we have learned representations for each position in a sequence. This has the intriguing property that one and the same word, like "the", will have a different embedding, in the sense of the green representation here, depending on where it appears in your sequence. So right from the get-go we have a notion of context sensitivity, even before we've started to connect things in all sorts of interesting ways.
[00:12:25] Now let's move to some current issues and efforts, some high-level things that you might think about as you work through this unit. This is a really nice graph from the ELECTRA paper, from Clark et al. Along the x-axis here we have floating point operations, which you could think of as a kind of basic measure of the compute resources needed to create these representations, and along the y-axis we have GLUE score, a standard NLU benchmark.
[00:12:50] The point of this plot is that we're reaching kind of diminishing returns. We had rapid increases from GloVe and GPT up through to BERT, where we're really doing much better on these GLUE scores; we're increasing the floating point operations, but it seems commensurate with how we're doing on the benchmark. But now, with these larger models like XLNet and RoBERTa, it's arguably the case that we're reaching diminishing returns. RoBERTa involves more than 3,000 times the floating point operations of GloVe, but it's not that much better along this y-axis than some of its simpler variants, like BERT-base. And so this is something we should think about, in terms of the costs in money, the environment, energy, and so forth, when we think about developing these large models.
[00:13:38] And here's a really extreme case. Who knows how long we can train these things, or how much benefit we'll get when we do so, but at a certain point we're likely to incur costs that are larger than any gains that we can justify on the problems we're trying to solve. That kind of does lead us to this lovely paper, which talks about the environmental footprint of training these really big models, and it shows that training a big transformer from scratch really incurs a large environmental cost.
[00:14:05] That's certainly something we should have in mind as we think about using these models. For me it's a complicated question, though, because it's offset by the fact that, by and large, all of us aren't training these from scratch, but rather benefiting from publicly available pre-trained representations. So while the pre-training for that one version had a large environmental cost, it feels like it's kind of offset by the fact that a lot of us are benefiting from it, and it might be that, in aggregate, this is less environmentally costly than the old days, when all of us always trained all of our models literally from scratch. I just don't know how to do the calculations here, but I do know that increased access has been empowering and is likely offsetting some of the costs, and a lot of that is due to the contributions of the Hugging Face library.
[00:14:53] There are a lot of efforts along these same lines to make BERT smaller by compressing it (literally fewer dimensions), and other kinds of simplifications of the training process, and BERT distillation, and so forth. Here are two outstanding contributions, kind of compendiums of lots of different ideas in this space. And I also highly recommend this lovely paper called "A Primer in BERTology", which explores a lot of different aspects of what we know about BERT and how it works: various variations people have tried, various things people have done to probe these models and understand their learning dynamics, and so forth and so on. It's a very rich contribution, and it can certainly be a resource for you as you think about these models.
[00:15:34] and
[00:15:35] and just because we don't have time to cover
[00:15:37] just because we don't have time to cover them all there are a bunch of
[00:15:38] them all there are a bunch of interesting transformer variants that we
[00:15:40] interesting transformer variants that we will not be able to discuss in detail
[00:15:42] will not be able to discuss in detail i thought i'd mention them here sbert is
[00:15:44] i thought i'd mention them here sbert is an attempt to develop sentence level
[00:15:46] an attempt to develop sentence level representations from bert that are
[00:15:47] representations from bert that are particularly good at finding which
[00:15:49] particularly good at finding which sentences are similar to which other
[00:15:51] sentences are similar to which other sentences according to cosine similarity
[00:15:54] sentences according to cosine similarity i think that could be a powerful mode of
[00:15:55] i think that could be a powerful mode of thinking about these representations and
[00:15:58] thinking about these representations and also of practical utility if you need to
[00:16:00] also of practical utility if you need to find
[00:16:00] find which sentences are similar to which
[00:16:02] which sentences are similar to which others
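Finding which sentences are most similar under cosine similarity can be sketched in a few lines of numpy. The three-dimensional "embeddings" below are toy placeholders, not real SBERT outputs, and the sentences are invented for illustration:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two sentence vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy three-dimensional "sentence embeddings"; real ones would come
# from an SBERT-style encoder and have hundreds of dimensions.
emb = {
    "the rock rules": np.array([0.9, 0.1, 0.2]),
    "the rock is great": np.array([0.8, 0.2, 0.3]),
    "stock prices fell": np.array([0.1, 0.9, 0.1]),
}

query = "the rock rules"
scores = {s: cosine_similarity(emb[query], v)
          for s, v in emb.items() if s != query}
best = max(scores, key=scores.get)
```

With these toy vectors the semantically closer sentence gets the higher cosine score, which is exactly the ranking behavior described above.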
[00:16:03] others you have probably heard of gpt the
[00:16:06] you have probably heard of gpt the generative pre-trained transformer in
[00:16:08] generative pre-trained transformer in various of its forms you can get gpt2
[00:16:11] various of its forms you can get gpt2 from hugging face and
[00:16:13] from hugging face and you have unfettered access to it and of
[00:16:15] you have unfettered access to it and of course there's more restrictive access
[00:16:17] course there's more restrictive access at this point to gpt3 these are
[00:16:19] at this point to gpt3 these are conditional language models so quite
[00:16:21] conditional language models so quite different from bert and they might be
[00:16:23] different from bert and they might be better than bert for things like truly
[00:16:25] better than bert for things like truly conditional language generation
[00:16:28] conditional language generation xlnet is an attempt to bring in much
[00:16:30] xlnet is an attempt to bring in much more context into these models
[00:16:33] more context into these models it stands for extra long transformer so
[00:16:35] it stands for extra long transformer so if you need to process long sequences
[00:16:37] if you need to process long sequences this might be a good choice and this is
[00:16:39] this might be a good choice and this is also an attempt to bring in some of the
[00:16:40] also an attempt to bring in some of the benefits of conditional language models
[00:16:43] benefits of conditional language models into a mode that is more bi-directional
[00:16:45] into a mode that is more bi-directional the way bert is
[00:16:47] the way bert is t5 is another conditional language model
[00:16:49] t5 is another conditional language model as is bart these models might be better
[00:16:51] as is bart these models might be better choices for you if you need to actually
[00:16:53] choices for you if you need to actually generate language where i think the
[00:16:55] generate language where i think the standard wisdom is that models like bert
[00:16:57] standard wisdom is that models like bert and roberta are better if you simply
[00:16:59] and roberta are better if you simply need good representations for fine
[00:17:01] need good representations for fine tuning on a classification problem for
[00:17:03] tuning on a classification problem for example
[00:17:05] example more models will appear every day and i
[00:17:07] more models will appear every day and i think it's worth trying to stay up to
[00:17:09] think it's worth trying to stay up to speed on the various developments in
[00:17:11] speed on the various developments in this space because this is probably just
[00:17:13] this space because this is probably just the tip of the iceberg here
Lecture 021
Transformers | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=Nsc0Yluf2yc
---
Transcript
[00:00:04] welcome everyone this is part two in our
[00:00:06] welcome everyone this is part two in our series on contextual word
[00:00:08] series on contextual word representations we're going to be
[00:00:09] representations we're going to be talking about the transformer
[00:00:10] talking about the transformer architecture which is the central piece
[00:00:12] architecture which is the central piece for all the models we'll be exploring in
[00:00:14] for all the models we'll be exploring in this unit
[00:00:15] this unit let's dive into the model structure
[00:00:17] let's dive into the model structure we'll work through this using a simple
[00:00:19] we'll work through this using a simple example at the bottom here i've got the
[00:00:21] example at the bottom here i've got the input sequence the rock rules and i've
[00:00:23] input sequence the rock rules and i've indicated in red that we're going to be
[00:00:25] indicated in red that we're going to be keeping track of the positions of each
[00:00:26] keeping track of the positions of each one of those tokens in the sequence
[00:00:29] one of those tokens in the sequence the first step is a familiar one we're
[00:00:31] the first step is a familiar one we're going to look up both the words and the
[00:00:32] going to look up both the words and the positions in separate embedding spaces
[00:00:35] positions in separate embedding spaces those are fixed embedding spaces that
[00:00:37] those are fixed embedding spaces that we'll learn as part of learning all the
[00:00:38] we'll learn as part of learning all the parameters in this model i've given the
[00:00:41] parameters in this model i've given the word embeddings in light gray and the
[00:00:43] word embeddings in light gray and the positional embeddings in dark grey
[00:00:45] positional embeddings in dark grey to form what we think of as the actual
[00:00:47] to form what we think of as the actual embedding for this model we do an
[00:00:49] embedding for this model we do an element-wise addition of the word
[00:00:51] element-wise addition of the word embedding with the positional embedding
[00:00:53] embedding with the positional embedding and that gives us the representations
[00:00:54] and that gives us the representations that are in green here you can see that
[00:00:56] that are in green here you can see that on the right side of the slide i'm going
[00:00:58] on the right side of the slide i'm going to be keeping track of all of the
[00:00:59] to be keeping track of all of the calculations with regard to this c
[00:01:02] calculations with regard to this c column here and they're completely
[00:01:03] column here and they're completely parallel for columns a and b
[00:01:05] parallel for columns a and b so to form c input we do element wise
[00:01:07] so to form c input we do element wise addition of x34 the embedding for the
[00:01:10] addition of x34 the embedding for the word rules and p3 which is the embedding
[00:01:12] word rules and p3 which is the embedding for position three in this model
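The embedding step just described can be sketched in numpy. The vocabulary, dimensionality, and random tables here are toy placeholders standing in for the learned embedding spaces:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # toy embedding dimensionality
vocab = {"the": 0, "rock": 1, "rules": 2}
max_len = 8

# Word and positional embedding tables; learned in the real model,
# random placeholders here.
W_word = rng.normal(size=(len(vocab), d))
W_pos = rng.normal(size=(max_len, d))

tokens = ["the", "rock", "rules"]
# Element-wise addition of word embedding and positional embedding,
# e.g. the third column is (embedding of "rules") + (embedding of position 3).
inputs = np.stack([W_word[vocab[t]] + W_pos[i] for i, t in enumerate(tokens)])
```

Each row of `inputs` is one of the green position-sensitive representations, one per column of the example.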
[00:01:16] for position three in this model the next layer is really the hallmark of
[00:01:18] the next layer is really the hallmark of this architecture and what gives the
[00:01:19] this architecture and what gives the paper its title attention is all you
[00:01:21] paper its title attention is all you need we're going to form a bunch of
[00:01:23] need we're going to form a bunch of dense dot product connections between
[00:01:26] dense dot product connections between all of these representations so you can
[00:01:28] all of these representations so you can think of those as forming these
[00:01:29] think of those as forming these connections that look like this a dense
[00:01:31] connections that look like this a dense thicket of them
[00:01:32] thicket of them on the right here i've given the core
[00:01:34] on the right here i've given the core calculation and it should be familiar
[00:01:36] calculation and it should be familiar from part one in this unit it's exactly
[00:01:38] from part one in this unit it's exactly the calculation i presented there with
[00:01:40] the calculation i presented there with just two small changes but fundamentally
[00:01:43] just two small changes but fundamentally if our target vector is c input here
[00:01:45] if our target vector is c input here we're attending to
[00:01:46] we're attending to inputs a and b
[00:01:48] inputs a and b we do that by forming the dot products
[00:01:50] we do that by forming the dot products here and the one twist from before is
[00:01:52] here and the one twist from before is that instead of just taking those dot
[00:01:54] that instead of just taking those dot products we'll normalize them by the
[00:01:56] products we'll normalize them by the square root of the dimensionality of the
[00:01:58] square root of the dimensionality of the model dk
[00:01:59] model dk dk is an important value here because of
[00:02:01] dk is an important value here because of the way we combine representations in
[00:02:03] the way we combine representations in the transformer all of the outputs that
[00:02:06] the transformer all of the outputs that all the layers we look at have to have
[00:02:08] all the layers we look at have to have the same dimensionality as given by dk
[00:02:10] the same dimensionality as given by dk and so what we're doing here is
[00:02:12] and so what we're doing here is essentially scaling these dot products
[00:02:13] essentially scaling these dot products to kind of keep them within a sensible
[00:02:15] to kind of keep them within a sensible range
[00:02:17] range that gives us a score vector alpha tilde
[00:02:19] that gives us a score vector alpha tilde we soft max normalize them and then the
[00:02:21] we soft max normalize them and then the other twist is that instead of using
[00:02:23] other twist is that instead of using mean here as we did before we use
[00:02:24] mean here as we did before we use summation but the actual vector is the
[00:02:27] summation but the actual vector is the one we calculated before we're going to
[00:02:28] one we calculated before we're going to take weighted versions of a input and b
[00:02:31] take weighted versions of a input and b input according to this vector of
[00:02:33] input according to this vector of weights that we created here
[00:02:35] weights that we created here that gives us the representation c
[00:02:37] that gives us the representation c attention as given in orange here and we
[00:02:40] attention as given in orange here and we do that of course for all the other
[00:02:41] do that of course for all the other positions in the model
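The attention calculation for a single target vector can be sketched directly in numpy. This is a minimal stand-in with random toy vectors, following the recipe above: scaled dot products, softmax normalization, then a weighted sum:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(target, others, d_k):
    # Dot products of the target with each attended vector, scaled by
    # sqrt(d_k) to keep the scores in a sensible range.
    scores = np.array([target @ o for o in others]) / np.sqrt(d_k)
    alpha = softmax(scores)                 # softmax-normalised weights
    # Weighted sum (not mean) of the attended vectors.
    return sum(a * o for a, o in zip(alpha, others))

d_k = 4
rng = np.random.default_rng(1)
a_inp, b_inp, c_inp = rng.normal(size=(3, d_k))
# c attends to a and b, as in the worked example.
c_attention = attend(c_inp, [a_inp, b_inp], d_k)
```

The result is a convex combination of the attended vectors, weighted by how strongly the target dot-products with each of them.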
[00:02:43] positions in the model the next step is kind of interesting
[00:02:45] the next step is kind of interesting we're creating what's called a residual
[00:02:47] we're creating what's called a residual connection so to get c a layer here in
[00:02:49] connection so to get c a layer here in yellow
[00:02:50] yellow we add up c input and this attention
[00:02:53] we add up c input and this attention representation that we just created and
[00:02:55] representation that we just created and apply drop out as a regularization
[00:02:57] apply drop out as a regularization step there and that gives us ca layer
[00:03:00] step there and that gives us ca layer the interesting thing there of course is
[00:03:01] the interesting thing there of course is this residual connection instead of
[00:03:03] this residual connection instead of simply feeding forward c attention we
[00:03:05] simply feeding forward c attention we feed forward actually a version of it
[00:03:07] feed forward actually a version of it that's combined with our initial
[00:03:09] that's combined with our initial positionally encoded embedding
[00:03:13] then we follow that with a step of layer
[00:03:14] then we follow that with a step of layer normalization which should help with
[00:03:16] normalization which should help with optimization it's going to kind of scale
[00:03:18] optimization it's going to kind of scale the weights and these representations
[00:03:21] the weights and these representations the next step is more meaningful this is
[00:03:22] the next step is more meaningful this is a series of two dense layers so we'll
[00:03:25] a series of two dense layers so we'll take ca norm here and feed it through
[00:03:27] take ca norm here and feed it through this dense layer with a non-linearity
[00:03:30] this dense layer with a non-linearity followed by another linear layer to
[00:03:32] followed by another linear layer to give us cff as given in dark blue here
[00:03:35] give us cff as given in dark blue here and that's followed by another one of
[00:03:36] and that's followed by another one of these interesting residual connections
[00:03:38] these interesting residual connections so we'll apply drop out to cff
[00:03:41] so we'll apply drop out to cff and then add that in with ca norm as
[00:03:43] and then add that in with ca norm as given down here at the bottom and that
[00:03:44] given down here at the bottom and that gives us this second yellow
[00:03:46] gives us this second yellow representation
[00:03:47] representation we follow that by one more step of layer
[00:03:50] we follow that by one more step of layer normalization and that gives us the
[00:03:52] normalization and that gives us the output for this block of transformer
[00:03:54] output for this block of transformer representations
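Putting the pieces together, one whole block can be sketched as below. This is a simplification assuming plain single-headed self-attention with Q = K = V = X, and it omits dropout and the learned gain/bias of layer normalization:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Per-row normalisation; the learned gain and bias are omitted here.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def transformer_block(X, W1, b1, W2, b2):
    """Attention -> residual + layer norm -> two dense layers
    -> residual + layer norm (dropout omitted)."""
    d_k = X.shape[-1]
    attn = softmax(X @ X.T / np.sqrt(d_k)) @ X      # attention layer
    X = layer_norm(X + attn)                        # residual connection
    ff = np.maximum(0, X @ W1 + b1) @ W2 + b2       # dense + relu, then linear
    return layer_norm(X + ff)                       # second residual

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                         # three columns, d_k = 4
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)
out = transformer_block(X, W1, b1, W2, b2)
```

Note that the output has the same shape as the input, which is what lets these blocks be stacked.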
[00:03:56] representations and you can imagine of course as you'll
[00:03:57] and you can imagine of course as you'll see that we can stack up these
[00:03:59] see that we can stack up these transformer blocks and the way that we
[00:04:00] transformer blocks and the way that we do that
[00:04:02] do that is essentially by taking these dark
[00:04:04] is essentially by taking these dark green representations at the top here
[00:04:05] green representations at the top here and using them as inputs and all the
[00:04:07] and using them as inputs and all the calculations are the same so you might
[00:04:09] calculations are the same so you might imagine that we could continue here by
[00:04:11] imagine that we could continue here by just doing a dense series of attention
[00:04:13] just doing a dense series of attention connections across these and then
[00:04:14] connections across these and then continuing on with the calculations i
[00:04:16] continuing on with the calculations i just presented and in that way we could
[00:04:18] just presented and in that way we could stack up transformer blocks and i'll
[00:04:20] stack up transformer blocks and i'll return to that later on
[00:04:23] return to that later on there are a few other things that are
[00:04:24] there are a few other things that are worth pointing out that are kind of
[00:04:25] worth pointing out that are kind of noteworthy about this model
[00:04:27] noteworthy about this model it looks like a complicated series of
[00:04:29] it looks like a complicated series of calculations but i would say that
[00:04:31] calculations but i would say that fundamentally what's happening here is
[00:04:33] fundamentally what's happening here is we're doing positional encoding to get
[00:04:35] we're doing positional encoding to get embeddings so that they are position
[00:04:37] embeddings so that they are position sensitive representations of words
[00:04:40] sensitive representations of words we follow that with an attention layer
[00:04:42] we follow that with an attention layer which creates that dense thicket of
[00:04:44] which creates that dense thicket of connections between all of the words as
[00:04:46] connections between all of the words as positionally encoded
[00:04:48] positionally encoded then we have these optimization things
[00:04:49] then we have these optimization things woven in but fundamentally we're
[00:04:51] woven in but fundamentally we're following that attention step with two
[00:04:53] following that attention step with two series of feed forward layer steps here
[00:04:56] series of feed forward layer steps here followed by the same process of drop out
[00:04:58] followed by the same process of drop out and layer normalization so if you kind
[00:05:00] and layer normalization so if you kind of alighted the yellow and the purple
[00:05:03] of alighted the yellow and the purple you would see that what we're really
[00:05:04] you would see that what we're really doing is a tension followed by feet
[00:05:06] doing is a tension followed by feet forward and then as we stack these
[00:05:08] forward and then as we stack these things it would be attention feed
[00:05:09] things it would be attention feed forward attention feed forward as we
[00:05:11] forward attention feed forward as we climbed up and interwoven into there are
[00:05:14] climbed up and interwoven into there are some things that i would say help with
[00:05:15] some things that i would say help with optimization
[00:05:17] optimization another noteworthy thing about this
[00:05:19] another noteworthy thing about this model is that the only sense in which we
[00:05:22] model is that the only sense in which we are keeping track of the linear order of
[00:05:24] are keeping track of the linear order of the sequence is in those positional
[00:05:26] the sequence is in those positional embeddings if not for them the column
[00:05:28] embeddings if not for them the column order would be completely irrelevant
[00:05:30] order would be completely irrelevant because of course we've created all of
[00:05:32] because of course we've created all of these symmetric connections at the
[00:05:33] these symmetric connections at the attention layer and there are no other
[00:05:35] attention layer and there are no other connections across these columns so the
[00:05:38] connections across these columns so the only sense in which column order that is
[00:05:40] only sense in which column order that is word order matters here is via those
[00:05:43] word order matters here is via those positional embeddings
[00:05:47] here's a more detailed look at the
[00:05:49] here's a more detailed look at the attention calculations themselves i just
[00:05:51] attention calculations themselves i just want to bring out how this actually
[00:05:52] want to bring out how this actually works at a mechanical level so this is
[00:05:54] works at a mechanical level so this is the calculation as i presented it on the
[00:05:56] the calculation as i presented it on the previous slide and in part one of this
[00:05:58] previous slide and in part one of this unit
[00:06:00] unit in the paper and now commonly it's
[00:06:02] in the paper and now commonly it's presented in this matrix format and if
[00:06:03] presented in this matrix format and if you're like me it's not obvious right
[00:06:06] you're like me it's not obvious right away that these are equivalent
[00:06:07] away that these are equivalent calculations so what i've done for these
[00:06:09] calculations so what i've done for these next two slides is just show you via
[00:06:12] next two slides is just show you via worked out examples how those
[00:06:14] worked out examples how those calculations work and how they arrive at
[00:06:16] calculations work and how they arrive at exactly the same values i'm not going to
[00:06:19] exactly the same values i'm not going to spend too much time on this here this is
[00:06:20] spend too much time on this here this is really just here for you if you would
[00:06:22] really just here for you if you would like to work through the calculations in
[00:06:24] like to work through the calculations in detail which i strongly encourage
[00:06:26] detail which i strongly encourage because this is really the fundamental
[00:06:28] because this is really the fundamental step in this model
[00:06:30] step in this model and here's all the details that you
[00:06:31] and here's all the details that you would need to get hands-on with these
[00:06:33] would need to get hands-on with these ideas
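One way to convince yourself the per-vector and matrix presentations agree is to run both on the same toy matrix. This sketch uses plain self-attention with Q = K = V = X and no learned projections:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
n, d_k = 3, 4
X = rng.normal(size=(n, d_k))

# Matrix form from the paper: softmax(Q K^T / sqrt(d_k)) V, with Q = K = V = X.
matrix_out = softmax(X @ X.T / np.sqrt(d_k)) @ X

# Per-vector form from the previous presentation, one target column at a time.
loop_out = np.stack([softmax(X @ X[i] / np.sqrt(d_k)) @ X for i in range(n)])
```

Row i of X @ X.T is exactly the vector of dot products of target i with every column, so the two computations produce the same values.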
[00:06:34] ideas now so far i've presented attention in a
[00:06:37] now so far i've presented attention in a kind of simplified way a hallmark of
[00:06:39] kind of simplified way a hallmark of attention is in the transformer is that
[00:06:41] attention is in the transformer is that it is typically multi-headed attention
[00:06:43] it is typically multi-headed attention so let me unpack that idea a little bit
[00:06:45] so let me unpack that idea a little bit concretely we'll start with our
[00:06:48] concretely we'll start with our input sequence from before and we'll be
[00:06:50] input sequence from before and we'll be looking at these green representations
[00:06:52] looking at these green representations here
[00:06:52] here and the idea behind the multi-headed attention
[00:06:55] and the idea behind the multi-headed attention mechanism is that we're going to inject a
[00:06:56] mechanisms that we're going to inject a bunch of learned parameters into this
[00:06:58] bunch of learned parameters into this process to encourage diversity as part
[00:07:01] process to encourage diversity as part of the learning process and learn
[00:07:02] of the learning process and learn really diverse and interesting
[00:07:03] really diverse and interesting representations
[00:07:05] representations so here's how that works we're going to
[00:07:07] so here's how that works we're going to form three representations here using
[00:07:09] form three representations here using that same dot product mechanism as
[00:07:11] that same dot product mechanism as before and fundamentally it's the same
[00:07:13] before and fundamentally it's the same calculation
[00:07:14] calculation except now we're going to have a bunch
[00:07:16] except now we're going to have a bunch of learned weight parameters as given in
[00:07:18] of learned weight parameters as given in orange here and those will help us with
[00:07:20] orange here and those will help us with two things first injecting diversity
[00:07:22] two things first injecting diversity into this process and also smooshing the
[00:07:24] into this process and also smooshing the dimensionality of the representations
[00:07:26] dimensionality of the representations down to one-third of the size that we're
[00:07:28] down to one-third of the size that we're targeting as dk for our model
[00:07:30] targeting as dk for our model dimensionality and you'll see why that
[00:07:31] dimensionality and you'll see why that happens in a second
[00:07:33] happens in a second but fundamentally what we're doing is
[00:07:35] but fundamentally what we're doing is exactly the calculation we did before
[00:07:37] exactly the calculation we did before but now with these learned parameters
[00:07:39] but now with these learned parameters injected into it so if you squint you
[00:07:40] injected into it so if you squint you can see that this is really the dot
[00:07:42] can see that this is really the dot product of c input with a input as
[00:07:44] product of c input with a input as before but now is transformed by these
[00:07:47] before but now is transformed by these learned parameters that are given in
[00:07:48] learned parameters that are given in orange and that repeats for these other
[00:07:50] orange and that repeats for these other calculations
[00:07:52] calculations so we're going to do that for position a
[00:07:54] so we're going to do that for position a and we do it also for position b
[00:07:56] and we do it also for position b and it's the same calculation but now
[00:07:58] and it's the same calculation but now with new parameters
[00:08:00] with new parameters for the second position
[00:08:01] for the second position and then for the third head exactly the
[00:08:03] and then for the third head exactly the same calculation but new learn
[00:08:05] same calculation but new learn parameters up at the top here so this is
[00:08:07] parameters up at the top here so this is three-headed attention and the way we
[00:08:10] three-headed attention and the way we actually form the representations that
[00:08:12] actually form the representations that proceed with the rest of the calculation
[00:08:14] proceed with the rest of the calculation of the transformer architecture as
[00:08:16] of the transformer architecture as presented before
[00:08:17] presented before is by concatenating the three
[00:08:19] is by concatenating the three representations we created for each one
[00:08:21] representations we created for each one of these units so the a column is the
[00:08:24] of these units so the a column is the first representation in each one of
[00:08:26] first representation in each one of these heads
[00:08:27] these heads the b column is the second
[00:08:29] the b column is the second representation of each head and
[00:08:31] representation of each head and similarly for the c column as the third
[00:08:33] similarly for the c column as the third representation in each one of these
[00:08:34] representation in each one of these heads and that's why each one of these
[00:08:36] heads and that's why each one of these needs to have one-third the
[00:08:38] needs to have one-third the dimensionality of our full model so that
[00:08:40] dimensionality of our full model so that we can concatenate them and then feed
[00:08:42] we can concatenate them and then feed those into the subsequent calculations
[00:08:45] those into the subsequent calculations the idea here of course is that
[00:08:46] the idea here of course is that injecting all of these learned
[00:08:48] injecting all of these learned parameters into all of these different
[00:08:50] parameters into all of these different heads we're providing the model a chance
[00:08:52] heads we're providing the model a chance to learn lots of diverse ways of
[00:08:54] to learn lots of diverse ways of relating the words in the sequence
[00:08:58] relating the words in the sequence the final point is one i've already
[00:08:59] the final point is one i've already mentioned before which is that typically
[00:09:01] mentioned before which is that typically we don't have just one transformer block
[00:09:03] we don't have just one transformer block but rather a whole stack of them we can
[00:09:05] but rather a whole stack of them we can repeat them n times for models you're
[00:09:08] repeat them n times for models you're working with you might have 12 or 24 or
[00:09:10] working with you might have 12 or 24 or even more
[00:09:11] even more blocks in the transformer architecture
[00:09:13] blocks in the transformer architecture and the way we do that as i said is
[00:09:15] and the way we do that as i said is simply by taking the dark green
[00:09:17] simply by taking the dark green representations at the output layer here
[00:09:19] representations at the output layer here and using them as inputs to a subsequent
[00:09:21] and using them as inputs to a subsequent block so they get attended to when we
[00:09:23] block so they get attended to when we proceed with the subsequent
[00:09:25] proceed with the subsequent regularization and feed-forward steps
[00:09:27] regularization and feed-forward steps just as before and when you work with
[00:09:29] just as before and when you work with these models in hugging face if you ask
[00:09:32] these models in hugging face if you ask for all of the hidden states what you're
[00:09:33] for all of the hidden states what you're getting is a grid of representations
[00:09:36] getting is a grid of representations corresponding to these output blocks in
[00:09:38] corresponding to these output blocks in green here
[00:09:39] green here and of course just as a reminder i'm not
[00:09:41] and of course just as a reminder i'm not indicating it here but there's actually
[00:09:43] indicating it here but there's actually multi-headed attention at each one of
[00:09:45] multi-headed attention at each one of these blocks through each one of the
[00:09:46] these blocks through each one of the layers so there are a lot of learned
[00:09:48] layers so there are a lot of learned parameters in this model especially if
[00:09:50] parameters in this model especially if you have 12 or 24 attention heads
[00:09:55] at this point i'm hoping that you can
[00:09:57] at this point i'm hoping that you can now fruitfully return to the original
[00:09:59] now fruitfully return to the original vaswani et al paper and look at their
[00:10:01] vaswani et al paper and look at their model diagram and get more out of it for
[00:10:03] model diagram and get more out of it for me it's kind of hyper compressed but now
[00:10:05] me it's kind of hyper compressed but now that we've done a deep dive into all the
[00:10:07] that we've done a deep dive into all the pieces i think this serves as a kind of
[00:10:09] pieces i think this serves as a kind of useful shorthand for how all the pieces
[00:10:11] useful shorthand for how all the pieces fit together so let's just do that
[00:10:12] fit together so let's just do that quickly as before we have positional
[00:10:14] quickly as before we have positional encodings and input word embeddings and
[00:10:17] encodings and input word embeddings and those get added up to give us the
[00:10:18] those get added up to give us the intuitive notion of an embedding in this
[00:10:20] intuitive notion of an embedding in this model
[00:10:21] model that's followed by the attention layer
[00:10:23] that's followed by the attention layer as we discussed and it has a residual
[00:10:25] as we discussed and it has a residual connection here into that layer
[00:10:27] connection here into that layer normalization part that's fed into the
[00:10:30] normalization part that's fed into the feed forward blocks and that's followed
[00:10:32] feed forward blocks and that's followed by that same process of kind of drop out
[00:10:34] by that same process of kind of drop out and
[00:10:35] and layer normalization
[00:10:37] layer normalization and this is essentially saying each one
[00:10:39] and this is essentially saying each one of these is repeated for every step in
[00:10:41] of these is repeated for every step in the encoder process every one of the
[00:10:43] the encoder process every one of the columns that we looked at
[00:10:46] columns that we looked at right and if because then in the paper
[00:10:48] right and if because then in the paper they're working with an encoder decoder
[00:10:50] they're working with an encoder decoder model each decoder state self attends
[00:10:52] model each decoder state self attends with all of the fellow decoder states
[00:10:54] with all of the fellow decoder states and with all of the encoder states so
[00:10:56] and with all of the encoder states so imagine this double sequence
[00:10:58] imagine this double sequence we have a dense series of potential
[00:11:00] we have a dense series of potential connections connections across both
[00:11:02] connections connections across both parts of the representation
[00:11:05] parts of the representation on the right side when we're doing
[00:11:06] on the right side when we're doing decoding again we repeat this block for
[00:11:09] decoding again we repeat this block for every decoder state
[00:11:11] every decoder state and if every state there has an output
[00:11:13] and if every state there has an output as for machine translation or some kind
[00:11:15] as for machine translation or some kind of generation process then we'll have
[00:11:17] of generation process then we'll have something like this output stack at
[00:11:19] something like this output stack at every one of those output states if by
[00:11:22] every one of those output states if by contrast we're doing something like nli
[00:11:24] contrast we're doing something like nli or sentiment it's a classification
[00:11:25] or sentiment it's a classification problem maybe just one of those states
[00:11:27] problem maybe just one of those states will have one of these outputs on it
[00:11:30] will have one of these outputs on it and then the attention gets a little bit
[00:11:32] and then the attention gets a little bit complicated if you are doing decoding
[00:11:34] complicated if you are doing decoding for a model like natural language
[00:11:36] for a model like natural language generation or machine translation in the
[00:11:37] generation or machine translation in the decoder you can't attend into the future
[00:11:40] decoder you can't attend into the future as you're doing generation so there's a
[00:11:42] as you're doing generation so there's a masking process that limits
[00:11:44] masking process that limits self-attention to the preceding words in
[00:11:46] self-attention to the preceding words in the sequence that you're creating
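The future-masking idea just described can be sketched in a few lines. This is an illustrative reconstruction, not code from the lecture:

```python
# Illustrative sketch of decoder-side masking: position i may attend only
# to positions j <= i, so future positions get a logit of -infinity and
# therefore receive zero weight after the softmax.

def causal_mask(n):
    """n x n additive attention mask: 0.0 where allowed, -inf where blocked."""
    return [[0.0 if j <= i else float('-inf') for j in range(n)]
            for i in range(n)]

mask = causal_mask(4)
# Row 0 sees only position 0; row 3 (the last step) sees all four positions.
```

Adding this mask to the attention logits before the softmax is the standard way to keep generation from peeking ahead.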
Lecture 022
BERT | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=TKcSSwKNg7w
---
Transcript
[00:00:04] welcome everyone this is part three in
[00:00:06] welcome everyone this is part three in our series on contextual word
[00:00:07] our series on contextual word representations we're going to be
[00:00:09] representations we're going to be talking about the bert model which is an
[00:00:10] talking about the bert model which is an innovative and powerful application of
[00:00:12] innovative and powerful application of the transformer architecture which we
[00:00:14] the transformer architecture which we covered in the previous screencast
[00:00:16] covered in the previous screencast let's dive into the core model structure
[00:00:18] let's dive into the core model structure of bert we'll begin with the inputs as
[00:00:21] of bert we'll begin with the inputs as usual we'll work with a simple example
[00:00:22] usual we'll work with a simple example the rock rules that has three tokens but
[00:00:24] the rock rules that has three tokens but you'll notice that for bert we begin
[00:00:27] you'll notice that for bert we begin every sequence with the designated class
[00:00:29] every sequence with the designated class token and sequences end with a
[00:00:31] token and sequences end with a designated sep token and the sep token
[00:00:33] designated sep token and the sep token can also be used as a boundary marker
[00:00:35] can also be used as a boundary marker between subparts of an input sequence
[00:00:38] between subparts of an input sequence we'll learn positional embeddings for
[00:00:40] we'll learn positional embeddings for each one of those tokens and in addition
[00:00:43] each one of those tokens and in addition as you can see in maroon here we're
[00:00:44] as you can see in maroon here we're going to learn a second notion of
[00:00:45] going to learn a second notion of position
[00:00:47] position it's sentence a for all of these tokens so
[00:00:49] it's sentence a for all of these tokens so it won't contribute in this particular
[00:00:50] it won't contribute in this particular case but if we had a problem like
[00:00:52] case but if we had a problem like natural language inference where
[00:00:54] natural language inference where examples consist of a premise sentence
[00:00:57] examples consist of a premise sentence and a hypothesis sentence we could
[00:00:59] and a hypothesis sentence we could learn separate embeddings for the
[00:01:00] learn separate embeddings for the premise and the hypothesis and thereby
[00:01:02] premise and the hypothesis and thereby hope that we can capture that kind of
[00:01:04] hope that we can capture that kind of second notion of position within our
[00:01:06] second notion of position within our input
[00:01:08] input we're going to have learned embeddings
[00:01:10] we're going to have learned embeddings for each one of these notions for the
[00:01:12] for each one of these notions for the word and the two notions of position
[00:01:14] word and the two notions of position and as in the transformer the actual
[00:01:16] and as in the transformer the actual embeddings given here in my green will
[00:01:18] embeddings given here in my green will be additive combinations of those three
[00:01:20] be additive combinations of those three learned embedding representations
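The additive combination described here can be sketched as follows; the tiny two-dimensional vectors are invented purely for illustration, not BERT's real embeddings:

```python
# Sketch of BERT's input representation: the embedding for each token is
# the elementwise sum of a word embedding, a positional embedding, and a
# segment ("sentence A/B") embedding. All numbers here are made up.

word_emb = {'[CLS]': [0.1, 0.2], 'the': [0.3, 0.1], 'rock': [0.5, 0.4],
            'rules': [0.2, 0.6], '[SEP]': [0.0, 0.1]}
pos_emb = [[0.01 * i, 0.02 * i] for i in range(5)]   # one per position
seg_emb = {'A': [0.0, 0.0], 'B': [1.0, 1.0]}          # sentence A vs B

def bert_input(tokens, segments):
    return [[w + p + s
             for w, p, s in zip(word_emb[t], pos_emb[i], seg_emb[seg])]
            for i, (t, seg) in enumerate(zip(tokens, segments))]

# Single-segment input like "the rock rules": every token is sentence A,
# so the segment embedding contributes nothing distinctive here.
embs = bert_input(['[CLS]', 'the', 'rock', 'rules', '[SEP]'], ['A'] * 5)
```

For a premise/hypothesis pair you would pass `'A'` for premise tokens and `'B'` for hypothesis tokens, which is exactly the second notion of position discussed above.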
[00:01:25] from there we just do a lot of work with
[00:01:27] from there we just do a lot of work with the transformer we have repeated
[00:01:29] the transformer have we have repeated transformer blocks it could be 12 it
[00:01:30] transformer blocks it could be 12 it could be 24 it could be even more
[00:01:33] could be 24 it could be even more and the output of all of those
[00:01:34] and the output of all of those transformer blocks in the end is a
[00:01:36] transformer blocks in the end is a sequence of output representations these
[00:01:39] sequence of output representations these are vectors i've given them in dark green
[00:01:41] are vectors i've given them in dark green that is the core model structure for
[00:01:43] that is the core model structure for bert
[00:01:45] bert and that brings us to how this model is
[00:01:46] and that brings us to how this model is trained the masked language modeling
[00:01:48] trained the masked language modeling objective is the fundamental training
[00:01:50] objective is the fundamental training objective for this model
[00:01:52] objective for this model fundamentally what the model is trying
[00:01:54] fundamentally what the model is trying to do is work as an auto encoder and
[00:01:56] to do is work as an auto encoder and reproduce entire input sequences
[00:01:59] reproduce entire input sequences to make that problem non-trivial though
[00:02:01] to make that problem non-trivial though we're going to employ this masked language
[00:02:03] we're going to employ this masked language modeling idea and what that means is
[00:02:05] modeling idea and what that means is that we're going to go through these
[00:02:06] that we're going to go through these input sequences and randomly replace
[00:02:08] input sequences and randomly replace some small percentage 10 to 15 percent
[00:02:11] some small percentage 10 to 15 percent of the input tokens with this designated
[00:02:13] of the input tokens with this designated mask token
[00:02:15] mask token and then the job of the model is to
[00:02:16] and then the job of the model is to learn for those masked inputs to
[00:02:18] learn for those masked inputs to reconstruct what was the actual input
[00:02:21] reconstruct what was the actual input right so in this case we masked out
[00:02:23] right so in this case we masked out rules and the job of the model is to use
[00:02:25] rules and the job of the model is to use this bi-directional context that's
[00:02:27] this bi-directional context that's flowing in from all those attention
[00:02:28] flowing in from all those attention mechanisms to figure out that rules was
[00:02:31] mechanisms to figure out that rules was actually the token that belonged to that
[00:02:33] actually the token that belonged to that initial position
[00:02:35] initial position the bert team did a variant of this as
[00:02:37] the bert team did a variant of this as well which is masking by a random
[00:02:39] well which is uh masking by a random word so in this case we might replace
[00:02:42] word so in this case we might replace rules with a word like every picked
[00:02:44] rules with a word like every picked randomly from our vocabulary but there
[00:02:46] randomly from our vocabulary but there again the fundamental job of the model
[00:02:48] again the fundamental job of the model is to learn to figure out that rules was
[00:02:51] is to learn to figure out that rules was the actual token in that position
[00:02:53] the actual token in that position so it's going to make some prediction in
[00:02:54] so it's going to make some prediction in this case if it's different from rules
[00:02:57] this case if it's different from rules and the error signal will flow back down
[00:02:59] and the error signal will flow back down through all the parameters of this model
[00:03:00] through all the parameters of this model affecting we hope all the
[00:03:02] affecting we hope all the representations because of that dense
[00:03:04] representations because of that dense thicket of attention connections that
[00:03:06] thicket of attention connections that exists across these time steps and in
[00:03:08] exists across these time steps and in that way the model will learn to update
[00:03:10] that way the model will learn to update itself effectively learning how to
[00:03:12] itself effectively learning how to reconstruct the missing pieces from
[00:03:15] reconstruct the missing pieces from these inputs that we created during
[00:03:16] these inputs that we created during training
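As a rough sketch of this masking procedure (using the 15% selection rate and the 80/10/10 mask/random/keep split from the BERT paper; the tokens and vocabulary below are invented):

```python
import random

# Sketch of BERT-style input corruption: select ~15% of tokens; of those,
# 80% become [MASK], 10% become a random vocabulary word, and 10% are
# left unchanged but still predicted. Loss is computed only at selected
# positions (targets[i] is None elsewhere, i.e. m_t = 0).

def mask_tokens(tokens, vocab, p_select=0.15, rng=None):
    rng = rng or random.Random(0)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < p_select:
            targets.append(tok)            # m_t = 1: model must recover tok
            r = rng.random()
            if r < 0.8:
                masked.append('[MASK]')
            elif r < 0.9:
                masked.append(rng.choice(vocab))   # e.g. "every" for "rules"
            else:
                masked.append(tok)         # kept as-is, but still predicted
        else:
            targets.append(None)           # m_t = 0: no loss here
            masked.append(tok)
    return masked, targets
```

Running `mask_tokens(['the', 'rock', 'rules'] * 20, ['dog', 'cat'])` yields a corrupted copy of the sequence plus the reconstruction targets.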
[00:03:18] training so let's dive into that masked language
[00:03:20] so let's dive into that masked language modeling objective a little more deeply
[00:03:23] modeling objective a little more deeply for transformer parameters h theta
[00:03:26] for transformer parameters h theta and some sequence of tokens x with its
[00:03:28] and some sequence of tokens x with its corresponding masked version x hat this
[00:03:31] corresponding masked version x hat this is the objective here well let's zoom in
[00:03:33] is the objective here well let's zoom in on some time step t
[00:03:35] on some time step t uh the fundamental scoring thing is that
[00:03:37] uh the fundamental scoring thing is that we're going to look up the vector
[00:03:38] we're going to look up the vector representation in the embedding for that
[00:03:40] representation in the embedding for that time step t
[00:03:41] time step t and we'll take the dot product of that
[00:03:44] and we'll take the dot product of that with the output representation at time t
[00:03:46] with the output representation at time t from the entire
[00:03:47] from the entire transformer model
[00:03:49] transformer model that much there that scoring procedure
[00:03:51] that much there that scoring procedure looks a lot like what you get from
[00:03:53] looks a lot like what you get from conditional language models you just
[00:03:54] conditional language models you just have to remember that because of all
[00:03:56] have to remember that because of all those attention mechanisms connecting
[00:03:58] those attention mechanisms connecting every token to every other token this is
[00:04:00] every token to every other token this is not just the preceding context before
[00:04:03] not just the preceding context before time step t but rather the entire
[00:04:05] time step t but rather the entire surrounding context for this position
[00:04:08] surrounding context for this position and then as usual we normalize that by
[00:04:10] and then as usual we normalize that by considering all the alternative tokens x
[00:04:12] considering all the alternative tokens x prime that could be in this position
[00:04:15] prime that could be in this position now you'll notice over here there's an
[00:04:16] now you'll notice over here there's an indicator variable mt mt is 1 if token t
[00:04:19] indicator variable mt mt is 1 if token t was masked and 0 otherwise so that's like saying
[00:04:21] was massed l0 so that's like saying we're going to turn on this loss
[00:04:23] we're going to turn on this loss only for the tokens that we have masked
[00:04:25] only for the tokens that we have masked out
[00:04:27] out and then the final thing is kind of not
[00:04:29] and then the final thing is kind of not a definitional choice about this model
[00:04:30] a definitional choice about this model but something worth noting you'll see
[00:04:32] but something worth noting you'll see that we're using the embedding for this
[00:04:34] that we're using the embedding for this token
[00:04:35] token effectively as the softmax parameters
[00:04:37] effectively as the softmax parameters there could be separate parameters here
[00:04:39] there could be separate parameters here that we learned for the classifier part
[00:04:41] that we learned for the classifier part that's learning to be a language model
[00:04:43] that's learning to be a language model but i think people have found over
[00:04:44] but i think people have found over time that by tying these parameters by
[00:04:46] time that by tying these parameters by using the transpose of these parameters
[00:04:48] using the transpose of these parameters to create the output space we get some
[00:04:51] to create the output space we get some statistical strength and more efficient
[00:04:53] statistical strength and more efficient learning
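A minimal numerical sketch of this scoring step, with tied input/output embeddings and the m_t indicator turning the loss on only at masked positions (all vectors are invented for illustration):

```python
import math

# Sketch of the MLM objective: with tied embeddings, the logit for a
# candidate word x' at position t is the dot product of x'-s embedding
# with the transformer output h_t; we softmax-normalize over the
# vocabulary and sum the negative log-probability of the true token,
# counted only where m_t = 1.

emb = {'the': [1.0, 0.0], 'rock': [0.0, 1.0], 'rules': [1.0, 1.0]}

def mlm_loss(true_tokens, outputs, mask_indicators):
    total = 0.0
    for x_t, h_t, m_t in zip(true_tokens, outputs, mask_indicators):
        if m_t == 0:
            continue                      # loss only at masked positions
        logits = {w: sum(a * b for a, b in zip(e, h_t))
                  for w, e in emb.items()}
        z = sum(math.exp(v) for v in logits.values())
        total += -math.log(math.exp(logits[x_t]) / z)
    return total

loss = mlm_loss(['the', 'rock', 'rules'],
                [[0.5, 0.2], [0.1, 0.9], [0.4, 0.4]],
                [0, 1, 0])  # only "rock" was masked out
```

Note how the same `emb` table plays both roles, input lookup and softmax parameters, which is the parameter tying discussed above.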
[00:04:55] learning there is a second objective in the bert
[00:04:57] there is a second objective in the verb model and it is the binary next sentence
[00:04:59] model and it is the binary next sentence prediction task
[00:05:00] prediction task uh and i think this was an attempt to
[00:05:02] uh and i think this was an attempt to find some coherence uh beyond just the
[00:05:04] find some coherence uh beyond just the simple sentence or sequence level
[00:05:07] simple sentence or sequence level uh so this is pretty straightforward for
[00:05:08] uh so this is pretty straightforward for positive instances for this class we're
[00:05:10] positive instances for this class we're going to take actual sequences of
[00:05:12] going to take actual sequences of sentences in the corpus that we're using
[00:05:14] sentences in the corpus that we're using for training so here you can see these
[00:05:16] for training so here you can see these actually occurred together and they are
[00:05:17] actually occurred together and they are labeled as next and for negative
[00:05:19] labeled as next and for negative examples we just randomly choose a
[00:05:21] examples we just randomly choose a second sentence and label that as not
[00:05:23] second sentence and label that as not next and i think the aspiration here was
[00:05:25] next and i think the aspiration here was that this would help the model learn
[00:05:27] that this would help the model learn some notion of discourse coherence
[00:05:29] some notion of discourse coherence beyond the local coherence of
[00:05:31] beyond the local coherence of individual sequences
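The example-construction recipe for this objective can be sketched like so; the tiny corpus is invented, and a real implementation would also avoid accidentally sampling the true next sentence as a negative:

```python
import random

# Sketch of next-sentence-prediction data construction: positive pairs
# are adjacent sentences from the corpus (label "next"); negative pairs
# swap in a randomly chosen sentence (label "not_next").

corpus = ['the rock rules', 'everyone agrees', 'bert is a transformer',
          'it uses masked language modeling']

def nsp_example(i, rng):
    if rng.random() < 0.5:
        return (corpus[i], corpus[i + 1], 'next')       # true continuation
    return (corpus[i], rng.choice(corpus), 'not_next')  # random second sentence

rng = random.Random(0)
examples = [nsp_example(i, rng) for i in range(len(corpus) - 1)]
```

In BERT the two sentences of each pair are packed into one input, separated by the sep token and distinguished by the segment embeddings.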
[00:05:34] now what you probably want to do with
[00:05:36] now what you probably want to do with the bert model is not train it from
[00:05:38] the bert model is not train it from scratch but rather fine tune it on a
[00:05:40] scratch but rather fine tune it on a particular task that you have there are
[00:05:42] particular task that you have there are many modes that you can think about for
[00:05:44] many modes that you can think about for doing this a kind of default choice a
[00:05:47] doing this a kind of default choice a standard and simple choice would be to
[00:05:48] standard and simple choice would be to use the class token
[00:05:50] use the class token more specifically its output in the
[00:05:52] more specifically its output in the final layer of the bert model as the
[00:05:54] final layer of the bert model as the basis for setting some task specific
[00:05:57] basis for setting some task specific parameters and then using that um
[00:05:59] parameters and then using that um whatever labels you have for supervision
[00:06:01] whatever labels you have for supervision up here
[00:06:02] up here and that could be effective because the
[00:06:03] and that could be effective because the class token appears in this position in
[00:06:05] class token appears in this position in every single one of these sequences and
[00:06:07] every single one of these sequences and so you might think of it as a good
[00:06:08] so you might think of it as a good summary representation of the entire
[00:06:10] summary representation of the entire sequence and then when you do the fine
[00:06:12] sequence and then when you do the fine tuning you'll of course be updating
[00:06:14] tuning you'll of course be updating these task parameters and then you could
[00:06:16] these task parameters and then you could if you wanted to also update some or all
[00:06:18] if you wanted to also update some or all of the actual parameters from the
[00:06:20] of the actual parameters from the pre-trained model that would be a true
[00:06:22] pre-trained model that would be a true notion of fine-tuning
[00:06:24] notion of fine-tuning now you might worry that the class token
[00:06:25] now you might worry that the class token is an insufficient summary of the entire
[00:06:27] is an insufficient summary of the entire sequence and so you could of course
[00:06:29] sequence and so you could of course think about pooling all the output
[00:06:31] think about pooling all the output states in the sequence via some function
[00:06:33] states in the sequence via some function like sum or mean or max and using those
[00:06:36] like sum or mean or max and using those as the input to whatever task specific
[00:06:38] as the input to whatever test specific parameters you have up here at the top
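These two readout strategies, the class-token output versus pooling all output states, can be sketched as follows (the vectors and task parameters are invented for illustration):

```python
# Sketch of two fine-tuning readouts: (a) take just the final-layer
# output at the [CLS] position, or (b) pool all final-layer outputs
# (here, mean pooling), then feed the result to task-specific classifier
# parameters that are trained during fine-tuning.

outputs = [[0.2, 0.8], [0.6, 0.4], [1.0, 0.0]]  # final layer, [CLS] first

def cls_pool(outs):
    return outs[0]                       # just the [CLS] position

def mean_pool(outs):
    n = len(outs)
    return [sum(v[d] for v in outs) / n for d in range(len(outs[0]))]

def classify(features, weights, bias):
    # task-specific parameters, learned from whatever labels you have
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 'positive' if score > 0 else 'negative'

label = classify(mean_pool(outputs), weights=[1.0, -1.0], bias=0.0)
```

True fine-tuning would backpropagate through `classify` into the pre-trained transformer parameters as well, not only into the task head.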
[00:06:42] i just want to remind us that
[00:06:43] i just want to remind us that tokenization in bert is a little unusual
[00:06:46] tokenization in bert is a little unusual we've covered this a few times before
[00:06:47] we've covered this a few times before but just remember that we're getting
[00:06:49] but just remember that we're getting effectively not full words but word
[00:06:51] effectively not full words but word pieces so for cases like encode me you
[00:06:54] pieces so for cases like encode me you can see that the word encode has been
[00:06:56] can see that the word encode has been split apart into two word pieces and
[00:06:57] split apart into two word pieces and we're hoping implicitly that the model
[00:07:00] we're hoping implicitly that the model can learn that that is in some deep
[00:07:01] can learn that that is in some deep sense still a word even though it has
[00:07:03] sense still a word even though it has been split apart and that should draw on
[00:07:05] been split apart and that should draw on the truly contextual nature of these
[00:07:07] the truly contextual nature of these models
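A simplified greedy longest-match sketch of WordPiece-style splitting, with non-initial pieces marked by "##" as in BERT (the tiny vocabulary is invented; BERT's real vocabulary has about 30,000 entries):

```python
# Greedy longest-match-first word-piece splitting: repeatedly take the
# longest prefix of the remaining word that is in the vocabulary, marking
# non-initial pieces with "##". Words with no matching piece map to [UNK].

vocab = {'en', 'code', 'me', '##code', '##de', 'the'}

def wordpiece(word):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece     # continuation-piece marker
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ['[UNK]']             # no piece matched at this offset
        start = end
    return pieces

tokens = wordpiece('encode')  # splits into ['en', '##code']
```

So "encode" comes apart into two pieces, and the attention mechanism is what lets the model treat those pieces as one word in context.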
[00:07:08] models the bert team did two initial model
[00:07:10] the bert team did two initial model releases bert base consists of 12
[00:07:12] releases bert base consists of 12 transformer layers and has
[00:07:13] transformer layers and has representations of dimension 768 with 12
[00:07:17] representations of dimension 768 with 12 attention heads for a total of 110
[00:07:19] attention heads for a total of 110 million parameters that's of course a
[00:07:21] million parameters that's of course a very large model but this is manageable
[00:07:23] very large model but this is manageable for you to do local work especially if
[00:07:25] for you to do local work especially if you just want to do some simple fine
[00:07:26] you just want to do some simple fine tuning or use this model for
[00:07:28] tuning or use this model for inference the bert large release is much
[00:07:31] inference the bert large release is much larger it has 24 layers
[00:07:33] larger it has 24 layers twice the dimensionality for its
[00:07:35] twice the dimensionality for its representations and 16 attention heads
[00:07:37] representations and 16 attention heads for a total of 340 million parameters
[00:07:40] for a total of 340 million parameters this is large enough that it might be
[00:07:42] this is large enough that it might be difficult to do local work with but of
[00:07:44] difficult to do local work with but of course you might get much more
[00:07:45] course you might get much more representational power from using it
[00:07:48] representational power from using it for both of these models we have a
[00:07:49] for both of these models we have a limitation to 512 tokens and that is
[00:07:52] limitation to 512 tokens and that is because that is the size of the
[00:07:54] because that is the size of the positional embedding space that they
[00:07:55] positional embedding space that they learned there are many new releases of
[00:07:57] learned there are many new releases of course you can find those at the project
[00:07:59] course you can find those at the project site and hugging face has made it very
[00:08:01] site and hugging face has made it very easy to access these models and that's
[00:08:02] easy to access these models and that's been very empowering
[00:08:05] been very empowering to close this let me just mention a few
[00:08:06] to close this let me just mention a few known limitations of bert that we're
[00:08:08] known limitations of bert that we're going to return to as we go through some
[00:08:09] going to return to as we go through some subsequent models for this unit so first
[00:08:12] subsequent models for this unit so first in the original paper there is a large
[00:08:14] in the original paper there is a large but still partial number of ablation
[00:08:16] but still partial number of ablation studies and optimization studies there's
[00:08:19] studies and optimization studies there's a huge landscape that's in play here and
[00:08:21] a huge landscape that's in play here and only small parts of it are explored in
[00:08:23] only small parts of it are explored in the original paper so we might worry
[00:08:25] the original paper so we might worry that there are better choices we could
[00:08:27] that there are better choices we could be making within this space
[00:08:30] be making within this space the original paper also points out that
[00:08:32] the original paper also points out that there's some unnaturalness about this
[00:08:33] there's some unnaturalness about this mask token they say the first downside
[00:08:36] mask token they say the first downside of the mlm objective is that we are
[00:08:38] of the mlm objective is that we are creating a mismatch between pre-training
[00:08:40] creating a mismatch between pre-training and fine-tuning because the mask token
[00:08:42] and fine-tuning because the mask token is never seen during fine tuning so
[00:08:44] is never seen during fine tuning so that's something we might want to
[00:08:45] that's something we might want to address
[00:08:47] address they also point out that there's a
[00:08:48] they also point out that there's a downside to using the mlm objective
[00:08:50] downside to using the mlm objective which is just that it's kind of data
[00:08:51] which is just that it's kind of data inefficient we can only mask out a small
[00:08:54] inefficient we can only mask out a small percentage of the tokens because we need
[00:08:56] percentage of the tokens because we need the surrounding context in order to
[00:08:57] the surrounding context in order to sensibly reproduce those tokens
[00:09:00] sensibly reproduce those tokens and that means that it's kind of data
[00:09:01] and that means that it's kind of data inefficient
[00:09:03] and finally this is from the xlnet
[00:09:05] and finally this is from the xlnet paper i think this is quite perceptive
[00:09:07] paper i think this is quite perceptive bert assumes that the predicted tokens
[00:09:09] bert assumes that the predicted tokens are independent of each other given the
[00:09:10] are independent of each other given the unmasked tokens which is oversimplified
[00:09:13] unmasked tokens which is oversimplified as high order long-range dependency is
[00:09:15] as high order long-range dependency is prevalent in natural language what they
[00:09:17] prevalent in natural language what they have in mind here is essentially if you
[00:09:18] have in mind here is essentially if you have an idiom like out of this world
[00:09:21] have an idiom like out of this world and it happens that both the first and
[00:09:23] and it happens that both the first and the last words in that idiom are masked
[00:09:25] the last words in that idiom are masked out then bert is going to try to
[00:09:26] out then bert is going to try to reproduce them as though they were
[00:09:28] reproduce them as though they were independent of each other when in fact
[00:09:29] independent of each other when in fact we know that there is a statistical
[00:09:32] we know that there is a statistical dependency between them coming from the
[00:09:34] dependency between them coming from the fact that they are participating in this
[00:09:35] fact that they are participating in this idiom
[00:09:36] idiom so there's some notion of
[00:09:38] so there's some notion of representational
[00:09:39] representational coherence that bert is simply not
[00:09:41] coherence that bert is simply not capturing with its mlm objective
Lecture 023
RoBERTa | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=EZMOBbu_5b8
---
Transcript
[00:00:05] welcome back everyone this is part four
[00:00:06] welcome back everyone this is part four in our series on contextual word
[00:00:08] in our series on contextual word representations we are going to be
[00:00:09] representations we are going to be talking about a robustly optimized Bert
[00:00:11] talking about a robustly optimized Bert approach AKA
[00:00:14] approach AKA Roberta so recall that I finished the
[00:00:16] Roberta so recall that I finished the Bert screencast by listing out some
[00:00:17] Bert screencast by listing out some known limitations of the Bert model most
[00:00:20] known limitations of the Bert model most of which were identified by the original
[00:00:22] of which were identified by the original Bert authors themselves and top of the
[00:00:24] Bert authors themselves and top of the list was simply that although the
[00:00:26] list was simply that although the original Bert paper does a good job of
[00:00:28] original Bert paper does a good job of exploring ablations of their system and
[00:00:30] exploring ablations of their system and different optimization choices there's a
[00:00:33] different optimization choices there's a very large landscape of ideas here and
[00:00:35] very large landscape of ideas here and most of it was left unexplored in the
[00:00:37] most of it was left unexplored in the original paper essentially what the
[00:00:39] original paper essentially what the Roberta team did is Explore More widely
[00:00:42] Roberta team did is Explore More widely in the space that is the robustly
[00:00:44] in the space that is the robustly optimized part of
[00:00:46] optimized part of Roberta so what I've done for this slide
[00:00:48] Roberta so what I've done for this slide here is list out what I take to be the
[00:00:49] here is list out what I take to be the central differences between Bert and
[00:00:51] central differences between Bert and Roberta and I'll follow this up with
[00:00:53] Roberta and I'll follow this up with some evidence from the Roberta paper in
[00:00:55] some evidence from the Roberta paper in a second but first let's go
[00:00:56] a second but first let's go through the central differences
[00:00:58] through the central differences beginning with this question of static
[00:01:00] beginning with this question of static versus Dynamic masking so for the
[00:01:02] versus Dynamic masking so for the original Bert paper what they did is
[00:01:04] original Bert paper what they did is create four copies of their data set
[00:01:07] create four copies of their data set each with different masking and then
[00:01:09] each with different masking and then those four copies were used repeatedly
[00:01:11] those four copies were used repeatedly through epochs of training the Roberta team
[00:01:14] through epochs of training the Roberta team had the intuition that it would be
[00:01:15] had the intuition that it would be useful to inject some diversity into
[00:01:17] useful to inject some diversity into this training process so they went to
[00:01:19] this training process so they went to the Other Extreme Dynamic masking every
[00:01:22] the Other Extreme Dynamic masking every single example when it's presented to
[00:01:23] single example when it's presented to the model is masked in a potentially
[00:01:25] the model is masked in a potentially different way via some random
[00:01:28] different way via some random function there are also differences in
[00:01:31] function there are also differences in how examples are presented to the model
[00:01:32] how examples are presented to the model so Bert presented two concatenated
[00:01:35] so Bert presented two concatenated document segments this was crucial to
[00:01:37] document segments this was crucial to its next sentence prediction task
[00:01:39] its next sentence prediction task whereas for Roberta we're just going to
[00:01:40] whereas for Roberta we're just going to have sentence sequences that is pairs
[00:01:43] have sentence sequences that is pairs that may even span document
[00:01:45] that may even span document boundaries relatedly whereas Bert had as
[00:01:48] boundaries relatedly whereas Bert had as one of its Central pieces this next
[00:01:50] one of its Central pieces this next sentence prediction task uh Roberta
[00:01:53] sentence prediction task uh Roberta simply drops that as part of the
[00:01:55] simply drops that as part of the objective here that simplifies the
[00:01:57] objective here that simplifies the presentation of examples and also
[00:01:59] presentation of examples and also simplifies the modeling objective now
[00:02:01] simplifies the modeling objective now Roberta is simply using the mass
[00:02:03] Roberta is simply using the mass language modeling
[00:02:06] language modeling objective the uh there are also changes
[00:02:08] objective the uh there are also changes to the size of the training batches so
[00:02:10] to the size of the training batches so for Bert the batch size was 256 examples
[00:02:13] for Bert the batch size was 256 examples Roberta cranked that all the way up to
[00:02:15] Roberta cranked that all the way up to 2,000 there are differences when it
[00:02:17] 2,000 there are differences when it comes to tokenization so as we've seen
[00:02:19] comes to tokenization so as we've seen Bert used this very interesting word
[00:02:21] Bert used this very interesting word piece tokenization approach which mixes
[00:02:23] piece tokenization approach which mixes some subword pieces with some whole
[00:02:26] some subword pieces with some whole words Roberta simplified that down to
[00:02:28] words Roberta simplified that down to just character-level byte-pair encoding
[00:02:31] just character-level byte-pair encoding which I think leads to many more word
[00:02:33] which I think leads to many more word pieces
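As a rough illustration of byte-pair encoding's mechanics, here is a toy greedy encoder that applies a learned merge table. The merge ranks below are invented for the example; a real tokenizer (such as RoBERTa's byte-level BPE) learns its merges from corpus statistics and operates over bytes rather than characters.

```python
def bpe_encode(word, merges):
    """Greedily apply BPE merges (lower rank = higher priority)."""
    symbols = list(word)
    while True:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # find the mergeable adjacent pair with the best (lowest) rank
        ranked = [(merges[p], i) for i, p in enumerate(pairs) if p in merges]
        if not ranked:
            return symbols
        _, i = min(ranked)
        symbols = symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]

# invented merge table for illustration
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_encode("lower", merges))  # ['low', 'er']
```

A word with no applicable merges simply falls back to its individual symbols, which is why pure BPE tends to produce more, smaller word pieces than WordPiece-style vocabularies.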
[00:02:35] pieces intuitively there are also differences
[00:02:37] intuitively there are also differences in how the model was trained so
[00:02:40] in how the model was trained so Bert trained on a substantial Corpus the
[00:02:42] Bert trained on a substantial Corpus the Books Corpus plus English Wikipedia is a
[00:02:44] Books Corpus plus English Wikipedia is a lot of data indeed Roberta again cranked
[00:02:47] lot of data indeed Roberta again cranked that up even further they trained on the
[00:02:49] that up even further they trained on the Books Corpus the CC News Corpus the open
[00:02:52] Books Corpus the CC News Corpus the open web Text corpus and the stories Corpus a
[00:02:55] web Text corpus and the stories Corpus a substantial increase in the amount of
[00:02:56] substantial increase in the amount of training
[00:02:58] training data there are also differences in
[00:03:00] data there are also differences in the number of training steps and there's
[00:03:02] the number of training steps and there's a subtlety here so for the Bert model it
[00:03:04] a subtlety here so for the Bert model it was originally trained on 1 million
[00:03:06] was originally trained on 1 million steps the Roberta model was trained on
[00:03:09] steps the Roberta model was trained on 500,000 steps which sounds like fewer
[00:03:11] 500,000 steps which sounds like fewer steps but overall this is substantially
[00:03:14] steps but overall this is substantially more training in virtue of the fact that
[00:03:16] more training in virtue of the fact that the training batch sizes are so much
[00:03:19] the training batch sizes are so much larger for Roberta than they are for
[00:03:22] larger for Roberta than they are for Bert and finally the original Bert
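Since the step counts point in the opposite direction from the batch sizes, it is worth doing the arithmetic on the numbers quoted here. This is just back-of-the-envelope bookkeeping using the figures from the lecture:

```python
# Examples processed = training steps x batch size, per the lecture's numbers.
bert_examples    = 1_000_000 * 256    # BERT: 1M steps at batch size 256
roberta_examples = 500_000 * 2_000    # RoBERTa: 500K steps at batch size 2K

ratio = roberta_examples / bert_examples
print(bert_examples, roberta_examples, ratio)  # RoBERTa sees roughly 4x more
```

So despite running half as many steps, RoBERTa processes roughly four times as many training examples overall.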
[00:03:24] Bert and finally the original Bert authors had an intuition that it would
[00:03:25] authors had an intuition that it would be useful in getting the optimization
[00:03:27] be useful in getting the optimization process going to train just on short
[00:03:30] process going to train just on short sequences first the Roberta team dropped
[00:03:32] sequences first the Roberta team dropped that idea and they train on full length
[00:03:34] that idea and they train on full length sequences throughout the life cycle of
[00:03:36] sequences throughout the life cycle of optimization there are some additional
[00:03:38] optimization there are some additional differences related to the optimizer and
[00:03:40] differences related to the optimizer and the data presentation I'm going to set
[00:03:42] the data presentation I'm going to set those aside if you want the details I
[00:03:44] those aside if you want the details I refer to section 3.1 of the Roberta
[00:03:46] refer to section 3.1 of the Roberta paper so let's look at a little bit of
[00:03:48] paper so let's look at a little bit of evidence for these various choices
[00:03:50] evidence for these various choices starting with that question of dynamic
[00:03:52] starting with that question of dynamic versus static masking so this is the
[00:03:55] versus static masking so this is the primary evidence they're using three
[00:03:56] primary evidence they're using three benchmarks SQuAD MNLI and SST-2
[00:04:00] benchmarks SQuAD MNLI and SST-2 and you can see that more or less across
[00:04:01] and you can see that more or less across the board Dynamic masking is better not
[00:04:04] the board Dynamic masking is better not by a lot um but Dynamic masking also has
[00:04:07] by a lot um but Dynamic masking also has going for this intuition that you know
[00:04:09] going for this intuition that you know bird is kind of data inefficient we can
[00:04:11] bird is kind of data inefficient we can only mask out a small number of tokens
[00:04:13] only mask out a small number of tokens and it feels like it ought to be useful
[00:04:15] and it feels like it ought to be useful to inject a lot of diversity into that
[00:04:17] to inject a lot of diversity into that so that a lot of different tokens get
[00:04:19] so that a lot of different tokens get masked as we go through the training
[00:04:21] masked as we go through the training process but the choice is of course
[00:04:22] process but the choice is of course supported numerically here I think
[00:04:24] supported numerically here I think pretty
[00:04:26] pretty substantially this slide this table
[00:04:29] substantially this slide this table here summarizes the choice about how
[00:04:30] here summarizes the choice about how to present examples to the model and
[00:04:32] to present examples to the model and this is also a little bit subtle so
[00:04:35] this is also a little bit subtle so numerically the doc sentences approach
[00:04:37] numerically the doc sentences approach was best and this was an approach where
[00:04:39] was best and this was an approach where they took contiguous
[00:04:41] they took contiguous sentences from within documents but
[00:04:43] sentences from within documents but treated a document boundary as a kind of
[00:04:45] treated a document boundary as a kind of hard boundary that's numerically better
[00:04:47] hard boundary that's numerically better according to the Benchmark results but
[00:04:49] according to the Benchmark results but they actually decided to go with the
[00:04:51] they actually decided to go with the full sentences approach and the reason
[00:04:53] full sentences approach and the reason for that is in not respecting document
[00:04:55] for that is in not respecting document boundaries it is easier to create lots
[00:04:58] boundaries it is easier to create lots of batches of exactly the same size
[00:05:00] of batches of exactly the same size which leads to all sorts of gains when
[00:05:02] which leads to all sorts of gains when you think about optimizing a large model
[00:05:04] you think about optimizing a large model like this so basically they decided that
[00:05:06] like this so basically they decided that those gains offset the slightly lower
[00:05:09] those gains offset the slightly lower performance of full sentences as
[00:05:11] performance of full sentences as compared to Doc sentences and that's why
[00:05:13] compared to Doc sentences and that's why this became their Central
[00:05:16] this became their Central approach here's the the summary of
[00:05:18] approach here's the the summary of evidence for choosing 2K as the batch
[00:05:20] evidence for choosing 2K as the batch size you can see that they chose 256
[00:05:23] size you can see that they chose 256 which was the burnt original 2K and AK
[00:05:26] which was the burnt original 2K and AK and 2K looks like The Sweet Spot
[00:05:28] and 2K looks like The Sweet Spot according to mli SS T2 and this kind of
[00:05:31] according to mli SS T2 and this kind of pseudo perplexity value that you get out
[00:05:33] pseudo perplexity value that you get out of bidirectional models like Bert and
[00:05:35] of bidirectional models like Bert and Roberta so that's a clear argument and
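The pseudo-perplexity idea for bidirectional models can be sketched as follows: mask each position in turn, score the true token at that position, and exponentiate the average negative log-probability. The `score_masked` callable below stands in for a real masked-LM forward pass (e.g. a BERT call) and is an assumption of this sketch.

```python
import math

def pseudo_perplexity(tokens, score_masked):
    """score_masked(tokens, i) -> P(tokens[i] | tokens with position i masked)."""
    total = sum(math.log(score_masked(tokens, i)) for i in range(len(tokens)))
    return math.exp(-total / len(tokens))

# Toy stand-in model that assigns probability 0.5 to every token:
uniform = lambda tokens, i: 0.5
print(pseudo_perplexity(["the", "chef", "cooked"], uniform))  # approximately 2.0
```

Unlike true perplexity from a left-to-right model, this requires one forward pass per token, but it gives a comparable intrinsic quality measure for models like BERT and RoBERTa.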
[00:05:37] Roberta so that's a clear argument and then finally when we come to the just
[00:05:39] then finally when we come to the just the amount of training that we do uh the
[00:05:41] the amount of training that we do uh the lesson here apparently is more is better
[00:05:44] lesson here apparently is more is better on the top of this table here we have
[00:05:45] on the top of this table here we have some comparisons within the Roberta
[00:05:47] some comparisons within the Roberta model pointing to 500k as the best and I
[00:05:50] model pointing to 500k as the best and I would just remind you that that is
[00:05:51] would just remind you that that is overall substantially more training than
[00:05:54] overall substantially more training than was done in 1 million steps with Bert in
[00:05:56] was done in 1 million steps with Bert in virtue of the fact that our batch sizes
[00:05:58] virtue of the fact that our batch sizes for Roberta are so much
[00:06:01] for Roberta are so much larger in closing I just want to say
[00:06:03] larger in closing I just want to say that Roberta too only explored a small
[00:06:06] that Roberta too only explored a small part of the potential design choices
[00:06:08] part of the potential design choices that we could make in this large
[00:06:10] that we could make in this large landscape uh if you would like to hear
[00:06:12] landscape uh if you would like to hear even more about what we know and what we
[00:06:14] even more about what we know and what we think we know about models like Bert and
[00:06:16] think we know about models like Bert and Roberta I highly recommend this paper
[00:06:18] Roberta I highly recommend this paper called A Primer in BERTology which has
[00:06:20] called A Primer in BERTology which has lots of additional wisdom and insights
[00:06:23] lots of additional wisdom and insights and ideas about these models
Lecture 024
ELECTRA | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=6NSRLEiqsoE
---
Transcript
[00:00:04] welcome everyone this is part five in
[00:00:06] welcome everyone this is part five in our series on contextual word
[00:00:07] our series on contextual word representations we're going to be
[00:00:08] representations we're going to be talking about the electra model electra
[00:00:10] talking about the electra model electra stands for efficiently learning an
[00:00:12] stands for efficiently learning an encoder that classifies token
[00:00:14] encoder that classifies token replacements accurately which is a
[00:00:15] replacements accurately which is a helpfully descriptive breakdown of a
[00:00:17] helpfully descriptive breakdown of a colorfully named model
[00:00:19] colorfully named model recall that i finished the bert
[00:00:21] recall that i finished the bert screencast by identifying some known
[00:00:23] screencast by identifying some known limitations of the bert model
[00:00:25] limitations of the bert model electra is really keying into two and
[00:00:26] elektra is really keying into two and three in that list so the second one
[00:00:29] three in that list so the second one identified by the bert authors is just
[00:00:31] identified by the bert authors is just that of the mlm objective they say we're
[00:00:33] that of the mlm objective they say we're creating a mismatch between pre-training
[00:00:35] creating a mismatch between pre-training and fine-tuning since the mask token
[00:00:37] and fine-tuning since the mask token that we use is never seen during fine
[00:00:39] that we use is never seen during fine tuning so ideally for the model we
[00:00:41] tuning so ideally for the model we fine-tuned we would make no use of a
[00:00:43] fine-tuned we would make no use of a mask token
[00:00:45] mask token devlin et al also observe of
[00:00:47] devlin et al also observe of the mlm objective that it has a downside
[00:00:50] the mlm objective that it has a downside we make predictions about only 15% of the
[00:00:53] we make predictions about only 15% of the tokens in each batch
[00:00:55] tokens in each batch we have an intuition that that's a
[00:00:56] we have an intuition that that's a pretty inefficient use of the data we
[00:00:58] pretty inefficient use of the data we have available to us ideally we would
[00:01:00] have available to us ideally we would make more predictions and elektra seeks
[00:01:02] make more predictions and elektra seeks to make good on that intuition as well
[00:01:05] to make good on that intuition as well so let's dive into the core model
[00:01:06] so let's dive into the core model structure here we'll use a simple
[00:01:08] structure here we'll use a simple example we have an input token sequence
[00:01:10] example we have an input token sequence x the chef cooked the meal
[00:01:12] x the chef cooked the meal and as usual with bert we can mask out
[00:01:15] and as usual with bert we can mask out some of those tokens and then have a
[00:01:17] some of those tokens and then have a bert or bert-like model try to
[00:01:19] bert or bert-like model try to reconstruct those masked tokens however
[00:01:21] reconstruct those mask tokens however we're going to do that with a twist
[00:01:23] we're going to do that with a twist instead of always trying to learn the
[00:01:25] instead of always trying to learn the actual input token we're going to sample
[00:01:28] actual input token we're going to sample tokens proportional to the generator
[00:01:30] tokens proportional to the generator probabilities so that sometimes the
[00:01:32] probabilities so that sometimes the actual token will be input as with the
[00:01:32] actual token will be input as with the case of 'the' here and sometimes it will
[00:01:34] case of 'the' here and sometimes it will be some other token as the case with
[00:01:37] be some other token as the case with cooked going to 'ate' in this position
[00:01:38] cooked going to 'ate' in this position now the job of electra the discriminator
[00:01:44] now the job of electra the discriminator here is to figure out which of those
[00:01:46] here is to figure out which of those tokens were in the original input
[00:01:48] tokens were in the original input sequence and which have been replaced so
[00:01:50] sequence and which have been replaced so that's a binary prediction task and we
[00:01:52] that's a binary prediction task and we can make it about all of the tokens in
[00:01:54] can make it about all of the tokens in our input sequence if we choose to
[00:01:57] our input sequence if we choose to the actual loss for
[00:01:59] the actual loss for electra is the sum of the generator loss
[00:02:01] electra is the sum of the generator loss and a weighted version of the electra loss
[00:02:03] and a weighted version of the electra loss that is the discriminator loss however
[00:02:05] that is the discriminator loss however that's kind of masking an important
[00:02:07] that's kind of masking an important asymmetry in this model here
[00:02:09] asymmetry in this model here once we have trained the generator we
[00:02:11] once we have trained the generator we can let it fall away and do all of our
[00:02:13] can let it fall away and do all of our fine-tuning on the discriminator that is
[00:02:15] fine-tuning on the discriminator that is on electra itself which means that we'll
[00:02:18] on electra itself which means that we'll be fine-tuning a model that never saw
[00:02:20] be fine-tuning a model that never saw any of those mask tokens so we addressed
[00:02:22] any of those mask tokens so we addressed that first limitation of bert and we're
[00:02:24] that first limitation of bert and we're also going to make a prediction with
[00:02:26] also going to make a prediction with electra about every single one of
[00:02:27] with electra about every single one of the input tokens which means that we're
[00:02:29] the input tokens which means that we're making more use of the available data
[00:02:34] one thing i really like about the
[00:02:35] one thing i really like about the electra paper is that it offers a really
[00:02:37] electra paper is that it offers a really rich set of analyses of the efficiency
[00:02:39] rich set of analyses of the efficiency of the model and of its optimal design
[00:02:41] of the model and of its optimal design so i'm going to highlight some of those
[00:02:43] so i'm going to highlight some of those results here starting with this
[00:02:44] results here starting with this generator discriminator relationship
[00:02:46] generator discriminator relationship result
[00:02:47] result so the authors observe that where the
[00:02:49] so the authors observe that where the generator and discriminator are the same
[00:02:51] generator and discriminator are the same size
[00:02:52] size they can share all their transformer
[00:02:54] they can share all their transformer parameters they can kind of be one model
[00:02:56] parameters they can kind of be one model in essence and they find that more
[00:02:58] in essence and they find that more sharing is indeed better which is
[00:03:00] sharing is indeed better which is encouraging
[00:03:01] encouraging however they also observe that the best
[00:03:03] however they also observe that the best results come from having a
[00:03:05] results come from having a generator that is small compared to the
[00:03:08] generator that is small compared to the discriminator and this plot kind of
[00:03:10] discriminator and this plot kind of summarizes the evidence there so
[00:03:12] summarizes the evidence there so uh we have our glue score as the goal
[00:03:14] uh we have our glue score as the goal post that we're going to use to assess
[00:03:15] post that we're going to use to assess these models that's along the y-axis
[00:03:17] these models that's along the y-axis along the x-axis we have the generator
[00:03:20] along the x-axis we have the generator size and then we've plotted out a few
[00:03:22] size and then we've plotted out a few sizes for the discriminator and i think
[00:03:24] sizes for the discriminator and i think what you can see quite clearly is that
[00:03:25] what you can see quite clearly is that in general you get the best results on
[00:03:28] in general you get the best results on glue where the discriminator is two to
[00:03:30] glue where the discriminator is two to three times larger than the generator
[00:03:33] three times larger than the generator and that's true even for this very small
[00:03:35] and that's true even for this very small model in green down here the results are
[00:03:36] model in green down here the results are overall not very good but we see that
[00:03:38] overall not very good but we see that same relationship where the optimal
[00:03:40] same relationship where the optimal discriminator is at size 256
[00:03:43] discriminator is at size 256 and the generator at size 64. that's
[00:03:45] and the generator at size 64. that's where we reach our peak results and
[00:03:47] where we reach our peak results and that's kind of comparable to this very
[00:03:48] that's kind of comparable to this very large model in blue where optimal size
[00:03:51] large model in blue where optimal size for the discriminator is 768 compared to
[00:03:53] for the discriminator is 768 compared to 256 for the generator
[00:03:57] they also do a bunch of really
[00:03:58] they also do a bunch of really interesting efficiency analyses one
[00:04:00] interesting efficiency analyses one thing i like about the paper is that
[00:04:02] thing i like about the paper is that it's kind of oriented toward figuring
[00:04:04] it's kind of oriented toward figuring out how we can train these models more
[00:04:06] out how we can train these models more efficiently with fewer compute resources
[00:04:08] efficiently with fewer compute resources and this is a kind of summary of central
[00:04:10] and this is a kind of summary of central evidence that they offer that electra
[00:04:12] evidence that they offer that electra can be an efficient model so again we're
[00:04:15] can be an efficient model so again we're going to use along the y-axis the glue
[00:04:16] going to use along the y-axis the glue score as our goal post but along the
[00:04:19] score as our goal post but along the x-axis here we have pre-training flops so
[00:04:21] x-axis here we have pre-training flops so this would be the number of compute
[00:04:22] this would be the number of compute operations that you need to pre-train
[00:04:24] operations that you need to pre-train the model
[00:04:25] the model in blue along the top here is electra
[00:04:27] in blue along the top here is electra it's the very best model
[00:04:29] it's the very best model in orange just below it is adversarial
[00:04:31] in orange just below it is adversarial electra
[00:04:32] electra which is an interesting approach to
[00:04:34] which is an interesting approach to electra where we essentially train the
[00:04:36] electra where we essentially train the generator to try to fool the
[00:04:37] generator to try to fool the discriminator as opposed to having the
[00:04:39] discriminator as opposed to having the two cooperate as in core electra and
[00:04:41] two cooperate as in core electra and that turns out to be pretty good and
[00:04:43] that turns out to be pretty good and also these green lines are really
[00:04:45] also these green lines are really interesting so
[00:04:46] interesting so two-stage electra is where i start by
[00:04:48] two-stage electra is where i start by training just against the bert objective
[00:04:50] training just against the bert objective and at a certain point switch over to
[00:04:53] and at a certain point switch over to training the electra objective and you
[00:04:54] training the electra objective and you can see that even that is better than
[00:04:56] can see that even that is better than just continuing on with bert all the way
[00:04:58] just continuing on with bert all the way up to the maximum for our compute budget
[00:05:00] up to the maximum for our compute budget here
[00:05:03] the paper also explores a bunch of
[00:05:05] the paper also explores a bunch of variations on the electra objective
[00:05:07] variations on the electra objective itself so i presented to you full
[00:05:09] itself so i presented to you full electra and it's full electra in the
[00:05:12] electra and it's full electra in the sense that over here on the right we're
[00:05:13] sense that over here on the right we're making predictions about every single
[00:05:15] making predictions about every single one of the tokens in the input
[00:05:18] one of the tokens in the input we could also explore something that was
[00:05:19] we could also explore something that was analogous to bert electra 15% would be
[00:05:22] analogous to bert electra 15% would be the case where we make predictions only
[00:05:25] the case where we make predictions only about tokens that were way back here in
[00:05:27] about tokens that were way back here in the input x-masked actually masked out
[00:05:31] the input x-masked actually masked out another variant that the team considered
[00:05:33] another variant that the team considered actually relates to how we train bert
[00:05:34] actually relates to how we train bert so recall that for bert we train both by
[00:05:37] so recall that for bert we train both by masking and by replacing some tokens
[00:05:40] masking and by replacing some tokens with other randomly chosen tokens and we
[00:05:42] with other randomly chosen tokens and we could try training the generator just
[00:05:44] could try training the generator just with that approach which would eliminate
[00:05:46] with that approach which would eliminate the mask token entirely so that's this
[00:05:48] the mask token entirely so that's this variant here where we have no masking on
[00:05:50] variant here where we have no masking on x-masked but rather just randomly
[00:05:52] x-masked but rather just randomly replace tokens from the actual
[00:05:54] replace tokens from the actual vocabulary
[00:05:56] vocabulary and then finally all tokens mlm would
[00:05:58] and then finally all tokens mlm would adopt some ideas from electra into the
[00:06:00] adopt some ideas from electra into the bert model so recall that for the mlm
[00:06:02] bert model so recall that for the mlm objective we essentially turned it off
[00:06:04] objective we essentially turned it off for tokens that weren't masked but
[00:06:06] for tokens that weren't masked but there's no principled reason why we're
[00:06:07] there's no principled reason why we're doing that we could of course have the
[00:06:09] doing that we could of course have the loss applied to every single one of the
[00:06:11] loss applied to every single one of the tokens in the input stream and that
[00:06:13] tokens in the input stream and that gives us all tokens mlm on the generator
[00:06:15] gives us all tokens mlm on the generator side
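The difference between these objectives comes down to which positions contribute to the loss. The sketch below contrasts scoring only the ~15% masked positions (as in ELECTRA 15% or standard BERT) with scoring every position (as in full ELECTRA or all-tokens MLM); the per-position probabilities are made up for illustration.

```python
import math

def total_loss(probs, positions):
    """Negative log-likelihood summed over the chosen positions."""
    return -sum(math.log(probs[i]) for i in positions)

probs = [0.9, 0.8, 0.6, 0.9, 0.7]  # toy P(correct prediction) per token
masked = [2]                        # only ~15% of positions are masked

loss_15 = total_loss(probs, masked)              # learning signal from 1 token
loss_all = total_loss(probs, range(len(probs)))  # learning signal from every token
```

The all-positions objective extracts gradient signal from every token in the batch, which is the data-efficiency argument made throughout this lecture.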
[00:06:17] side the central finding of the paper i
[00:06:18] the central finding of the paper i suppose is that electra is the best of
[00:06:20] suppose is that electra is the best of all of these models you also have a
[00:06:22] all of these models you also have a really good model if you do all tokens
[00:06:23] really good model if you do all tokens mlm which is something that might inform
[00:06:25] mlm which is something that might inform development on the bert side
[00:06:27] development on the bert side in addition to bert in the context of
[00:06:29] in addition to bert in the context of electra
[00:06:30] electra replaced mlm is less good and electra 15%
[00:06:33] replaced mlm is less good and electra 15% is kind of down there at the bottom near
[00:06:35] is kind of down there at the bottom near bert i think this is kind of showing us
[00:06:37] bert i think this is kind of showing us that we should make more predictions
[00:06:38] that we should make more predictions that was a guiding intuition for electra
[00:06:40] that was a guiding intuition for electra and it seems to be borne out by these
[00:06:42] and it seems to be borne out by these results
[00:06:44] results and finally uh as is common in this
[00:06:46] and finally as is common in this space the electra team did some model
[00:06:48] space the electra team did some model releases of pre-trained parameters that
[00:06:49] releases of pre-trained parameters that you can make use of they did electra
[00:06:51] you can make use of they did electra base and electra large which are kind of
[00:06:53] base and electra large which are kind of comparable to the corresponding bert
[00:06:55] comparable to the corresponding bert releases i think an interesting thing
[00:06:56] releases i think an interesting thing they did is also released this
[00:06:58] they did is also released this electra small model which is designed
[00:07:01] electra small model which is designed to quickly be trained on a single gpu
[00:07:03] to quickly be trained on a single gpu again tying into the idea that we ought
[00:07:06] again tying into the idea that we ought to be thinking about how we can train
[00:07:08] to be thinking about how we can train models like this when we have highly
[00:07:10] models like this when we have highly constrained compute resources electra is
[00:07:12] constrained compute resources electra is keyed into that idea from the very
[00:07:14] keyed into that idea from the very beginning i think the small model shows
[00:07:16] beginning i think the small model shows that it can be productive
Lecture 025
Practical Fine-tuning | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=Ns0JHUXyLE0
---
Transcript
[00:00:05] welcome back everyone this is part six
[00:00:07] welcome back everyone this is part six in our series on contextual word
[00:00:08] in our series on contextual word representations we're going to be
[00:00:09] representations we're going to be talking about practical fine tuning it's
[00:00:11] talking about practical fine tuning it's time to get hands-on with these
[00:00:13] time to get hands-on with these parameters we've been talking about so
[00:00:15] parameters we've been talking about so here's the guiding idea
[00:00:17] here's the guiding idea your existing architecture say for the
[00:00:18] your existing architecture say for the current original system and bake-off
[00:00:20] current original system and bake-off probably can benefit from contextual
[00:00:23] probably can benefit from contextual representations we've seen that in many
[00:00:25] representations we've seen that in many many contexts in nlu these days
[00:00:27] many contexts in nlu these days the notebook fine tuning shows you how
[00:00:29] the notebook fine tuning shows you how to bring in transformer representations
[00:00:31] to bring in transformer representations in two ways first with simple
[00:00:33] in two ways first with simple featurization and then with full on fine
[00:00:35] featurization and then with full on fine tuning and i'm going to talk about both
[00:00:37] tuning and i'm going to talk about both of those in the screencast
[00:00:39] of those in the screencast the heart of this idea is that by
[00:00:41] the heart of this idea is that by extending existing pi torch modules from
[00:00:43] extending existing pi torch modules from the course code distribution you can
[00:00:46] the course code distribution you can very easily create customized
[00:00:47] very easily create customized fine-tuning models with just a few lines
[00:00:49] fine-tuning models with just a few lines of code
[00:00:50] of code and that should be really empowering in
[00:00:52] and that should be really empowering in terms of exploring lots of different
[00:00:53] terms of exploring lots of different designs and seeing how best to use these
[00:00:55] designs and seeing how best to use these parameters for your problem i just want
[00:00:58] parameters for your problem i just want to mention that really and truly uh this
[00:01:00] to mention that really and truly uh this is only possible because of the amazing
[00:01:02] is only possible because of the amazing work that the hugging face team has done
[00:01:03] work that the hugging face team has done to make these parameters accessible to
[00:01:05] to make these parameters accessible to all of us
[00:01:07] all of us so let's start with simple featurization
[00:01:09] so let's start with simple featurization and i actually want to rewind to our
[00:01:11] and i actually want to rewind to our discussion of recurrent neural networks
[00:01:13] discussion of recurrent neural networks and think about how we represent
[00:01:14] and think about how we represent examples for those models in the
[00:01:17] examples for those models in the standard mode we have as our examples
[00:01:19] standard mode we have as our examples lists of tokens here
[00:01:21] lists of tokens here we convert those into lists of indices
[00:01:23] we convert those into lists of indices and those indices help us look up vector
[00:01:25] and those indices help us look up vector representations of those words in some
[00:01:28] representations of those words in some fixed embedding space and the result of
[00:01:30] fixed embedding space and the result of that is that each example is represented
[00:01:32] that is that each example is represented by a list of vectors and that's
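The tokens-to-indices-to-vectors pipeline just described can be sketched directly. The vocabulary and embedding matrix here are toy stand-ins for a real lookup table like GloVe vectors:

```python
import numpy as np

# A fixed (static) embedding: one row per vocabulary item.
vocab = {"a": 0, "b": 1, "c": 2}
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 4))  # 3 tokens x 4 dimensions

tokens = ["a", "b", "a"]
indices = [vocab[t] for t in tokens]   # tokens -> indices
vectors = embedding[indices]           # indices -> a list of vectors, shape (3, 4)

# With a fixed embedding, both occurrences of "a" get the identical vector:
assert np.array_equal(vectors[0], vectors[2])
```

The final assertion is exactly the point made above: a fixed embedding cannot distinguish occurrences of the same token.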
[00:01:34] by a list of vectors and that's important to keep in mind we tend to
[00:01:36] important to keep in mind we tend to think of the model as taking as its
[00:01:38] think of the model as taking as its inputs lists of tokens and having an
[00:01:40] inputs lists of tokens and having an embedding but from the point of view of
[00:01:42] embedding but from the point of view of the model itself it really wants to
[00:01:44] the model itself it really wants to process as inputs lists of vectors and
[00:01:47] process as inputs lists of vectors and that's the empowering idea because if we
[00:01:49] that's the empowering idea because if we use a fixed embedding of course then
[00:01:51] use a fixed embedding of course then these two occurrences of a will be the
[00:01:53] these two occurrences of a will be the same vector
[00:01:54] same vector and these two occurrences of b across
[00:01:56] and these two occurrences of b across examples will be the same vector
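The point about fixed embeddings can be sketched in a few lines of plain Python. This is a toy lookup table, not the course code:

```python
# Toy fixed embedding: every occurrence of a token type maps to the
# same vector, no matter where or in which example it appears.
fixed_embedding = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
}

def featurize(tokens):
    """Convert a token list into a list of vectors by fixed lookup."""
    return [fixed_embedding[t] for t in tokens]

ex1 = featurize(["a", "b", "a"])
ex2 = featurize(["b", "a"])

# Both occurrences of "a" in ex1 are the same vector, and "b" is the
# same vector across the two examples:
assert ex1[0] == ex1[2]
assert ex1[1] == ex2[0]
```

A contextual encoder replaces exactly this lookup step: it still turns a token list into a list of vectors, but the vector for each token can differ by position and context.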
[00:01:58] examples will be the same vector but the model doesn't really care that
[00:01:59] but the model doesn't really care that they're the same vector we could if we
[00:02:01] they're the same vector we could if we wanted to convert directly from token
[00:02:04] wanted to convert directly from token sequences into lists of vectors using a
[00:02:07] sequences into lists of vectors using a device like a bert model and that would
[00:02:09] device like a bert model and that would allow that a in the first position and a
[00:02:11] allow that a in the first position and a in the third could correspond to
[00:02:13] in the third could correspond to different vectors or b across these two
[00:02:15] different vectors or b across these two examples might correspond to different
[00:02:17] examples might correspond to different vectors that would be the contextual
[00:02:19] vectors that would be the contextual representation part of these models and
[00:02:22] representation part of these models and again from the point of view of the rnn
[00:02:23] again from the point of view of the rnn we can feed these in directly that's
[00:02:25] we can feed these in directly that's straightforward this is a complete
[00:02:27] straightforward this is a complete recipe for doing that using the sst code
[00:02:30] recipe for doing that using the sst code and the pytorch modules from this
[00:02:32] and the pytorch modules from this course code distribution so you can see
[00:02:33] course code distribution so you can see that
[00:02:34] that beyond the setup stuff which we've done
[00:02:36] beyond the setup stuff which we've done a few times the feature function is just
[00:02:39] a few times the feature function is just going to use bert functionality to look
[00:02:41] going to use bert functionality to look up the example's indices and then convert
[00:02:43] up the example's indices and then convert them into vector representations and
[00:02:45] them into vector representations and here as a summary we're going to use the
[00:02:48] here as a summary we're going to use the representation above the class token but
[00:02:51] representation above the class token but lots of things are possible at that
[00:02:52] lots of things are possible at that point
[00:02:53] point and then when we have our model wrapper
[00:02:54] and then when we have our model wrapper here we set up a torch rnn classifier
[00:02:57] here we set up a torch rnn classifier and they're just two things of note
[00:02:58] and they're just two things of note first we say use embedding equals false
[00:03:01] first we say use embedding equals false because we're going to feed vectors
[00:03:02] because we're going to feed vectors in directly there's no embedding involved
[00:03:04] in directly there's no embedding involved here and we also don't need to have a
[00:03:06] here and we also don't need to have a vocabulary you could specify one but
[00:03:08] vocabulary you could specify one but it's not involved because fundamentally
[00:03:10] it's not involved because fundamentally again the model deals directly with
[00:03:12] again the model deals directly with vectors
[00:03:14] vectors and then at sst experiment you again say
[00:03:16] and then at sst experiment you again say vectorize equals false and that is a
[00:03:18] vectorize equals false and that is a complete recipe for bringing in bert
[00:03:21] complete recipe for bringing in bert representations with the standard rnn
[00:03:24] representations with the standard rnn this isn't quite fine-tuning though so
[00:03:26] this isn't quite fine-tuning though so let's think about how we might get added
[00:03:27] let's think about how we might get added benefits from actually updating those
[00:03:29] benefits from actually updating those parameters as opposed to just using them
[00:03:31] parameters as opposed to just using them as frozen representations inputs to
[00:03:34] as frozen representations inputs to another model
[00:03:36] another model what i'd encourage you to do is think
[00:03:38] what i'd encourage you to do is think about subclassing the pytorch modules
[00:03:40] about subclassing the pytorch modules that are included in our course code
[00:03:42] that are included in our course code distribution because then you will
[00:03:44] distribution because then you will be able to write code just that oriented
[00:03:46] be able to write code just that oriented toward your model architecture and a lot
[00:03:48] toward your model architecture and a lot of the details of optimization and data
[00:03:50] of the details of optimization and data processing will be handled for you
[00:03:52] processing will be handled for you this is i hope a powerful example of
[00:03:54] this is i hope a powerful example of that it comes from the tutorial pytorch
[00:03:56] that it comes from the tutorial pytorch models notebook
[00:03:57] models notebook it's a torch softmax classifier and the
[00:04:00] it's a torch softmax classifier and the only thing we have to do is rewrite this
[00:04:01] only thing we have to do is rewrite this build graph function to specify one
[00:04:04] build graph function to specify one single dense layer we are using as
[00:04:06] single dense layer we are using as our base class the torch shallow neural
[00:04:08] our base class the torch shallow neural classifier which handles everything else
[00:04:10] classifier which handles everything else about setting up this model and
[00:04:12] about setting up this model and optimizing it if we wanted to go in the
[00:04:14] optimizing it if we wanted to go in the other direction and instead fit a really
[00:04:16] other direction and instead fit a really deep model we could again begin from
[00:04:18] deep model we could again begin from torch shallow neural classifier and
[00:04:20] torch shallow neural classifier and rewrite the build graph function so that
[00:04:22] rewrite the build graph function so that it just has more layers essentially and
[00:04:24] it just has more layers essentially and then what's happening in this init
[00:04:25] then what's happening in this init method is we're just giving the user
[00:04:27] method is we're just giving the user access to the various hyper parameters
[00:04:29] access to the various hyper parameters that they could choose to set up this
[00:04:31] that they could choose to set up this model
[00:04:32] model finally here's a more involved example
[00:04:35] finally here's a more involved example this one we start with a pytorch nn
[00:04:37] this one we start with a pytorch nn module kind of all the way down at the
[00:04:39] module kind of all the way down at the base here this is a torch linear
[00:04:41] base here this is a torch linear regression model
[00:04:42] regression model we set up the weight parameters here and
[00:04:44] we set up the weight parameters here and then we have the single forward pass
[00:04:45] then we have the single forward pass which corresponds to the structure of a
[00:04:47] which corresponds to the structure of a simple linear regression
[00:04:49] simple linear regression now
[00:04:50] now for the actual interface we need to do a
[00:04:52] for the actual interface we need to do a little bit more work here so
[00:04:54] little bit more work here so we set up the loss so that it's
[00:04:56] we set up the loss so that it's appropriate for a regression model as
[00:04:58] appropriate for a regression model as opposed to the classifiers we've been
[00:04:59] opposed to the classifiers we've been looking at up until now build graph just
[00:05:02] looking at up until now build graph just uses the nn module that i showed you a
[00:05:03] uses the nn module that i showed you a second ago we need to do a little bit of
[00:05:06] second ago we need to do a little bit of work in build data set rewrite that so
[00:05:08] work in build data set rewrite that so we process linear regression data
[00:05:10] we process linear regression data correctly
[00:05:11] correctly and then we do need to rewrite the
[00:05:12] and then we do need to rewrite the predict and score functions to be kind
[00:05:14] predict and score functions to be kind of good citizens of the code base that
[00:05:16] of good citizens of the code base that allow for hyper parameter optimization
[00:05:18] allow for hyper parameter optimization and cross validation and so forth but
[00:05:20] and cross validation and so forth but that's again straightforward and
[00:05:21] that's again straightforward and fundamentally for predict we're actually
[00:05:23] fundamentally for predict we're actually making use of the base classes
[00:05:24] making use of the base classes underscore predict method for the heavy
[00:05:27] underscore predict method for the heavy lifting there and then score of course
[00:05:29] lifting there and then score of course is just moving us out of the mode of
[00:05:30] is just moving us out of the mode of evaluating classifiers and into the mode
[00:05:33] evaluating classifiers and into the mode of evaluating regression models that's
[00:05:35] of evaluating regression models that's all you need to do and again
[00:05:36] all you need to do and again conspicuously absent from this is most
[00:05:38] conspicuously absent from this is most of the aspects of data processing and
[00:05:40] of the aspects of data processing and all of the details of optimization the
[00:05:42] all of the details of optimization the base class torch model base here has a
[00:05:45] base class torch model base here has a very full-featured fit method that you
[00:05:47] very full-featured fit method that you can use to optimize these models and do
[00:05:49] can use to optimize these models and do hyper parameter exploration
[00:05:52] hyper parameter exploration and that brings us to the star of the
[00:05:53] and that brings us to the star of the show which would be bert fine-tuning
[00:05:55] show which would be bert fine-tuning with hugging face parameters here we'll
[00:05:57] with hugging face parameters here we'll start with the pi torch nn.module
[00:06:00] start with the pi torch nn.module we load in a vert model module as we've
[00:06:03] we load in a vert model module as we've done before and make sure to set it to
[00:06:04] done before and make sure to set it to train so that it can be updated
[00:06:07] train so that it can be updated and then the new parameters here are
[00:06:08] and then the new parameters here are really just this classifier layer a
[00:06:10] really just this classifier layer a dense layer that's going to be oriented
[00:06:12] dense layer that's going to be oriented toward the classification structure that
[00:06:13] toward the classification structure that we want to what we want our model to
[00:06:15] we want to what we want our model to have
[00:06:16] have the forward method calls the forward
[00:06:18] the forward method calls the forward method of the burp model and you get a
[00:06:19] method of the burp model and you get a bunch of representations there are a lot
[00:06:21] bunch of representations there are a lot of options here what i've decided to do
[00:06:23] of options here what i've decided to do is just use the hugging based pooler
[00:06:25] is just use the hugging based pooler output which is some parameters on top
[00:06:27] output which is some parameters on top of the class token as the input to the
[00:06:29] of the class token as the input to the classifier
[00:06:31] classifier and when we optimize this model with
[00:06:33] and when we optimize this model with luck in a productive way not only will
[00:06:35] luck in a productive way not only will these classifier parameters be updated
[00:06:37] these classifier parameters be updated but also all the parameters of the spurt
[00:06:39] but also all the parameters of the spurt model that you loaded in in train mode
[00:06:42] model that you loaded in in train mode the interface is a little bit involved
[00:06:44] the interface is a little bit involved here so what we do is provide the user
[00:06:46] here so what we do is provide the user with some flexibility about what choices
[00:06:48] with some flexibility about what choices to make
[00:06:49] to make build graph again just loads in the
[00:06:50] build graph again just loads in the module that i showed you just a second
[00:06:52] module that i showed you just a second ago and then build the data set is a bit
[00:06:55] ago and then build the data set is a bit involved but what we do fundamentally is
[00:06:57] involved but what we do fundamentally is use the burp tokenizer to batch encode
[00:07:00] use the burp tokenizer to batch encode our data
[00:07:01] our data and then we do a little bit of
[00:07:02] and then we do a little bit of processing on the output labels to make
[00:07:04] processing on the output labels to make sure pi torch can make sense of them
[00:07:06] sure pi torch can make sense of them that's really it and the heart of this
[00:07:07] that's really it and the heart of this is just that we're again using hugging
[00:07:09] is just that we're again using hugging face functionality to represent our data
[00:07:12] face functionality to represent our data to the burp model and then this is the
[00:07:14] to the burp model and then this is the really interesting part
[00:07:16] really interesting part calling the forward method and then
[00:07:18] calling the forward method and then fitting the classifier on top is pretty
[00:07:20] fitting the classifier on top is pretty much all you need to do and of course
[00:07:22] much all you need to do and of course that opens up a world of options reps
[00:07:24] that opens up a world of options reps here has lots of other things that you
[00:07:25] here has lots of other things that you could use as the input to this
[00:07:27] could use as the input to this classifier layer
[00:07:29] classifier layer and many of them actually might be more
[00:07:30] and many of them actually might be more productive than the simple approach that
[00:07:32] productive than the simple approach that i've taken here
Lecture 026
Homework 3: Colors | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=6_R00t5Iyrg
---
Transcript
[00:00:05] welcome back everyone this screencast is
[00:00:07] welcome back everyone this screencast is an overview of the homework and bake off
[00:00:09] an overview of the homework and bake off associated with our unit on grounded
[00:00:11] associated with our unit on grounded language understanding more than any of
[00:00:13] language understanding more than any of the other assignments what we're asking
[00:00:15] the other assignments what we're asking you to do here is essentially develop a
[00:00:17] you to do here is essentially develop a fully integrated system that addresses
[00:00:19] fully integrated system that addresses our task
[00:00:21] our task so the distinction between homework
[00:00:23] so the distinction between homework questions and original system questions
[00:00:25] questions and original system questions is kind of getting blurred here in the
[00:00:26] is kind of getting blurred here in the interest of having you devote all your
[00:00:28] interest of having you devote all your energy to developing a cool original
[00:00:30] energy to developing a cool original system for this problem
[00:00:32] system for this problem so because of that i'm going to use some
[00:00:34] so because of that i'm going to use some slides to give you an overview of the
[00:00:36] slides to give you an overview of the entire problem and how we're thinking
[00:00:38] entire problem and how we're thinking about evaluation and how the questions
[00:00:40] about evaluation and how the questions feed into these overall goals
[00:00:42] feed into these overall goals so recall that our core task is the
[00:00:44] so recall that our core task is the stanford colors in context task and
[00:00:47] stanford colors in context task and we're going to take the speaker's
[00:00:48] we're going to take the speaker's perspective primarily and what that
[00:00:50] perspective primarily and what that means is that the inputs to our model
[00:00:52] means is that the inputs to our model are sequences of three color patches one
[00:00:54] are sequences of three color patches one of them designated as the target and the
[00:00:56] of them designated as the target and the task is to generate
[00:00:58] task is to generate a description of the target in that
[00:01:00] a description of the target in that particular context
[00:01:03] the core model that we'll be using which
[00:01:05] the core model that we'll be using which is in torch color describer is an
[00:01:08] is in torch color describer is an encoder decoder architecture and the way
[00:01:10] encoder decoder architecture and the way it works is on the encoder side you have
[00:01:12] it works is on the encoder side you have a sequence of three colors and we always
[00:01:14] a sequence of three colors and we always put the target color in the third
[00:01:17] put the target color in the third position
[00:01:18] position so those are the inputs and then the
[00:01:20] so those are the inputs and then the decoding step is essentially to describe
[00:01:22] decoding step is essentially to describe the target in that context so that's the
[00:01:24] the target in that context so that's the natural language generation part and
[00:01:26] natural language generation part and we've covered this core architecture in
[00:01:29] we've covered this core architecture in previous screencasts and i'll return to
[00:01:31] previous screencasts and i'll return to some of the modifications that you see
[00:01:33] some of the modifications that you see here in the context of question four
[00:01:36] here in the context of question four there's a separate notebook called
[00:01:38] there's a separate notebook called caller's overview that you should start
[00:01:40] caller's overview that you should start with it gives you a sense for what the
[00:01:41] with it gives you a sense for what the data set is like and also what our
[00:01:43] data set is like and also what our modeling code is like here you can see
[00:01:45] modeling code is like here you can see that i've loaded in the corpus itself
[00:01:47] that i've loaded in the corpus itself it's got about 47 000 examples in it and
[00:01:50] it's got about 47 000 examples in it and each one of those examples has a number
[00:01:52] each one of those examples has a number of different attributes that you should
[00:01:53] of different attributes that you should be aware of so here's a typical example
[00:01:56] be aware of so here's a typical example the first one in the corpus
[00:01:58] the first one in the corpus fundamentally you have these three color
[00:01:59] fundamentally you have these three color patches and you can see this display is
[00:02:01] patches and you can see this display is marking out the target as well as an
[00:02:03] marking out the target as well as an utterance
[00:02:05] utterance each one of these colors is encoded as a
[00:02:07] each one of these colors is encoded as a triple of hsv values that is a sequence
[00:02:09] triple of hsv values that is a sequence of three floats
[00:02:11] of three floats and you can see here that you can also
[00:02:12] and you can see here that you can also access the utterance
[00:02:16] there are three conditions in the
[00:02:17] there are three conditions in the underlying corpus that vary in their
[00:02:19] underlying corpus that vary in their difficulty in the far condition all
[00:02:22] difficulty in the far condition all three of the colors are quite different
[00:02:23] three of the colors are quite different from each other so the task of
[00:02:25] from each other so the task of identifying the target is typically
[00:02:27] identifying the target is typically pretty easy here the person just had to
[00:02:29] pretty easy here the person just had to say purple
[00:02:31] say purple in the split condition two of the colors
[00:02:33] in the split condition two of the colors are highly confusable so you can see
[00:02:35] are highly confusable so you can see here that we have two green colors and
[00:02:37] here that we have two green colors and that pushed the speaker to choose a kind
[00:02:39] that pushed the speaker to choose a kind of more specified form of green in
[00:02:41] of more specified form of green in saying line
[00:02:43] saying line and the hardest condition is the close
[00:02:45] and the hardest condition is the close condition and that's where all three of
[00:02:46] condition and that's where all three of the colors are highly similar to each
[00:02:48] the colors are highly similar to each other this tends to lead to the longest
[00:02:51] other this tends to lead to the longest descriptions you can see here that the
[00:02:52] descriptions you can see here that the speaker even took two turns as indicated
[00:02:55] speaker even took two turns as indicated by this boundary marker to try to give
[00:02:57] by this boundary marker to try to give their full description medium pink the
[00:02:59] their full description medium pink the medium dark one because these colors are
[00:03:01] medium dark one because these colors are so confusable
[00:03:02] so confusable so you should be aware of this
[00:03:03] so you should be aware of this difference in the conditions and it
[00:03:04] difference in the conditions and it might affect how you do different kinds
[00:03:06] might affect how you do different kinds of modeling based on what the color
[00:03:08] of modeling based on what the color sequence is like
[00:03:11] now evaluation for natural language
[00:03:13] now evaluation for natural image generation systems is always challenging
[00:03:14] generation systems is always challenging and there are some automatic metrics
[00:03:16] and there are some automatic metrics that we can use as guideposts in fact
[00:03:18] that we can use as guideposts in fact we're going to use blue in various
[00:03:19] we're going to use blue in various places
[00:03:20] places but our primary evaluation metric will
[00:03:23] but our primary evaluation metric will be this task oriented one which brings
[00:03:25] be this task oriented one which brings in a kind of listener perspective
[00:03:28] in a kind of listener perspective so at a mechanical level here's how
[00:03:29] so at a mechanical level here's how we'll make predictions for a given
[00:03:31] we'll make predictions for a given context c consisting of three colors
[00:03:35] context c consisting of three colors capital c here is all the permutations
[00:03:37] capital c here is all the permutations of those three colors
[00:03:39] of those three colors suppose that you have trained a speaker
[00:03:41] suppose that you have trained a speaker model ps here it's a probabilistic agent
[00:03:44] model ps here it's a probabilistic agent we're going to think about how it makes
[00:03:46] we're going to think about how it makes predictions for all of those different
[00:03:47] predictions for all of those different permutations and take as its prediction
[00:03:50] permutations and take as its prediction at the level of a full sequence
[00:03:52] at the level of a full sequence the sequence that it assigns the highest
[00:03:54] the sequence that it assigns the highest probability to given the message that
[00:03:57] probability to given the message that your system produced
[00:03:59] your system produced and then we say that a speaker is
[00:04:00] and then we say that a speaker is accurate
[00:04:01] accurate in its prediction about some context c
[00:04:04] in its prediction about some context c just in case the
[00:04:06] just in case the best sequence that it predicts the
[00:04:07] best sequence that it predicts the highest probability one has the target
[00:04:09] highest probability one has the target in the final position as designated by
[00:04:12] in the final position as designated by our model structure
[00:04:13] our model structure so in a little bit more detail here's
[00:04:14] so in a little bit more detail here's how this works with an example suppose
[00:04:16] how this works with an example suppose that our context looks like this it has
[00:04:18] that our context looks like this it has these three color patches and the target
[00:04:20] these three color patches and the target is always in third position and our
[00:04:22] is always in third position and our message was blue
[00:04:24] message was blue here on the right we have all the
[00:04:26] here on the right we have all the permutations of these three colors and
[00:04:28] permutations of these three colors and we're going to say that your system was
[00:04:29] we're going to say that your system was correct if its highest probability
[00:04:31] correct if its highest probability context given that message was one of
[00:04:34] context given that message was one of these two that is one of the two that
[00:04:36] these two that is one of the two that has the target in final position and the
[00:04:38] has the target in final position and the system is inaccurate to the extent that
[00:04:40] system is inaccurate to the extent that it signs higher probability to one of
[00:04:42] it signs higher probability to one of these other sequences essentially we're
[00:04:44] these other sequences essentially we're saying that it's assigning higher
[00:04:46] saying that it's assigning higher probability to some other target but we
[00:04:48] probability to some other target but we do operate at the level of these full
[00:04:50] do operate at the level of these full sequences
[00:04:52] sequences all right now let's move into the
[00:04:53] all right now let's move into the questions here we first start with the
[00:04:55] questions here we first start with the tokenizer you're unconstrained in how
[00:04:58] tokenizer you're unconstrained in how you design your tokenizer you should
[00:04:59] you design your tokenizer you should just make sure that you have a start
[00:05:00] just make sure that you have a start symbol and an n symbol the start symbol
[00:05:03] symbol and an n symbol the start symbol is important conditioning context for
[00:05:04] is important conditioning context for the model and the end symbol is a
[00:05:06] the model and the end symbol is a crucial signal that your model will
[00:05:08] crucial signal that your model will actually actually stop producing tokens
[00:05:10] actually actually stop producing tokens so don't forget those pieces but in
[00:05:12] so don't forget those pieces but in terms of what else you do in there it's
[00:05:14] terms of what else you do in there it's unconstrained and i think you can see
[00:05:15] unconstrained and i think you can see from the monroe it all work that making
[00:05:18] from the monroe it all work that making smart choices about tokenization might
[00:05:20] smart choices about tokenization might be really meaningful
[00:05:22] be really meaningful question two asks you to think about how
[00:05:24] question two asks you to think about how you're representing colors by default
[00:05:26] you're representing colors by default they're just gonna be those three float
[00:05:27] they're just gonna be those three float values but that's probably not optimal
[00:05:30] values but that's probably not optimal uh in the monroe all paper we explore a
[00:05:32] uh in the monroe all paper we explore a fourier transform as a way of embedding
[00:05:34] fourier transform as a way of embedding colors and i've given you a little
[00:05:36] colors and i've given you a little recipe for that in the context of the
[00:05:38] recipe for that in the context of the notebook in case you want to explore
[00:05:40] notebook in case you want to explore that it is highly effective but this is
[00:05:42] that it is highly effective but this is optional and there might be other
[00:05:44] optional and there might be other representation schemes that are even
[00:05:46] representation schemes that are even better and worth exploring
[00:05:49] better and worth exploring question 3 asks you to think about rich
[00:05:51] question 3 asks you to think about rich initialization or pre-training for your
[00:05:53] initialization or pre-training for your model we've worked a lot with pre-chain
[00:05:55] model we've worked a lot with pre-chain glove embeddings and this is a chance
[00:05:56] glove embeddings and this is a chance for you to bring those into your model
[00:05:58] for you to bring those into your model and see how well they do you should be
[00:06:00] and see how well they do you should be aware that this step is going to
[00:06:01] aware that this step is going to interact in non-trivial ways with
[00:06:04] interact in non-trivial ways with choices you make for your tokenizer
[00:06:06] choices you make for your tokenizer and question four is the most involved
[00:06:08] and question four is the most involved that involves some real pie torch
[00:06:10] that involves some real pie torch wrangling conceptually what we're asking
[00:06:12] wrangling conceptually what we're asking you to do is borrow a trick from the
[00:06:14] you to do is borrow a trick from the monroe at all paper what we found in
[00:06:16] monroe et al. paper what we found in
[00:06:18] that work is that it helped to remind the model during decoding of which of
[00:06:20] the model during decoding of which of the three colors was its target the way
[00:06:22] the three colors was its target the way we did that essentially was by taking
[00:06:25] we did that essentially was by taking the color embedding for the target and
[00:06:27] the color embedding for the target and appending it to the embedding of each
[00:06:29] appending it to the embedding of each one of the tokens that it was producing
[00:06:32] one of the tokens that it was producing as a kind of reminder
[00:06:34] as a kind of reminder in terms of how that works at the level
[00:06:36] in terms of how that works at the level of code there is a decoder class and you
[00:06:38] of code there is a decoder class and you should modify it so that the input
[00:06:40] should modify it so that the input vector to the model at each time step is
[00:06:43] vector to the model at each time step is not just the token embedding but the
[00:06:45] not just the token embedding but the concatenation of that embedding with the
[00:06:47] concatenation of that embedding with the representation of the target color
[00:06:50] representation of the target color then you need to modify the encoder
[00:06:51] then you need to modify the encoder decoder class to extract the target
[00:06:54] decoder class to extract the target colors and feed them to that decoder
[00:06:56] colors and feed them to that decoder class
[00:06:57] class and then finally here this is the
[00:06:58] and then finally here this is the interface that you use modify that
[00:07:01] interface that you use modify that interface so that it uses your decoder
[00:07:03] interface so that it uses your decoder encoder and that's a pretty mechanical
[00:07:04] encoder and that's a pretty mechanical step
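As a rough illustration of the decoder change just described (appending the target color's representation to every token embedding before the recurrent step), here is a minimal PyTorch sketch. The class and parameter names are hypothetical, not the assignment codebase's actual API:

```python
import torch
import torch.nn as nn

class ColorContextDecoder(nn.Module):
    """Illustrative decoder: at every timestep, the RNN input is the
    token embedding concatenated with the target color's vector."""

    def __init__(self, vocab_size, embed_dim, color_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # The GRU's input size grows to fit the appended color vector.
        self.rnn = nn.GRU(embed_dim + color_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, target_colors, hidden=None):
        # token_ids: (batch, seq_len); target_colors: (batch, color_dim)
        embs = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        # Repeat the target color at every timestep, then concatenate.
        colors = target_colors.unsqueeze(1).expand(-1, embs.shape[1], -1)
        inputs = torch.cat([embs, colors], dim=-1)
        out, hidden = self.rnn(inputs, hidden)
        return self.output(out), hidden

# Quick shape check on toy inputs:
dec = ColorContextDecoder(vocab_size=10, embed_dim=4, color_dim=3, hidden_dim=8)
logits, _ = dec(torch.randint(0, 10, (2, 5)), torch.randn(2, 3))
assert logits.shape == (2, 5, 10)
```

The enclosing encoder-decoder would then be responsible for pulling the target color out of each color context and passing it in as `target_colors`, as described above.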
[00:07:06] When you're developing on this problem, use toy datasets, because you don't want to wait around as you process the entire colors corpus only to find out that you have a low-level bug. I also encourage you to lean on the tests that we have included in the notebook as a way of ensuring that you have exactly the right data structures. Assuming all those pieces fall into place, I think you'll find that the resulting models are substantially better for our task.
[00:07:32] That brings us to the original system, and here are just some expectations about how we think you might work on this problem. You could iteratively improve your answers to the assignment questions as part of the original system: modify the tokenizer, think about your GloVe embeddings, think about how you're representing colors, and how all those pieces are interacting. You might want to extend the modified encoder-decoder classes to do new and interesting things, and I provided guidance on how to do that at a mechanical level in the colors overview notebook.
[00:08:04] Any data that you can find is fine to bring in for development and for training your original system.
[00:08:10] The bake-off involves a new test set that's never been released anywhere before; it's just used in this context. It's got the same kinds of color contexts as in the released corpus, but it was one-off games rather than iterated games, and I do think that makes this test set a little bit easier than the training set. And all the items have been listener-validated, so I think all the descriptions are, in principle, good descriptions at a human level, and so it should be a good basis for evaluation.
Lecture 027
Grounded Language Understanding | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=OW7aDflHdG0
---
Transcript
[00:00:04] Welcome, everyone. This is part one in our series on grounded language understanding; I'm just going to give an overview. With grounding, I feel like we're really getting at the heart of what makes NLU so special for NLP, and also for artificial intelligence more broadly, so this is exciting. Let's dive in.
[00:00:18] Now, grounding is a very large topic, and so, to ground it, so to speak, we're going to be focused on a particular task, which is color reference in context. I'll be saying much more about that later on. This notebook, colors overview, provides an overview of the dataset, and that dataset is the centerpiece for the homework and associated bake-off. The core reading is the paper that introduced that dataset, Monroe et al. 2017, and I think that paper is noteworthy also for introducing some interesting modeling ideas that are worthy of further exploration, possibly in final projects.
[00:00:52] And I also just want to recommend a whole bunch of auxiliary readings: not required, but exciting extensions that you might make. I think grounding is a wonderful chance to do interdisciplinary work; you can connect NLP with robotics and computer vision and human language acquisition and probably lots of other topics. So I'm going to be pushing papers and datasets throughout this series of screencasts in hopes that you can pick up those ideas and run with them for your own projects.
[00:01:20] Now, to start, I thought we could just reflect a little bit on the heart of this, which is why grounding is so important, and why natural language understanding is so hard. To kick that off, I've taken a slide idea from Andrew McCallum. Andrew just asks us to reflect a little bit on the 1968 Stanley Kubrick movie 2001: A Space Odyssey. In that movie, the spaceship's computer, which is called HAL, can do three things that are noteworthy: it can display computer graphics, it can play chess, and it can conduct natural, open-domain conversations with humans. So this is a chance to ask: how well did the filmmakers do at predicting what computers would be capable of in the actual year 2001, which is, of course, ancient history for us at this point?
[00:02:05] So let's start with the graphics. On the left, you have some of the graphics that HAL is able to display in the movie, and you can see that they are extremely primitive. The filmmakers seem to have wildly underestimated just how much progress would happen in computer graphics. By 1993, which is much earlier than 2001, of course, we had the movie Jurassic Park, which had these incredible graphics for lifelike moving dinosaurs. So let's say that this is a kind of failure to imagine the future.
[00:02:34] For chess, it seems like they got the prediction just about right. In the movie, HAL is an excellent chess player, and just a few years before the actual 2001, in 1997, Deep Blue was the first supercomputer to beat world-champion chess players.
[00:02:51] What about dialogue and natural language use? On the left here, you have a sample dialogue from the movie. Dave Bowman is the human; he says, "Open the pod bay doors, HAL," and HAL replies, "I'm sorry, Dave. I'm afraid I can't do that." "What are you talking about, HAL?" And HAL replies, "I know that you and Frank were planning to disconnect me, and I'm afraid that's something I cannot allow to happen." Very interesting: not only is it fluent English, of course, but it's also displaying really rich reasoning about plans and goals, and it's fully grounded in what's happening in the ship. Just incredibly realistic.
[00:03:25] To give the filmmakers even a fighting chance here, let's move forward to the year 2014, which is about when Siri hit the market. We talked about Siri earlier; here you can see Siri doing a much more mundane version of what we just saw HAL doing, which is kind of proactively recognizing plans and goals and helping a human user solve a problem using fluent English. In this case, it's just about where to buy food, but the vision is very similar.
[00:03:51] vision is very similar what's at what was life actually like in
[00:03:53] what's at what was life actually like in 2014 or for that matter in the present
[00:03:56] 2014 or for that matter in the present day well i also showed you this dialogue
[00:03:59] day well i also showed you this dialogue from stephen colbert from his show the
[00:04:00] from stephen colbert from his show the pretense is that he has been playing
[00:04:02] pretense is that he has been playing with his phone all day and therefore has
[00:04:04] with his phone all day and therefore has failed to produce material for the show
[00:04:06] failed to produce material for the show the cameras are on him and he's
[00:04:07] the cameras are on him and he's desperate he asks siri for help and you
[00:04:10] desperate he asks siri for help and you can see here that siri does not have a
[00:04:12] can see here that siri does not have a deep understanding of what he's trying
[00:04:14] deep understanding of what he's trying to achieve i've bolded god and cameras
[00:04:16] to achieve i've bolded god and cameras in stephen's utterance because you can
[00:04:18] in stephen's utterance because you can see siri just picks up on those as kind
[00:04:20] see siri just picks up on those as kind of keywords and says churches and camera
[00:04:22] of keywords and says churches and camera stores it's not even topically relevant
[00:04:25] stores it's not even topically relevant it's just a complete failure to
[00:04:26] it's just a complete failure to recognize what he's trying to do and
[00:04:28] recognize what he's trying to do and then later things get even worse siri
[00:04:31] then later things get even worse siri really doesn't understand what steven is
[00:04:33] really doesn't understand what steven is saying and so it does the standard
[00:04:34] saying and so it does the standard escape valve which is
[00:04:37] escape valve which is it searches the web for the
[00:04:39] it searches the web for the speech-to-text transcription of the
[00:04:41] speech-to-text transcription of the thing that he said in hopes that that
[00:04:43] thing that he said in hopes that that will be helpful a far cry from anything
[00:04:46] will be helpful a far cry from anything like
[00:04:46] like a helpful useful human-like interaction
[00:04:49] a helpful useful human-like interaction with language
[00:04:52] Now, why is this so difficult? I think another angle on that question is usefully brought to the fore with this analogy that Stephen Levinson offers. He asks us to look at this Rembrandt sketch here and just reflect on the fact that you can make out people and structures in the background; really, it's incredible that you can do any of that. So he says: we interpret this sketch instantly and effortlessly as a gathering of people before a structure, probably a gateway; the people are listening to a single declaiming figure in the center. Then he says: but all this is a miracle, for there is little detailed information in the lines or shading; such as there is, every line is a mere suggestion. So here's the miracle: from a mere sketchiest squiggle of lines, you and I converge to find adumbration of a coherent scene.
[00:05:41] That is indeed a visual miracle and a cognitive miracle, and it's also a glimpse into why computer vision is so challenging. To make the connection with language, Levinson continues: the problem of utterance interpretation is not dissimilar to this visual miracle. An utterance is not, as it were, a veridical model or snapshot of the scene it describes; rather, an utterance is just as sketchy as the Rembrandt drawing.
[00:06:05] So much of what we communicate as speakers is left implicit, and so much of what listeners are able to extract from our utterances is stuff that they're able to extract only by reasoning in a general way about the context, plans and goals, world knowledge, and so forth. If our utterances were actually fully encoding in their semantics everything we intended to communicate, I think we would have talking robots at this point. But the truth is that so much of communication in natural language is left up to the context, in a very general sense, and that's exactly what makes this problem so challenging.
[00:06:42] In a way, though, all of this grounding into the context, and all this reasoning, if you bring it into your system, can make things easier; it might make some intractable problems tractable. One glimpse of that is this topic of what linguists and philosophers call indexicality. Indexicals are phrases like "I," as in "I am speaking": that obviously makes reference to the speaker, and that reference is going to vary depending on who's speaking. That's a case where you can't possibly understand the statement unless you know something about who's speaking, which is a very simple kind of grounding.
[00:07:17] "We won" shows a similar kind of grounding, but it's more complicated. Now we have this phrase "we," which probably, by default, is expected to include the speaker, but it needs to include others, and figuring out who else it includes can be difficult. You also get more challenging uses where you say things like "we," as in "we, the sports team that I follow," or something like that. So we have grounding plus a whole bunch of contextual reasoning in order to figure out what "we" would mean.
[00:07:43] "I am here": of course, "I" is for the speaker, and that's one kind of grounding, but "here" is an indexical expression referring to a location, and it does that in a very complicated way. When I say "I am here," I could mean my office, or Stanford, and, I suppose, all the way up to planet Earth, although that's unlikely, because it's not so informative in 2021 to say "I am on planet Earth."
[00:08:04] "We want to go here," as another use, has "we" for one kind of grounding, and in this case "here," if I'm pointing to a map, would be an even more complicated kind of displaced indexical reference, where the map is doing some iconic duty for some actual place in the world that we are aiming to go. So, another kind of complicated reasoning, but again grounded in something about the utterance context.
[00:08:27] in something about the utterance context we went to a local bar after work here
[00:08:29] we went to a local bar after work here the index goal is the word local and it
[00:08:31] the index goal is the word local and it just shows that indexicality can sneak
[00:08:33] just shows that indexicality can sneak into other parts of speech local here is
[00:08:35] into other parts of speech local here is going to refer to things that are
[00:08:36] going to refer to things that are somehow in the immediate vicinity of the
[00:08:39] somehow in the immediate vicinity of the location of the utterance and again in a
[00:08:41] location of the utterance and again in a very complicated way
[00:08:44] very complicated way and then three days ago tomorrow and now
[00:08:46] and then three days ago tomorrow and now are temporal indexables and they just
[00:08:48] are temporal indexables and they just show that the meaning of an utterance
[00:08:50] show that the meaning of an utterance can vary depending on when it's spoken
[00:08:52] can vary depending on when it's spoken and all of these expressions are kind of
[00:08:53] and all of these expressions are kind of anchored to that time of utterance
[00:08:58] And there are other kinds of context dependence that really require us to understand utterances in their full grounded context. Let's start with a simple example: "Where are you from?" This can be a vexing question when people ask it, because it can often be difficult to know what their true goals and intentions are with the question. They could mean your birthplace; I would say Connecticut. They could mean your nationality; I might say the U.S. Affiliation: for me, that would be Stanford. And again, maybe one day it will be informative to say "planet Earth," if there are intergalactic meetings; that one is typically ruled out because it's not so helpful in 2021. But for the rest of them, we kind of have to guess, often, about what the speaker is asking of us in order to figure out how to answer.
[00:09:41] Here are some other examples. "I didn't see any": that's one particular sentence, and its meaning is underspecified. In the context of the question "Are there typos in my slides?", "I didn't see any" will take on one sense. In the context "Are there bookstores downtown?", "I didn't see any" will take on a very different sense. "Are there cookies in the cupboard?" "I didn't see any": yet again, another kind of sense. And of course, there is no end to the number of different contexts we can place this sentence in, and each one is likely to modulate the meaning of "I didn't see any" in some complicated and subtle way. We hardly reflect on this, but it's an incredible process.
[00:10:17] it's an incredible process so just to round this out here's an
[00:10:18] so just to round this out here's an example routine pragmatic enrichment
[00:10:20] example routine pragmatic enrichment i've got this simple sentence in the
[00:10:22] i've got this simple sentence in the middle here many students met with me
[00:10:24] middle here many students met with me yesterday it's not a very complicated
[00:10:26] yesterday it's not a very complicated sentence cognitively or linguistically i
[00:10:28] sentence cognitively or linguistically i think we can easily understand it but
[00:10:29] think we can easily understand it but reflect for a second on just how many
[00:10:31] reflect for a second on just how many hooks this utterance has into the
[00:10:33] hooks this utterance has into the context we need to know what the time of
[00:10:36] context we need to know what the time of utterance is to understand yesterday and
[00:10:37] utterance is to understand yesterday and then turn to understand the whole
[00:10:39] then turn to understand the whole sentence we need to ask how big is the
[00:10:41] sentence we need to ask how big is the contextually restricted domain of
[00:10:43] contextually restricted domain of students here in order to figure out
[00:10:44] students here in order to figure out whether you know how many many is
[00:10:47] whether you know how many many is uh is it false for most students did i
[00:10:49] uh is it false for most students did i avoid saying most or all because that
[00:10:50] avoid saying most or all because that would be false and instead chose a
[00:10:52] would be false and instead chose a weaker form many that would be a kind of
[00:10:54] weaker form many that would be a kind of reasoning that many listeners will
[00:10:56] reasoning that many listeners will undergo what's the additional contextual
[00:10:58] undergo what's the additional contextual restriction is students just students in
[00:11:00] restriction is students just students in our course students i advise students at
[00:11:03] our course students i advise students at stanford students in the world
[00:11:05] stanford students in the world again the context will tell us who's the
[00:11:07] again the context will tell us who's the speaker of course that's a
[00:11:08] speaker of course that's a straightforward indexical
[00:11:10] straightforward indexical and then there are other kinds of
[00:11:11] and then there are other kinds of inferences that we might make based on
[00:11:13] inferences that we might make based on the restrictive modifiers that the
[00:11:15] the restrictive modifiers that the speaker chose
[00:11:16] speaker chose again we don't reflect on it but all
[00:11:18] again we don't reflect on it but all this stuff is happening kind of
[00:11:20] this stuff is happening kind of effortlessly and automatically this in
[00:11:22] effortlessly and automatically this in levinson's terms is the merest
[00:11:24] levinson's terms is the merest sketchiest squiggle of what actually
[00:11:26] sketchiest squiggle of what actually gets communicated and that is what's so
[00:11:28] gets communicated and that is what's so hard about so many aspects of nlu
[00:11:33] hard about so many aspects of nlu now i want to go back into history at
[00:11:34] now i want to go back into history at least once more to terry winograd's
[00:11:37] least once more to terry winograd's system sure blue because this just shows
[00:11:39] system sure blue because this just shows that at the start of the field of ai and
[00:11:42] that at the start of the field of ai and natural language processing the focus
[00:11:44] natural language processing the focus was entirely on these grounded
[00:11:46] was entirely on these grounded understanding problems so sure blue was
[00:11:49] understanding problems so sure blue was a fully grounded system that parsed the
[00:11:51] a fully grounded system that parsed the user's input
[00:11:52] user's input mapped it to a logical form and
[00:11:54] mapped it to a logical form and interpreted that logical form in a very
[00:11:56] interpreted that logical form in a very particular world and then it would try
[00:11:58] particular world and then it would try to take some action and generate
[00:12:00] to take some action and generate responses it's incredible and i love
[00:12:02] responses it's incredible and i love this characterization um from this
[00:12:05] this characterization um from this youtube clip one project did succeed
[00:12:07] youtube clip one project did succeed terry winograd's program shirt blue
[00:12:09] terry winograd's program shirt blue could use english intelligently but
[00:12:10] could use english intelligently but there was a catch the only subject you
[00:12:12] there was a catch the only subject you could discuss was a micro world of
[00:12:14] could discuss was a micro world of simulated blocks right this is wonderful
[00:12:16] simulated blocks right this is wonderful in the sense that it achieves the goal
[00:12:18] in the sense that it achieves the goal of grounding but it was very far from
[00:12:20] of grounding but it was very far from being scalable in any sense that would
[00:12:22] being scalable in any sense that would make it practical
[00:12:24] make it practical but here's a kind of simple dialogue
[00:12:26] but here's a kind of simple dialogue from sherloo and the thing i just want
[00:12:28] from sherloo and the thing i just want to point out is that there is so much
[00:12:29] to point out is that there is so much implicit grounding into the context the
[00:12:32] implicit grounding into the context the box is restricted to the domain and
[00:12:34] box is restricted to the domain and therefore has reference of course there
[00:12:36] therefore has reference of course there isn't a unique box in the universe so
[00:12:38] isn't a unique box in the universe so the box in the general context might be
[00:12:40] the box in the general context might be very confusing but in the blocks world
[00:12:42] very confusing but in the blocks world it made sense and you can see that
[00:12:44] it made sense and you can see that person leveraging that and the computer
[00:12:46] person leveraging that and the computer can understand it because it too is
[00:12:48] can understand it because it too is grounded in this particular context and
[00:12:50] grounded in this particular context and therefore can make use of all of that
[00:12:52] therefore can make use of all of that implicit information
[00:12:54] implicit information informing its utterances and
[00:12:55] informing its utterances and interpreting the humans utterances and
[00:12:57] interpreting the humans utterances and you see that pervasively throughout
[00:12:59] you see that pervasively throughout sample dialogues with assured loop it's
[00:13:01] sample dialogues with assured loop it's a compelling vision about the kinds of
[00:13:03] a compelling vision about the kinds of things that we need to have and all of
[00:13:06] things that we need to have and all of it turns on this very rich notion of
[00:13:08] it turns on this very rich notion of grounding in the blocks world
[00:13:12] Finally, another connection I want to make: the very best devices in the universe, as far as we know, for acquiring natural languages are humans. What do humans do? Well, first-language acquirers, children, learn language with incredible speed, in just a few years; that's noteworthy. And they do it despite relatively few inputs; I mean, they get a lot of language data in the ideal situation, but it's nothing compared to what current language models get to see. They use cues from contrasts inherent in the forms they hear, which is a distributional idea that we're familiar with, but also social cues and assumptions about the speaker's goals. I just feel that the very richness of this picture and its multimodal aspects are really important guiding clues for us.
[00:13:57] us so what are the consequences of all this
[00:13:59] so what are the consequences of all this for nou well as i said since human
[00:14:01] for nou well as i said since human children are the best agents in the
[00:14:02] children are the best agents in the university learning language and they
[00:14:04] university learning language and they depend on grounding it seems like our
[00:14:06] depend on grounding it seems like our systems ought to be grounded as well
[00:14:09] systems ought to be grounded as well problems that are intractable without
[00:14:11] problems that are intractable without grounding are solvable with the right
[00:14:12] grounding are solvable with the right kinds of grounding that's important to
[00:14:14] kinds of grounding that's important to keep in mind grounded problems can seem
[00:14:16] keep in mind grounded problems can seem hard but the other aspect of that is
[00:14:18] hard but the other aspect of that is that some problems might be completely
[00:14:19] that some problems might be completely intractable unless you have some notion
[00:14:21] intractable unless you have some notion of grounding indexables come to line
[00:14:25] of grounding indexables come to line thinking about current day modeling deep
[00:14:27] thinking about current day modeling deep learning is a flexible toolkit for
[00:14:29] learning is a flexible toolkit for reasoning about different kinds of
[00:14:30] reasoning about different kinds of information in a single model you can
[00:14:32] information in a single model you can bring in language data image data event
[00:14:34] bring in language data image data event video data audio data and so forth and
[00:14:37] video data audio data and so forth and therefore it has led to conceptual
[00:14:39] therefore it has led to conceptual improvements
[00:14:40] improvements the ungrounded language models of today
[00:14:42] the ungrounded language models of today get a lot of publicity but there are
[00:14:44] get a lot of publicity but there are also many exciting systems that are
[00:14:46] also many exciting systems that are fluently reasoning about images and
[00:14:49] fluently reasoning about images and video and language together and i think
[00:14:51] video and language together and i think that's a really nice step forward into
[00:14:53] that's a really nice step forward into the world of true grounding
[00:14:55] the world of true grounding so we should seek out and develop data
[00:14:57] so we should seek out and develop data sets that include the right kind of
[00:14:59] sets that include the right kind of grounding because the central thesis
[00:15:01] grounding because the central thesis here is that that can lead to
[00:15:03] here is that that can lead to progress by leaps and bounds
[00:15:06] progress by leaps and bounds so again to round this out let me
[00:15:07] so again to round this out let me encourage you to think about this for
[00:15:09] encourage you to think about this for final projects we're going to be working
[00:15:11] final projects we're going to be working with the stanford english colors and
[00:15:12] with the stanford english colors and context corpus there is also a chinese
[00:15:15] context corpus there is also a chinese version and we've explored exciting
[00:15:17] version and we've explored exciting ideas involving monolingual chinese and
[00:15:19] ideas involving monolingual chinese and english speakers as well as bilingual
[00:15:21] english speakers as well as bilingual models for this data set
[00:15:24] models for this data set if you want to do a little bit more in
[00:15:26] if you want to do a little bit more in terms of grounding slightly more
[00:15:27] terms of grounding slightly more complicated context i would recommend
[00:15:29] complicated context i would recommend the one common data set
[00:15:31] the one common data set the enviro map corpus is an early task
[00:15:35] the enviro map corpus is an early task oriented grounded corpus that could be
[00:15:37] oriented grounded corpus that could be exciting especially if you want to do
[00:15:39] exciting especially if you want to do some interesting initial steps involving
[00:15:41] some interesting initial steps involving language and reinforcement learning
[00:15:44] language and reinforcement learning the cards corpus would be much more
[00:15:45] the cards corpus would be much more ambitious along those same lines it's
[00:15:47] ambitious along those same lines it's very open-ended difficult task oriented
[00:15:50] very open-ended difficult task oriented dialogue corpus
[00:15:52] dialogue corpus dealer no deal is a forward-thinking
[00:15:54] dealer no deal is a forward-thinking negotiation corpus negotiation is a very
[00:15:56] negotiation corpus negotiation is a very interesting kind of slightly adversarial
[00:15:58] interesting kind of slightly adversarial social grounding uh craigslist bargain
[00:16:00] social grounding uh craigslist bargain is another data set that you might use
[00:16:02] is another data set that you might use in the context of negotiation agents
[00:16:05] in the context of negotiation agents and then alfred crosstalk and room to
[00:16:07] and then alfred crosstalk and room to room are all data sets that would allow
[00:16:09] room are all data sets that would allow you to combine grounded language
[00:16:11] you to combine grounded language understanding with problems relating to
[00:16:13] understanding with problems relating to computer vision in various ways and
[00:16:15] computer vision in various ways and again that kind of
[00:16:17] again that kind of interdisciplinary connection could be
[00:16:19] interdisciplinary connection could be crucial to making progress on truly
[00:16:21] crucial to making progress on truly grounded systems
Lecture 028
Speakers | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=-s5B_7_oeiU
---
Transcript
[00:00:04] Welcome, everyone. This is part two in our series on grounded language understanding. Our task for this unit is essentially a natural language generation task, and I've called these models speakers. The idea is that speakers go from the world, that is, from some non-linguistic thing that they're trying to communicate about, into language. Those are really the central agents that we'll be exploring.
[00:00:24] To ground all this, we're going to have a simple task. I'm going to start with the most basic version of the task that we'll ultimately tackle in our assignment and bake-off, and that is color reference. These are examples taken from a corpus that was originally collected by Randall Munroe of xkcd fame and processed into an NLP task by McMahan and Stone 2015. It's a simple formulation in that the state of the world we want to communicate about is a color patch, and the task is simply to produce descriptions of those color patches. I've given some examples here, and you can see that they range from simple one-word descriptions all the way up to things that are kind of complicated, both cognitively and linguistically. I think they point to the idea that even though this is a simple and constrained domain, it's a pretty cognitively and linguistically interesting one.
[00:01:11] So our speakers, at least our baseline speakers, are standard versions of encoder-decoder models. For this initial formulation we're going to have a very simple encoder. The task of the encoder is simply to take a color representation, which is going to be a list of floats, embed it in some embedding space, and then learn some hidden representation for that color. That's all that needs to happen, so it's just one step.
[00:01:36] The decoder is where the speaking part happens. The initial token produced by the decoder, by the speaker, is always the start token, which is looked up in an embedding space. Then we get our first decoder hidden state, which is created from the color representation, serving as the initial hidden state in the sequence we're going to build, together with that embedding. Both of those have weight transformations, and it's an additive combination of them that delivers this value h1 here. Then we use some softmax parameters to make a prediction about the next token; here we've predicted "dark". We get our error signal by comparing that prediction with the actual token that occurred in our training data; in this case it was the word "light". Since we've made a wrong prediction, we're going to get a substantive error signal that will, we hope, update the weight parameters throughout this model in a way that leads it to produce better generations the next time.
[00:02:27] generations the next time in a little more detail just as a
[00:02:29] in a little more detail just as a reminder so we have an embedding for
[00:02:30] reminder so we have an embedding for that start token indeed for all tokens
[00:02:33] that start token indeed for all tokens the hidden state is derived from the
[00:02:35] the hidden state is derived from the embedding via a weight transformation
[00:02:38] embedding via a weight transformation and the color representation which is
[00:02:40] and the color representation which is state h0 and the recurrence that we're
[00:02:42] state h0 and the recurrence that we're building and that too has a
[00:02:43] building and that too has a transformation applied to it to travel
[00:02:45] transformation applied to it to travel through the hidden layer
[00:02:47] through the hidden layer that gives us the state h1 and then we
[00:02:49] that gives us the state h1 and then we have softmax parameters on top of that
[00:02:51] have softmax parameters on top of that h1 that make a prediction
[00:02:53] h1 that make a prediction uh the prediction that they make is a
[00:02:55] uh the prediction that they make is a prediction over the entire vocabulary
[00:02:58] prediction over the entire vocabulary and the probability of the actual token
[00:03:00] and the probability of the actual token gives us our error signal so the
[00:03:01] gives us our error signal so the probability of light is the error signal
[00:03:03] probability of light is the error signal that we'll use here to update the model
[00:03:05] that we'll use here to update the model parameters
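The single decoder step just described (embedding lookup, additive combination of the transformed embedding and the transformed previous hidden state, softmax over the vocabulary, error signal from the probability of the actual token) can be sketched in plain Python. Everything here is a stand-in: the toy vocabulary, the dimensions, and the randomly initialized weight matrices are hypothetical, not the course's actual PyTorch model.

```python
import math
import random

random.seed(0)

VOCAB = ["<s>", "</s>", "light", "dark", "blue", "green"]
EMB_DIM, HID_DIM = 4, 4

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical, untrained parameters.
E = rand_matrix(len(VOCAB), EMB_DIM)      # token embeddings
W_xh = rand_matrix(HID_DIM, EMB_DIM)      # embedding -> hidden
W_hh = rand_matrix(HID_DIM, HID_DIM)      # previous hidden -> hidden
W_hy = rand_matrix(len(VOCAB), HID_DIM)   # hidden -> vocabulary scores

# h0: the encoder's hidden representation of the color (a stand-in vector).
h0 = [0.2, -0.1, 0.05, 0.3]

# One decoder step: additive combination of the transformed start-token
# embedding and the transformed previous hidden state gives h1.
x_start = E[VOCAB.index("<s>")]
h1 = [math.tanh(a + b) for a, b in zip(matvec(W_xh, x_start), matvec(W_hh, h0))]

# Softmax parameters give a distribution over the whole vocabulary; the
# error signal is the negative log probability of the actual token "light".
probs = softmax(matvec(W_hy, h1))
loss = -math.log(probs[VOCAB.index("light")])
```

With random weights the distribution is near-uniform, so the loss is around log of the vocabulary size; training would push probability mass toward the observed tokens.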
[00:03:07] parameters and then we begin with the next time
[00:03:09] and then we begin with the next time step i've called this teacher forcing
[00:03:11] step i've called this teacher forcing because in the standard mode which is
[00:03:12] because in the standard mode which is the teacher forcing mode even though we
[00:03:14] the teacher forcing mode even though we predicted dark at time step one we're
[00:03:17] predicted dark at time step one we're going to have as our second token the
[00:03:19] going to have as our second token the seek the the token light
[00:03:21] seek the the token light which is the actual token in the
[00:03:22] which is the actual token in the underlying training data when we proceed
[00:03:24] underlying training data when we proceed as though we did not make a mistake so
[00:03:26] as though we did not make a mistake so again we do an embedding lookup we get
[00:03:28] again we do an embedding lookup we get our second hidden state for the decoder
[00:03:30] our second hidden state for the decoder as a combination of the embedding x37
[00:03:33] as a combination of the embedding x37 and the previous hidden state and we
[00:03:35] and the previous hidden state and we make another prediction and in this case
[00:03:36] make another prediction and in this case our prediction is blue and that's the
[00:03:38] our prediction is blue and that's the actual token and life is good for a
[00:03:40] actual token and life is good for a little bit and then we proceed with the
[00:03:41] little bit and then we proceed with the third time step
[00:03:43] third time step the actual token is blue
[00:03:45] the actual token is blue h3
[00:03:46] h3 and we predict green and in this case we
[00:03:48] and we predict green and in this case we should have predict the stop token which
[00:03:50] should have predict the stop token which would cause us to stop processing the
[00:03:52] would cause us to stop processing the sequence we're just going to get an
[00:03:53] sequence we're just going to get an error signal as we standardly would and
[00:03:55] error signal as we standardly would and propagate that back down through the
[00:03:57] propagate that back down through the model in hopes that the next time when
[00:03:59] model in hopes that the next time when we want to stop we'll actually produce
[00:04:00] we want to stop we'll actually produce this stop token that i've given up here
[00:04:05] At prediction time, of course, the sequence is not given. That doesn't change the encoder, because the color representation is part of the model inputs, but then we have to decode, and just describe, without any feedback. So we proceed as we did before: we predict "dark" here, and then "dark" has to become the token at the next time step, because we don't know what the ground truth is. We proceed as before and say "blue", and that becomes the third time step, and with luck, in that third position, we predict the stop token and the decoding process is completed.
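At prediction time the model's own argmax prediction is fed back in at each step until the stop token (or a length cap, a practical safeguard added here) is reached. `demo_step` is a hypothetical decoder step with canned outputs, just to make the greedy loop runnable.

```python
def greedy_decode(h0, step, max_len=20):
    """Greedy decoding: each predicted token becomes the next input."""
    h, token, output = h0, "<s>", []
    for _ in range(max_len):
        h, probs = step(h, token)
        token = max(probs, key=probs.get)  # argmax prediction
        if token == "</s>":
            break  # stop token ends the decoding process
        output.append(token)
    return output

def demo_step(h, token):
    # Stand-in decoder step with canned behavior: "<s>" -> "dark" ->
    # "blue" -> "</s>", mimicking the example in the lecture.
    nxt = {"<s>": "dark", "dark": "blue", "blue": "</s>"}[token]
    probs = {w: 0.1 for w in ["dark", "blue", "</s>"]}
    probs[nxt] = 0.7
    return h, probs

description = greedy_decode([0.0], demo_step)  # ["dark", "blue"]
```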
[00:04:36] the decoding process is completed that is the fundamental model
[00:04:38] that is the fundamental model even though it's simple it admits of
[00:04:40] even though it's simple it admits of many interesting modifications let me
[00:04:42] many interesting modifications let me just mention a few of them
[00:04:43] just mention a few of them first the encoder and the decoder of
[00:04:45] first the encoder and the decoder of course could have many more hidden
[00:04:46] course could have many more hidden layers mine just had one but they could
[00:04:49] layers mine just had one but they could be very deep networks we would expect
[00:04:51] be very deep networks we would expect that the layer counts for the encoder
[00:04:53] that the layer counts for the encoder and the decoder and match so that you
[00:04:55] and the decoder and match so that you have this even handoff from encoder to
[00:04:57] have this even handoff from encoder to decoder across all the hidden layers but
[00:04:59] decoder across all the hidden layers but even that's not a hard constraint i can
[00:05:01] even that's not a hard constraint i can imagine that some cooling or copying
[00:05:03] imagine that some cooling or copying could accommodate different numbers of
[00:05:04] could accommodate different numbers of layers in these two components
[00:05:08] layers in these two components it's very common at present for
[00:05:09] it's very common at present for researchers to tie the embedding and
[00:05:11] researchers to tie the embedding and classifier parameters right the
[00:05:13] classifier parameters right the embedding gives us a representation for
[00:05:14] embedding gives us a representation for every vocabulary item and the transpose
[00:05:17] every vocabulary item and the transpose of that can serve as the set of
[00:05:19] of that can serve as the set of parameters for our softmax classifier
[00:05:22] parameters for our softmax classifier when we predict tokens
[00:05:24] when we predict tokens and tying those weights seems to be very
[00:05:26] and tying those weights seems to be very productive in terms of optimization
[00:05:28] productive in terms of optimization effectiveness so you might consider that
[00:05:31] effectiveness so you might consider that and finally during training we might
[00:05:33] and finally during training we might drop that teacher forcing assumption
[00:05:35] drop that teacher forcing assumption which would mean that in a small
[00:05:36] which would mean that in a small percentage of cases we would allow the
[00:05:38] percentage of cases we would allow the model to just proceed as though its
[00:05:40] model to just proceed as though its predicted token was the correct token
[00:05:42] predicted token was the correct token for the next time step even if that was
[00:05:44] for the next time step even if that was a faulty assumption obvious on the idea
[00:05:47] a faulty assumption obvious on the idea that that might help the model explore a
[00:05:49] that that might help the model explore a wider range of the space and inject its
[00:05:52] wider range of the space and inject its um generations with some helpful
[00:05:53] um generations with some helpful diversity
[00:05:55] diversity and then there's one other modification
[00:05:56] and then there's one other modification that i want to mention because you'll
[00:05:57] that i want to mention because you'll see this as part of the homework and the
[00:05:59] see this as part of the homework and the system that you're developing so we
[00:06:01] system that you're developing so we found that in monroidal 2016 it was
[00:06:04] found that in monroidal 2016 it was helpful to kind of remind the decoder at
[00:06:06] helpful to kind of remind the decoder at each one of its time steps about what it
[00:06:08] each one of its time steps about what it was trying to describe so in more detail
[00:06:10] was trying to describe so in more detail we had hsv color representations as our
[00:06:13] we had hsv color representations as our inputs we did a fourier transform to get
[00:06:15] inputs we did a fourier transform to get an embedding and that was processed into
[00:06:17] an embedding and that was processed into a hidden state and then during decoding
[00:06:20] a hidden state and then during decoding we appended to each one of the
[00:06:21] we appended to each one of the embeddings the fourier transformation
[00:06:24] embeddings the fourier transformation representation of the color as a kind of
[00:06:26] representation of the color as a kind of informal reminder at each time step
[00:06:28] informal reminder at each time step about what the input was actually like
[00:06:30] about what the input was actually like on the assumption that for long
[00:06:31] on the assumption that for long sequences when we get all the way down
[00:06:33] sequences when we get all the way down to the end the model might have a hazy
[00:06:35] to the end the model might have a hazy memory of what it's trying to describe
[00:06:37] memory of what it's trying to describe and this functions as a kind of reminder
[00:06:39] and this functions as a kind of reminder at that point
[00:06:40] at that point and that proved to be very effective and
[00:06:42] and that proved to be very effective and i'll encourage you to explore that in
[00:06:44] i'll encourage you to explore that in the homework
[00:06:46] the homework and then i hope you can see that even
[00:06:47] and then i hope you can see that even though this task formulation is simple
[00:06:49] though this task formulation is simple it's an instance of a wide range of
[00:06:51] it's an instance of a wide range of tasks that we might explore under the
[00:06:53] tasks that we might explore under the heading of grounding after all for
[00:06:54] heading of grounding after all for grounding in this sense we just need
[00:06:56] grounding in this sense we just need some non-linguistic representation
[00:06:58] some non-linguistic representation coming in and the ideas that will
[00:07:00] coming in and the ideas that will generate language in response to that
[00:07:02] generate language in response to that input
[00:07:03] input so image capturing is an instance of
[00:07:04] so image capturing is an instance of this
[00:07:05] this scene description of course is another
[00:07:07] scene description of course is another instance visual question answering is a
[00:07:09] instance visual question answering is a slight modification where the input is
[00:07:11] slight modification where the input is not just an image but also a question
[00:07:13] not just an image but also a question text and the idea is that you want to
[00:07:15] text and the idea is that you want to produce an answer to that question
[00:07:17] produce an answer to that question relative to the image input
[00:07:19] relative to the image input and then instruction giving would be a
[00:07:20] and then instruction giving would be a more general form where the input is
[00:07:22] more general form where the input is some kind of state description and the
[00:07:24] some kind of state description and the idea is that we want to offer a
[00:07:25] idea is that we want to offer a complicated instruction on that basis
[00:07:27] complicated instruction on that basis and i think we can think of many others
[00:07:29] and i think we can think of many others that would fit into this mold and
[00:07:31] that would fit into this mold and benefit not only from the encoder
[00:07:33] benefit not only from the encoder decoder architecture but also from
[00:07:34] decoder architecture but also from conceptualization explicitly
[00:07:37] conceptualization explicitly as grounded natural language generation
[00:07:39] as grounded natural language generation tasks
Lecture 029
Listeners | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=xrsc0IOLFSY
---
Transcript
[00:00:05] welcome everyone to part three in our
[00:00:06] welcome everyone to part three in our series on grounded language
[00:00:07] series on grounded language understanding recall that in part two we
[00:00:10] understanding recall that in part two we focused on speakers speakers in our
[00:00:12] focused on speakers speakers in our sense take non-linguistic
[00:00:13] sense take non-linguistic representations as inputs and generate
[00:00:16] representations as inputs and generate language on that basis
[00:00:18] language on that basis listeners are kind of the converse of
[00:00:19] listeners are kind of the converse of that they accept linguistic inputs and
[00:00:22] that they accept linguistic inputs and try to make a guess about the state of
[00:00:24] try to make a guess about the state of the world on the basis of that
[00:00:25] the world on the basis of that linguistic input
[00:00:27] linguistic input for this unit in terms of modeling
[00:00:29] for this unit in terms of modeling our focus is going to be on speakers but
[00:00:31] our focus is going to be on speakers but i think it's helpful to have the
[00:00:32] i think it's helpful to have the listener perspective in mind as you
[00:00:34] listener perspective in mind as you create speakers and you might even bring
[00:00:36] create speakers and you might even bring in the listener perspective as part of
[00:00:38] in the listener perspective as part of your original system and i'll cover some
[00:00:41] your original system and i'll cover some techniques for doing that in the context
[00:00:42] techniques for doing that in the context of the rational speech x model a bit
[00:00:44] of the rational speech x model a bit later in this series
[00:02:47] later in this series now to make the listener task meaningful
[00:02:49] now to make the listener task meaningful we need to complicate our previous task
[00:00:51] we need to complicate our previous task a little bit so in part two we had for
[00:00:54] a little bit so in part two we had for the speaker just a single color as input
[00:00:56] the speaker just a single color as input and their task was to produce a
[00:00:58] and their task was to produce a description on that basis
[00:01:00] description on that basis for listeners we're going to move to a
[00:01:01] for listeners we're going to move to a more complicated task and this is the
[00:01:03] more complicated task and this is the task that's our focus for the entire
[00:01:05] task that's our focus for the entire unit it comes from the stanford colors
[00:01:07] unit it comes from the stanford colors and context corpus and for that corpus a
[00:01:10] and context corpus and for that corpus a context is not just a single color
[00:01:12] context is not just a single color representation but now three colors
[00:01:14] representation but now three colors and the idea is that the speaker is
[00:01:16] and the idea is that the speaker is privately told which of those three is
[00:01:18] privately told which of those three is their target and they produce a
[00:01:20] their target and they produce a description that will hopefully
[00:01:22] description that will hopefully communicate to a listener who's looking
[00:01:24] communicate to a listener who's looking at those same three colors which one was
[00:01:26] at those same three colors which one was the speaker's target
[00:01:28] the speaker's target you can see that that gets really
[00:01:29] you can see that that gets really interesting and grounded very quickly so
[00:01:31] interesting and grounded very quickly so in this first case the three colors are
[00:01:33] in this first case the three colors are very different and the speaker simply
[00:01:35] very different and the speaker simply said blue and that seems to get the job
[00:01:37] said blue and that seems to get the job done and i think a listener receiving
[00:01:39] done and i think a listener receiving blue as input would know which of these
[00:01:41] blue as input would know which of these three colors was the speaker's private
[00:01:43] three colors was the speaker's private target
[00:01:44] target when we moved to the second context
[00:01:45] when we moved to the second context though we have two competing blues
[00:01:47] though we have two competing blues they're very similar and as a result the
[00:01:49] they're very similar and as a result the speaker said the darker blue one and the
[00:01:52] speaker said the darker blue one and the idea is that this comparative here
[00:01:54] idea is that this comparative here darker blue is making implicit reference
[00:01:57] darker blue is making implicit reference not only to the target but to at
[00:01:59] not only to the target but to at least one of the two distractors
[00:02:02] least one of the two distractors the third example is similar teal not the
[00:02:04] the third example is similar teal not the two that are more green that's really
[00:02:06] two that are more green that's really grounded in the full context here the
[00:02:07] grounded in the full context here the speaker is not only identifying
[00:02:09] speaker is not only identifying properties of the target but also
[00:02:11] properties of the target but also properties of the distractor in order to
[00:02:13] properties of the distractor in order to draw out contrasts
[00:02:15] draw out contrasts and i think the final two examples here
[00:02:17] and i think the final two examples here are interesting uh in in different ways
[00:02:20] are interesting uh in in different ways so here we have
[00:02:21] so here we have the target on the left in the first
[00:02:23] the target on the left in the first example the speaker said purple and in
[00:02:25] example the speaker said purple and in the second example the speaker said blue
[00:02:28] the second example the speaker said blue even though these are identical colors
[00:02:30] even though these are identical colors here for the targets the reason we saw
[00:02:32] here for the targets the reason we saw variation is because the distractors are
[00:02:35] variation is because the distractors are so different and that just shows you
[00:02:37] so different and that just shows you that even though this is a simple task
[00:02:38] that even though this is a simple task it is meaningfully grounded in the full
[00:02:41] it is meaningfully grounded in the full context that we're talking about
[00:02:44] context that we're talking about now what we'll do for our listeners is
[00:02:45] now what we'll do for our listeners is to essentially give them these
[00:02:46] to essentially give them these utterances as inputs and have them
[00:02:48] utterances as inputs and have them function as classifiers making a guess
[00:02:50] function as classifiers making a guess about which of the three colors is the
[00:02:53] about which of the three colors is the most likely one that the speaker was
[00:02:55] most likely one that the speaker was trying to refer to
[00:02:57] trying to refer to so in a little more detail here's the
[00:02:59] so in a little more detail here's the neural listener model it's again an
[00:03:01] neural listener model it's again an encoder decoder architecture for the
[00:03:03] encoder decoder architecture for the encoder side we can imagine some
[00:03:05] encoder side we can imagine some recurrent neural network or something
[00:03:06] recurrent neural network or something that is going to consume a sequence of
[00:03:08] that is going to consume a sequence of tokens
[00:03:09] tokens look them up in an embedding space and
[00:03:11] look them up in an embedding space and then have some sequence of hidden states
[00:03:14] then have some sequence of hidden states for the decoder the handoff happens for
[00:03:16] for the decoder the handoff happens for the final encoder state presumably and
[00:03:19] the final encoder state presumably and what we're going to do here is extract
[00:03:21] what we're going to do here is extract some statistics in this case a mean and
[00:03:23] some statistics in this case a mean and a covariance matrix and use those for
[00:03:26] a covariance matrix and use those for scoring so in a little more detail
[00:03:28] scoring so in a little more detail we have those three colors as given for
[00:03:30] we have those three colors as given for the listener those are represented down
[00:03:33] the listener those are represented down here we'll embed those in some color
[00:03:35] here we'll embed those in some color space we can use the fourier transform
[00:03:37] space we can use the fourier transform just like we did for the speakers at the
[00:03:39] just like we did for the speakers at the end of the previous screencast
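The fourier-transform color featurization mentioned here can be sketched roughly as follows. The exact basis used in the course code may differ, so treat the feature definition below (cosine and sine terms over index triples for the three hsv channels) as an illustrative assumption rather than the course's implementation:

```python
import numpy as np

def fourier_color_features(hsv, order=3):
    """Illustrative Fourier featurization of an HSV color whose channels
    are scaled to [0, 1]: cos and sin of 2*pi*(j*h + k*s + l*v) for all
    index triples j, k, l in {0, ..., order-1}. The exact basis used in
    the course code may differ; this sketches the general idea."""
    h, s, v = hsv
    feats = []
    for j in range(order):
        for k in range(order):
            for l in range(order):
                angle = 2 * np.pi * (j * h + k * s + l * v)
                feats.append(np.cos(angle))
                feats.append(np.sin(angle))
    return np.array(feats)

# nearby colors map to nearby feature vectors,
# which is what makes the representation useful for scoring:
a = fourier_color_features((0.60, 0.80, 0.70))
b = fourier_color_features((0.61, 0.80, 0.70))  # slightly different hue
c = fourier_color_features((0.10, 0.90, 0.90))  # very different color
```

With order=3 this yields a 54-dimensional vector (27 index triples, each contributing a cosine and a sine term); the smooth periodic basis is what lets small perceptual differences between colors show up as small feature differences.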
[00:03:41] end of the previous screencast and then we'll use those extracted
[00:03:43] and then we'll use those extracted statistics from the encoder to create a
[00:03:45] statistics from the encoder to create a scoring function
[00:03:46] scoring function and then we just need to define a
[00:03:47] and then we just need to define a softmax classifier on top of those
[00:03:49] softmax classifier on top of those scores and it will be that module
[00:03:52] scores and it will be that module that makes a guess based on this encoder
[00:03:54] that makes a guess based on this encoder representation about which of the three
[00:03:57] representation about which of the three colors the speaker was referring to so
[00:03:59] colors the speaker was referring to so fundamentally a kind of classification
[00:04:01] fundamentally a kind of classification decision in this continuous space of
[00:04:03] decision in this continuous space of colors and encoder representations
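The scoring-and-classification step just described can be sketched in a few lines. Here `mu` and `sigma` stand in for the statistics read off the utterance encoder's final state; in a real system they would be produced by learned layers, so the quadratic scoring form below is an illustrative assumption, not the exact course implementation:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def listener_probs(mu, sigma, color_feats):
    """Score each candidate color representation against statistics
    (a mean vector mu and a matrix sigma) extracted from the utterance
    encoder, then normalize the scores with a softmax. The quadratic
    score -(f - mu)^T sigma (f - mu) is an illustrative choice; a real
    listener would learn mu and sigma from the encoder's hidden states."""
    scores = []
    for f in color_feats:
        d = f - mu
        scores.append(-d @ sigma @ d)  # higher = closer to the described color
    return softmax(np.array(scores))

# toy example: the utterance encoder "describes" colors near (0.9, 0.1)
mu = np.array([0.9, 0.1])
sigma = np.eye(2)
colors = [np.array([0.85, 0.1]),   # close to the description
          np.array([0.1, 0.9]),    # distractor
          np.array([0.5, 0.5])]    # distractor
probs = listener_probs(mu, sigma, colors)
```

The listener's guess is then just the argmax of `probs` over the three candidate colors, which is the classification decision in the continuous space described above.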
[00:04:08] now once we start thinking in this mode
[00:04:10] now once we start thinking in this mode i think a lot of other tasks can be
[00:04:12] i think a lot of other tasks can be thought of as listener-based
[00:04:14] thought of as listener-based communication tasks so even the simplest
[00:04:17] communication tasks so even the simplest classifiers are listeners in our sense
[00:04:19] classifiers are listeners in our sense they consume language and they make an
[00:04:22] they consume language and they make an inference about the world usually in a
[00:04:23] inference about the world usually in a very structured space right so even like
[00:04:26] very structured space right so even like in the simple case of our sentiment
[00:04:27] in the simple case of our sentiment analysis you receive a linguistic input
[00:04:30] analysis you receive a linguistic input and you make a guess about whether the
[00:04:31] and you make a guess about whether the state is positive negative or neutral
[00:04:34] state is positive negative or neutral that's a common classifier but
[00:04:36] that's a common classifier but thinking of it as a communication task
[00:04:38] thinking of it as a communication task might bring new dimensions to the
[00:04:40] might bring new dimensions to the problem
[00:04:42] problem semantic parsers are also complex
[00:04:44] semantic parsers are also complex listeners they consume language they
[00:04:46] listeners they consume language they create a rich latent representation a
[00:04:48] create a rich latent representation a kind of logical form and then they
[00:04:50] kind of logical form and then they predict into some structured prediction
[00:04:52] predict into some structured prediction space like a database or something like
[00:04:54] space like a database or something like that
[00:04:56] that scene generation is clearly a kind of
[00:04:58] scene generation is clearly a kind of listener task in this task you map from
[00:05:01] listener task in this task you map from language to structured representations
[00:05:04] language to structured representations of visual scenes
[00:05:05] of visual scenes so it's a very complicated version of
[00:05:07] so it's a very complicated version of our simple color reference
[00:05:09] our simple color reference problem
[00:05:10] problem young et al explored the idea that we
[00:05:12] young et al explored the idea that we might learn visual denotations for
[00:05:14] might learn visual denotations for linguistic expressions mapping from
[00:05:16] linguistic expressions mapping from language into some highly structured
[00:05:18] language into some highly structured space similar to scene description
[00:05:22] space similar to scene description mei et al 2015 developed a sequence to
[00:05:24] mei et al 2015 developed a sequence to sequence model that's very much like the
[00:05:26] sequence model that's very much like the above but the idea is that instead of
[00:05:29] above but the idea is that instead of having
[00:05:29] having simple output spaces we have entire
[00:05:33] simple output spaces we have entire navigational instructions that we want
[00:05:35] navigational instructions that we want to give so that's going from a
[00:05:36] to give so that's going from a linguistic input into some kind of
[00:05:38] linguistic input into some kind of action sequence
[00:05:40] action sequence and finally the cerealbar data set is
[00:05:42] and finally the cerealbar data set is an interesting one to explore in our
[00:05:44] an interesting one to explore in our context
[00:05:45] context that was a task of learning to execute
[00:05:47] that was a task of learning to execute execute full instructions so that's
[00:05:49] execute full instructions so that's again mapping from pretty complicated
[00:05:51] again mapping from pretty complicated utterances into some embedded action
[00:05:54] utterances into some embedded action that you want to take in a game world
[00:05:56] that you want to take in a game world and that could be a very exciting
[00:05:58] and that could be a very exciting extension of what we've just been
[00:05:59] extension of what we've just been covering
Lecture 030
Varieties of contextual grounding | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=3CTttlN8l4o
---
Transcript
[00:00:04] welcome to part four in our series on
[00:00:06] welcome to part four in our series on grounded language understanding our
[00:00:08] grounded language understanding our topic is varieties of contextual
[00:00:09] topic is varieties of contextual grounding what i'd really like to do is
[00:00:11] grounding what i'd really like to do is make connections with additional tasks
[00:00:14] make connections with additional tasks as a way of drawing out what i think is
[00:00:16] as a way of drawing out what i think is one of the central insights behind the
[00:00:18] one of the central insights behind the work that we're doing which is that
[00:00:20] work that we're doing which is that speakers should try to be informative in
[00:00:23] speakers should try to be informative in context let me explain a bit more about
[00:00:25] context let me explain a bit more about what that means so our task is this task
[00:00:28] what that means so our task is this task of color reference in context the
[00:00:30] of color reference in context the speaker is given three color patches one
[00:00:33] speaker is given three color patches one of them designated the target and the
[00:00:34] of them designated the target and the speaker's task is to communicate which
[00:00:36] speaker's task is to communicate which of the three is the target to the
[00:00:38] of the three is the target to the listener who's in the same context but
[00:00:40] listener who's in the same context but of course doesn't know what the target
[00:00:42] of course doesn't know what the target is
[00:00:42] is and what i think you see running through
[00:00:44] and what i think you see running through the human data here is that speakers are
[00:00:46] the human data here is that speakers are striving to be informative in context in
[00:00:49] striving to be informative in context in this first case the speaker can just say
[00:00:50] this first case the speaker can just say blue because the contrasts are so clear
[00:00:53] blue because the contrasts are so clear but in the second case merely saying
[00:00:55] but in the second case merely saying blue would be really unhelpful it would
[00:00:56] blue would be really unhelpful it would be uninformative in the context because
[00:00:59] be uninformative in the context because there are these two blues and as a
[00:01:01] there are these two blues and as a result the speaker is pushed to do
[00:01:02] result the speaker is pushed to do something more interesting the darker
[00:01:04] something more interesting the darker blue one making implicit reference to
[00:01:06] blue one making implicit reference to the context in an effort to communicate
[00:01:09] the context in an effort to communicate effectively with the listener
[00:01:11] effectively with the listener and that communication aspect i think
[00:01:12] and that communication aspect i think can be so powerful and runs through lots
[00:01:15] can be so powerful and runs through lots of tasks both ones that explicitly
[00:01:17] of tasks both ones that explicitly involve communication and ones that
[00:01:19] involve communication and ones that involve a more general setting
[00:01:22] involve a more general setting one case of the latter is i think
[00:01:24] one case of the latter is i think discriminative image labeling which is
[00:01:25] discriminative image labeling which is tackled in this lovely paper mao et al
[00:01:27] tackled in this lovely paper mao et al 2016. the task here is given an image to
[00:01:30] 2016. the task here is given an image to label entities that are in those images
[00:01:33] label entities that are in those images and for many many contexts it would be a
[00:01:35] and for many many contexts it would be a shame if our goal was to label these two
[00:01:38] shame if our goal was to label these two entities here and we simply called them
[00:01:40] entities here and we simply called them both dog it's uninformative in the sense
[00:01:42] both dog it's uninformative in the sense that it doesn't distinguish the two
[00:01:44] that it doesn't distinguish the two entities in the context of this picture
[00:01:45] entities in the context of this picture what we might hope
[00:01:47] what we might hope is that we would get fuller descriptions
[00:01:49] is that we would get fuller descriptions like a little dog jumping and catching a
[00:01:51] like a little dog jumping and catching a frisbee and a big dog running fuller
[00:01:54] frisbee and a big dog running fuller descriptions in the sense that they
[00:01:55] descriptions in the sense that they provide more detail that distinguishes
[00:01:58] provide more detail that distinguishes the two dogs
[00:02:00] the two dogs and we could extend that to full image
[00:02:02] and we could extend that to full image captioning as well again given these
[00:02:04] captioning as well again given these three images it would be a shame if our
[00:02:06] three images it would be a shame if our image captioning system just labeled
[00:02:07] image captioning system just labeled them all dog
[00:02:09] them all dog we might have the intuition that we
[00:02:10] we might have the intuition that we would like the image captioning system
[00:02:12] would like the image captioning system to produce descriptions of these images
[00:02:15] to produce descriptions of these images that would help a listener
[00:02:17] that would help a listener figure out which image was being
[00:02:19] figure out which image was being described and we might have as a further
[00:02:21] described and we might have as a further goal for this image captioning system
[00:02:23] goal for this image captioning system that as we change the set of distractors
[00:02:26] that as we change the set of distractors it's sensitive to that and produces
[00:02:28] it's sensitive to that and produces different descriptions
[00:02:29] different descriptions trying to be informative relative to
[00:02:32] trying to be informative relative to these new contexts that we're creating
[00:02:33] these new contexts that we're creating amplifying some kinds of information and
[00:02:36] amplifying some kinds of information and leaving out other kinds of information
[00:02:37] leaving out other kinds of information to the extent that they would help the
[00:02:39] to the extent that they would help the listener achieve that task of figuring
[00:02:42] listener achieve that task of figuring out which image was being described
[00:02:46] machine translation is another area that
[00:02:48] machine translation is another area that might benefit from this notion of
[00:02:50] might benefit from this notion of informativity in context this was
[00:02:52] informativity in context this was explored in a lovely paper by reuben
[00:02:54] explored in a lovely paper by reuben cohn-gordon and noah goodman in 2019
[00:02:57] cohn-gordon and noah goodman in 2019 so let's say our task is to go from
[00:02:59] so let's say our task is to go from english to french
[00:03:00] english to french reuben and noah just observed that at
[00:03:02] reuben and noah just observed that at the time these two english inputs she
[00:03:05] the time these two english inputs she chopped up the tree and she chopped down
[00:03:07] chopped up the tree and she chopped down the tree were both mapped to the same
[00:03:09] the tree were both mapped to the same french translation
[00:03:11] french translation which is a shame given how different
[00:03:13] which is a shame given how different those two english inputs are in terms of
[00:03:15] those two english inputs are in terms of their meanings what we would like is to
[00:03:17] their meanings what we would like is to have the english inputs mapped to
[00:03:19] have the english inputs mapped to different french sentences
[00:03:21] different french sentences and their intuition about how to achieve
[00:03:23] and their intuition about how to achieve that would be to achieve some kind of
[00:03:25] that would be to achieve some kind of invariance so that given the translation
[00:03:28] invariance so that given the translation from english to french we should be able
[00:03:30] from english to french we should be able to do the reverse figure out from the
[00:03:32] to do the reverse figure out from the french
[00:03:33] french which underlying english state was being
[00:03:36] which underlying english state was being quote referred to in this context so
[00:03:38] quote referred to in this context so it's language on both sides but it's
[00:03:40] it's language on both sides but it's drawing on this idea that we want
[00:03:42] drawing on this idea that we want translations that are informative in the
[00:03:44] translations that are informative in the sense that they would help someone
[00:03:46] sense that they would help someone figure out what the original system
[00:03:48] figure out what the original system input was
[00:03:49] input was same guiding idea drawing on this
[00:03:51] same guiding idea drawing on this metaphor of communication but now to
[00:03:53] metaphor of communication but now to achieve good translations
[00:03:56] achieve good translations and in other domains it's just very
[00:03:58] and in other domains it's just very intuitive to think about informativity
[00:03:59] intuitive to think about informativity in context so daniel fried et al they
[00:04:01] in context so daniel fried et al they have a lovely paper exploring how to
[00:04:03] have a lovely paper exploring how to give navigational instructions drawing
[00:04:05] give navigational instructions drawing on pragmatic ideas like informativity
[00:04:08] on pragmatic ideas like informativity and context
[00:04:09] and context and for example they have both speaker
[00:04:10] and for example they have both speaker and listener agents and they observe
[00:04:12] and listener agents and they observe that the base speaker is true but
[00:04:14] that the base speaker is true but uninformative whereas their rational
[00:04:16] uninformative whereas their rational speech speaker which brings in pragmatic
[00:04:18] speech speaker which brings in pragmatic ideas
[00:04:19] ideas is more sensitive to the kinds of
[00:04:21] is more sensitive to the kinds of information that a listener would need
[00:04:23] information that a listener would need to follow an instruction and the same
[00:04:26] to follow an instruction and the same thing is true on the listener side the
[00:04:27] thing is true on the listener side the base listener is unsure how to proceed
[00:04:29] base listener is unsure how to proceed but the rational listener was able
[00:04:32] but the rational listener was able to infer that since this instruction
[00:04:33] to infer that since this instruction didn't mention this couch over here it
[00:04:35] didn't mention this couch over here it was probably not relevant to the
[00:04:37] was probably not relevant to the instruction and therefore this listener
[00:04:39] instruction and therefore this listener stops at this point in interpreting the
[00:04:41] stops at this point in interpreting the navigational instructions
[00:04:44] navigational instructions and stefanie tellex and colleagues have
[00:04:45] and stefanie tellex and colleagues have explored this idea in the context of
[00:04:47] explored this idea in the context of human robot interaction they've called
[00:04:49] human robot interaction they've called their central mechanism
[00:04:50] their central mechanism inverse semantics and this is again just
[00:04:52] inverse semantics and this is again just the intuition that a robot producing
[00:04:55] the intuition that a robot producing language ought to produce language that
[00:04:57] language ought to produce language that reduces ambiguity for the human listener
[00:05:00] reduces ambiguity for the human listener in this context here where the robot is
[00:05:02] in this context here where the robot is trying to get help from the human it
[00:05:03] trying to get help from the human it shouldn't just say help me the human
[00:05:05] shouldn't just say help me the human won't know how to take action but it
[00:05:07] won't know how to take action but it also shouldn't do something simple like
[00:05:09] also shouldn't do something simple like hand me the leg the robot should be
[00:05:11] hand me the leg the robot should be sensitive to the fact that there are
[00:05:12] sensitive to the fact that there are multiple table legs in this context and
[00:05:15] multiple table legs in this context and the robot needs to ensure that the human
[00:05:18] the robot needs to ensure that the human listener is not faced with an
[00:05:19] listener is not faced with an insurmountable ambiguity and that would
[00:05:22] insurmountable ambiguity and that would therefore push this robot in being aware
[00:05:25] therefore push this robot in being aware of the listener state to produce
[00:05:27] of the listener state to produce descriptions that were more like hand me
[00:05:28] descriptions that were more like hand me the white leg on the table fully
[00:05:30] the white leg on the table fully disambiguating from the perspective of
[00:05:33] disambiguating from the perspective of the listener
[00:05:35] the listener and i'd like to push this idea of
[00:05:36] and i'd like to push this idea of informativity in context even further
[00:05:38] informativity in context even further by connecting with one of the classic
[00:05:40] by connecting with one of the classic tasks in machine learning which is
[00:05:42] tasks in machine learning which is optical character recognition even this
[00:05:44] optical character recognition even this task i believe can benefit from notions
[00:05:46] task i believe can benefit from notions of contrast and informativity in context
[00:05:50] of contrast and informativity in context on the left i have four digits and you
[00:05:52] on the left i have four digits and you can see that this is a speaker who puts
[00:05:53] can see that this is a speaker who puts little hooks at the top of their ones
[00:05:55] little hooks at the top of their ones and slashes through their sevens and
[00:05:57] and slashes through their sevens and those two pieces of information would
[00:05:59] those two pieces of information would help us disambiguate the final digit and
[00:06:01] help us disambiguate the final digit and infer that it was a one
[00:06:03] infer that it was a one on the right here we're pushed in a very
[00:06:05] on the right here we're pushed in a very different direction this is a speaker
[00:06:06] different direction this is a speaker who does not put hooks on the top of their
[00:06:08] who does not put hooks on the top of their ones or slashes through their sevens and
[00:06:11] ones or slashes through their sevens and that would lead us to think that this
[00:06:12] that would lead us to think that this final digit here is a seven notice that
[00:06:15] final digit here is a seven notice that in terms of what's actually on the page
[00:06:17] in terms of what's actually on the page these two digits are identical
[00:06:19] these two digits are identical but the context is what's leading us in
[00:06:21] but the context is what's leading us in very different directions we can assume
[00:06:23] very different directions we can assume that at some fundamental level the speaker
[00:06:25] that at some fundamental level the speaker is going to be informative in the sense
[00:06:27] is going to be informative in the sense that they're going to write in ways that
[00:06:28] that they're going to write in ways that are consistent and draw intended
[00:06:30] are consistent and draw intended contrast between their digits and that's
[00:06:32] contrast between their digits and that's what guides us toward what are
[00:06:33] what guides us toward what are ultimately the correct classification
[00:06:35] ultimately the correct classification decisions even for this apparently
[00:06:37] decisions even for this apparently mechanical seeming environment
Lecture 031
The Rational Speech Acts Model | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=pkT0g7utr70
---
Transcript
[00:00:04] hello everyone welcome to part five in
[00:00:06] hello everyone welcome to part five in our series on grounded language
[00:00:08] our series on grounded language understanding we're going to be talking
[00:00:09] understanding we're going to be talking about the rational speech acts model or rsa
[00:00:12] about the rational speech acts model or rsa this is an exciting model that was
[00:00:13] this is an exciting model that was developed by stanford researchers mike
[00:00:15] developed by stanford researchers mike frank and noah goodman and it's a chance
[00:00:17] frank and noah goodman and it's a chance for us to connect ideas from cognitive
[00:00:19] for us to connect ideas from cognitive psychology and linguistics with
[00:00:21] psychology and linguistics with large scale problems in machine learning
[00:00:24] large scale problems in machine learning now what i'm going to do for this
[00:00:25] now what i'm going to do for this screencast is kind of cue up the high
[00:00:27] screencast is kind of cue up the high level concepts and the core model
[00:00:29] level concepts and the core model structure as a way of leading into the
[00:00:32] structure as a way of leading into the next screencast which is going to show
[00:00:33] next screencast which is going to show you how to incorporate pieces of this
[00:00:35] you how to incorporate pieces of this model into standard machine learning
[00:00:37] model into standard machine learning models
[00:00:38] models if you would like a deeper dive on the
[00:00:40] if you would like a deeper dive on the conceptual origins of this model and how
[00:00:43] conceptual origins of this model and how it works in a kind of mathematical way i
[00:00:45] it works in a kind of mathematical way i would encourage you to check out these
[00:00:46] would encourage you to check out these resources here so this first paper
[00:00:48] resources here so this first paper goodman and frank from the developers of
[00:00:50] goodman and frank from the developers of rsa is a nice overview that shows not
[00:00:52] rsa is a nice overview that shows not only all the technical model details
[00:00:54] only all the technical model details with real rigor but also connects the
[00:00:56] with real rigor but also connects the ideas with decision theory game theory
[00:01:00] ideas with decision theory game theory cognitive psychology and bayesian
[00:01:02] cognitive psychology and bayesian cognitive science and also linguistics
[00:01:05] cognitive science and also linguistics from there you could watch this
[00:01:06] from there you could watch this technical screencast that i did this is
[00:01:08] technical screencast that i did this is on youtube and here are the associated
[00:01:10] on youtube and here are the associated slides for that if you want to follow
[00:01:11] slides for that if you want to follow along and from there i have this python
[00:01:13] along and from there i have this python reference implementation of the core rsa
[00:01:16] reference implementation of the core rsa model and that would be a great way to
[00:01:17] model and that would be a great way to get hands-on with the model and begin to
[00:01:19] get hands-on with the model and begin to think about how you could incorporate it
[00:01:21] think about how you could incorporate it into your own project or original system
[00:01:25] into your own project or original system without further ado though let's dive
[00:01:27] without further ado though let's dive into the model and i'm going to begin
[00:01:28] into the model and i'm going to begin with what i've called pragmatic
[00:01:29] with what i've called pragmatic listeners and we can also as you'll see
[00:01:31] listeners and we can also as you'll see later take a speaker perspective
[00:01:34] later take a speaker perspective so the model begins with what's called a
[00:01:35] so the model begins with what's called a literal listener this is a probabilistic
[00:01:38] literal listener this is a probabilistic agent and you can see that it conditions
[00:01:40] agent and you can see that it conditions on a message that is it hears or
[00:01:42] on a message that is it hears or observes a message and makes a guess
[00:01:44] observes a message and makes a guess about the state of the world on that
[00:01:45] about the state of the world on that basis
[00:01:47] basis and the way it does that is by reasoning
[00:01:48] and the way it does that is by reasoning essentially entirely about the truth
[00:01:50] essentially entirely about the truth conditions of the language here i've got
[00:01:52] conditions of the language here i've got these double brackets indicating that we
[00:01:54] these double brackets indicating that we have a semantic lexicon mapping words
[00:01:57] have a semantic lexicon mapping words and phrases to their truth values
[00:01:59] and phrases to their truth values uh this agent also takes the prior into
[00:02:01] uh this agent also takes the prior into account but that's the only way in which
[00:02:03] account but that's the only way in which it's pragmatic otherwise it's kind of a
[00:02:05] it's pragmatic otherwise it's kind of a fundamentally semantic agent
[00:02:08] fundamentally semantic agent from there we build the pragmatic
[00:02:10] from there we build the pragmatic speaker
[00:02:11] speaker speakers in this model observe states of
[00:02:13] speakers in this model observe states of the world things they want to
[00:02:14] the world things they want to communicate about and then they choose
[00:02:16] communicate about and then they choose messages on that basis
[00:02:18] messages on that basis and the core thing to observe here is
[00:02:19] and the core thing to observe here is that the pragmatic speaker reasons not
[00:02:22] that the pragmatic speaker reasons not about the semantics of the language as
[00:02:24] about the semantics of the language as the literal listener does but rather
[00:02:26] the literal listener does but rather about the literal listener who reasons
[00:02:28] about the literal listener who reasons about the semantics of the language and
[00:02:30] about the semantics of the language and for this pragmatic speaker here it does
[00:02:32] for this pragmatic speaker here it does that taking costs of messages into
[00:02:34] that taking costs of messages into account
[00:02:35] account and it also has this temperature
[00:02:36] and it also has this temperature parameter alpha which will help us
[00:02:38] parameter alpha which will help us control how aggressively it reasons
[00:02:40] control how aggressively it reasons about this lower agent the literal
[00:02:42] about this lower agent the literal listener
[00:02:43] listener other than that you can probably see
[00:02:44] other than that you can probably see that this model is a kind of soft max
[00:02:46] that this model is a kind of soft max decision rule uh where we're combining
[00:02:49] decision rule uh where we're combining the literal listener with message costs
[00:02:53] the literal listener with message costs and then finally we have the pragmatic
[00:02:55] and then finally we have the pragmatic listener which has essentially the same
[00:02:56] listener which has essentially the same form as the literal listener it observes
[00:02:58] form as the literal listener it observes a message and makes a guess about the
[00:03:00] a message and makes a guess about the state of the world on that basis
[00:03:02] state of the world on that basis and it has the same overall form as
[00:03:04] and it has the same overall form as literal listener except it's reasoning
[00:03:05] literal listener except it's reasoning not about the truth conditions but
[00:03:07] not about the truth conditions but rather about the pragmatic speaker
[00:03:09] rather about the pragmatic speaker who is reasoning about the literal
[00:03:11] who is reasoning about the literal listener who is finally reasoning about
[00:03:13] listener who is finally reasoning about the semantic grammar so you can see that
[00:03:15] the semantic grammar so you can see that there's a kind of recursive back and
[00:03:16] there's a kind of recursive back and forth in this model you might think of
[00:03:18] forth in this model you might think of this as reasoning about other minds and
[00:03:21] this as reasoning about other minds and it's in that recursion that we get
[00:03:24] it's in that recursion that we get pragmatic language use
[00:03:26] pragmatic language use here's a kind of shorthand for the core
[00:03:27] here's a kind of shorthand for the core model components a little literal
[00:03:29] model components a little literal listener is reasoning about the lexicon
[00:03:31] listener is reasoning about the lexicon and the prior overstates
[00:03:33] and the prior overstates the pragmatic speaker reasons about the
[00:03:34] the pragmatic speaker reasons about the literal listener taking message costs
[00:03:36] literal listener taking message costs into account and finally the pragmatic
[00:03:39] into account and finally the pragmatic listener reasons about the pragmatic
[00:03:40] listener reasons about the pragmatic speaker taking the state prior into
[00:03:43] speaker taking the state prior into account and then you can see nicely this
[00:03:44] account and then you can see nicely this point of indirection down to the
[00:03:46] point of indirection down to the semantic lexicon and as i said it's in
[00:03:49] semantic lexicon and as i said it's in that recursion that we get interesting
[00:03:51] that recursion that we get interesting pragmatic language use let me show you
[00:03:53] pragmatic language use let me show you how that happens with a with a small
[00:03:55] how that happens with a with a small example here so along the rows in this i
[00:03:58] example here so along the rows in this i have the messages we're imagining a very
[00:04:00] have the messages we're imagining a very simple language in which there are just
[00:04:02] simple language in which there are just three messages you can think of them as
[00:04:03] three messages you can think of them as shorthand for like um the person i'm
[00:04:05] shorthand for like um the person i'm referring for referring to has a beard
[00:04:07] referring for referring to has a beard the person i'm referring to has glasses
[00:04:09] the person i'm referring to has glasses and so forth and we have just three
[00:04:11] and so forth and we have just three reference and i'll tell you that this is
[00:04:13] reference and i'll tell you that this is david lewis
[00:04:14] david lewis one of the originators of signaling
[00:04:16] one of the originators of signaling systems which is an important precursor
[00:04:18] systems which is an important precursor to rsa
[00:04:19] to rsa this is the philosopher and linguist
[00:04:21] this is the philosopher and linguist paul grice who did foundational work in
[00:04:23] paul grice who did foundational work in pragmatics and this is claude shannon
[00:04:26] pragmatics and this is claude shannon who of course is the developer of
[00:04:27] who of course is the developer of information theory
[00:04:29] information theory and in this table here we have the
[00:04:30] and in this table here we have the semantic grammar the truth conditions of
[00:04:32] semantic grammar the truth conditions of the language you can see that lewis has
[00:04:34] the language you can see that lewis has this wonderful beard
[00:04:36] this wonderful beard uh but neither grice nor shannon have
[00:04:38] uh but neither grice nor shannon have beards
[00:04:39] beards glasses is true of lewis and grice and
[00:04:41] glasses is true of lewis and grice and tie is true of grice and shannon
[00:04:45] tie is true of grice and shannon the literal listener assuming we have
[00:04:47] the literal listener assuming we have flat friars simply row normalizes those
[00:04:50] flat friars simply row normalizes those truth conditions so we go from all these
[00:04:52] truth conditions so we go from all these ones to an even distribution and you can
[00:04:55] ones to an even distribution and you can see that already beard is unambiguous
[00:04:57] see that already beard is unambiguous for this listener but glasses and tie
[00:04:59] for this listener but glasses and tie present what looks like an
[00:05:00] present what looks like an insurmountable ambiguity on hearing
[00:05:02] insurmountable ambiguity on hearing glasses the speedness listener just has
[00:05:04] glasses the speedness listener just has to guess about whether the reference was
[00:05:06] to guess about whether the reference was lewis or grice and same thing for tie
[00:05:10] lewis or grice and same thing for tie when we move to the pragmatic speaker we
[00:05:12] when we move to the pragmatic speaker we already see that the system starts to
[00:05:14] already see that the system starts to become more efficient so we take the
[00:05:16] become more efficient so we take the speaker perspective along the rows now
[00:05:18] speaker perspective along the rows now and we because we're going to assume
[00:05:20] and we because we're going to assume zero message costs we can again just row
[00:05:22] zero message costs we can again just row normalize in this case from the previous
[00:05:25] normalize in this case from the previous matrix having transposed it
[00:05:27] matrix having transposed it and now you can see that on trying to
[00:05:29] and now you can see that on trying to communicate about lewis the speaker
[00:05:31] communicate about lewis the speaker should just choose beard there's an
[00:05:32] should just choose beard there's an overwhelming bias for that
[00:05:34] overwhelming bias for that and down here on observing shannon or
[00:05:36] and down here on observing shannon or wanting to talk about shannon the
[00:05:38] wanting to talk about shannon the speaker should say thai that's
[00:05:39] speaker should say thai that's completely unambiguous but we still have
[00:05:42] completely unambiguous but we still have a problem if we want to refer to grice
[00:05:44] a problem if we want to refer to grice we have kind of no bias about whether we
[00:05:46] we have kind of no bias about whether we should choose glasses or todd but
[00:05:48] should choose glasses or todd but already we have a more efficient system
[00:05:49] already we have a more efficient system than we did for the literal listener
[00:05:52] than we did for the literal listener and then finally when we move to the
[00:05:53] and then finally when we move to the pragmatic listener we have what you
[00:05:55] pragmatic listener we have what you might think of as a completely
[00:05:57] might think of as a completely separating linguistic system
[00:05:59] separating linguistic system uh on hearing beard infer lewis on
[00:06:02] uh on hearing beard infer lewis on hearing glasses your best bet is grice
[00:06:04] hearing glasses your best bet is grice and on hearing tie your best bet is
[00:06:06] and on hearing tie your best bet is shannon and in this way you can see that
[00:06:08] shannon and in this way you can see that we started with a system that looked
[00:06:10] we started with a system that looked hopelessly ambiguous and now in the back
[00:06:12] hopelessly ambiguous and now in the back and forth rsa reasoning we have arrived
[00:06:14] and forth rsa reasoning we have arrived at a system that is probabilistically
[00:06:16] at a system that is probabilistically completely unambiguous and that's the
[00:06:18] completely unambiguous and that's the sense in which we can do pragmatic
[00:06:20] sense in which we can do pragmatic language use
[00:06:21] language use and end up with more efficient languages
[00:06:23] and end up with more efficient languages as a result of this reasoning
[00:06:26] as a result of this reasoning now for natural language generation
[00:06:28] now for natural language generation problems it's often useful to take a
[00:06:30] problems it's often useful to take a speaker perspective as we've discussed
[00:06:32] speaker perspective as we've discussed before and i just want to point out to
[00:06:34] before and i just want to point out to you that it's straightforward to
[00:06:35] you that it's straightforward to formulate this model starting from the
[00:06:37] formulate this model starting from the speaker we would do that down here at
[00:06:38] speaker we would do that down here at the bottom this has the same form as the
[00:06:40] the bottom this has the same form as the previous speakers
[00:06:42] previous speakers we're going to subtract out message
[00:06:44] we're going to subtract out message costs and we have this softmax decision
[00:06:46] costs and we have this softmax decision rule overall but now the speaker of
[00:06:48] rule overall but now the speaker of course will reason directly about the
[00:06:49] course will reason directly about the truth conditions of the language
[00:06:51] truth conditions of the language then we have our pragmatic listener
[00:06:53] then we have our pragmatic listener there's just one for this perspective
[00:06:55] there's just one for this perspective and it looks like just those other
[00:06:57] and it looks like just those other listeners accepted reasons not about the
[00:06:59] listeners accepted reasons not about the truth conditions but rather about that
[00:07:01] truth conditions but rather about that literal speaker
[00:07:02] literal speaker and then finally for our pragmatic
[00:07:04] and then finally for our pragmatic speaker which is the one that you might
[00:07:06] speaker which is the one that you might focus on for generation tasks it has the
[00:07:08] focus on for generation tasks it has the same form as before except now we're
[00:07:10] same form as before except now we're reasoning about the pragmatic listener
[00:07:12] reasoning about the pragmatic listener who is reasoning about the literal
[00:07:14] who is reasoning about the literal speaker so we have that same kind of
[00:07:16] speaker so we have that same kind of indirection
[00:07:18] indirection and once again here's a kind of
[00:07:19] and once again here's a kind of shorthand way of thinking about the
[00:07:21] shorthand way of thinking about the speaker perspective so the literal
[00:07:23] speaker perspective so the literal speaker reasons about the lexicon
[00:07:24] speaker reasons about the lexicon subtracting out costs the pragmatic
[00:07:27] subtracting out costs the pragmatic listener reasons about that literal
[00:07:28] listener reasons about that literal speaker and the state prior and then
[00:07:30] speaker and the state prior and then finally the pragmatic speaker reasons
[00:07:32] finally the pragmatic speaker reasons about the pragmatic listener
[00:07:34] about the pragmatic listener taking message costs into account and
[00:07:36] taking message costs into account and again you see that recursion down into
[00:07:38] again you see that recursion down into the lexicon
[00:07:40] the lexicon now i've given you a glimpse of why this
[00:07:41] now i've given you a glimpse of why this model might be powerful but let's close
[00:07:43] model might be powerful but let's close with some limitations that we might
[00:07:45] with some limitations that we might address in the context of doing modern
[00:07:47] address in the context of doing modern nlp and machine learning
[00:07:49] nlp and machine learning so first we had to hand specify that
[00:07:51] so first we had to hand specify that lexicon in cognitive psychology and
[00:07:53] lexicon in cognitive psychology and linguistics this is often fine we're
[00:07:55] linguistics this is often fine we're going to run a controlled experiment and
[00:07:57] going to run a controlled experiment and specifying the lexicon is not really an
[00:07:59] specifying the lexicon is not really an obstacle but if we would like to work in
[00:08:01] obstacle but if we would like to work in open domains with large corpora this is
[00:08:03] open domains with large corpora this is probably a deal breaker
[00:08:05] probably a deal breaker a related problem arises if you look
[00:08:08] a related problem arises if you look more closely at the way the speaker
[00:08:09] more closely at the way the speaker agents are formulated in their
[00:08:11] agents are formulated in their denominator they have this implicit
[00:08:13] denominator they have this implicit summation over all possible messages
[00:08:15] summation over all possible messages where we do this computation here but in
[00:08:18] where we do this computation here but in the context of a natural language what
[00:08:20] the context of a natural language what does it mean to sum over all messages
[00:08:22] does it mean to sum over all messages that might be an infinite set
[00:08:24] that might be an infinite set and even if it's finite because we make
[00:08:26] and even if it's finite because we make some approximations it's still going to
[00:08:27] some approximations it's still going to be so large as to make this calculation
[00:08:29] be so large as to make this calculation intractable so for computational
[00:08:32] intractable so for computational applications we will have to address
[00:08:34] applications we will have to address this potential shortcoming
[00:08:36] this potential shortcoming it's also rsa what you might think of as
[00:08:39] it's also rsa what you might think of as a very high bias model we have
[00:08:40] a very high bias model we have relatively few chances to learn from
[00:08:42] relatively few chances to learn from data it hardwires in a particular
[00:08:45] data it hardwires in a particular reasoning mechanism as it is inflexible
[00:08:47] reasoning mechanism as it is inflexible about how that mechanism is applied
[00:08:50] about how that mechanism is applied relatedly we might then run up against
[00:08:53] relatedly we might then run up against things like it's difficult to be a
[00:08:54] things like it's difficult to be a speaker and speakers even the pragmatic
[00:08:56] speaker and speakers even the pragmatic ones are not always perfectly rational
[00:08:59] ones are not always perfectly rational in the way the model might portray them
[00:09:01] in the way the model might portray them to be and we might want to capture that
[00:09:03] to be and we might want to capture that if only to do well with actual usage
[00:09:05] if only to do well with actual usage data
[00:09:06] data and relatedly even setting aside the
[00:09:08] and relatedly even setting aside the pressures on speakers to be rational
[00:09:10] pressures on speakers to be rational they just might have preferences for
[00:09:11] they just might have preferences for certain word choices and other things
[00:09:13] certain word choices and other things that the model is simply not even trying
[00:09:15] that the model is simply not even trying to capture and we might hope in the
[00:09:17] to capture and we might hope in the context of a large-scale machine
[00:09:18] context of a large-scale machine learning model that we would have
[00:09:20] learning model that we would have mechanisms for bringing those in
[00:09:22] mechanisms for bringing those in and finally it's just not scalable you
[00:09:24] and finally it's just not scalable you can see that in the first two bullet
[00:09:26] can see that in the first two bullet points and there are many other senses
[00:09:28] points and there are many other senses in which rsa as i've presented it just
[00:09:30] in which rsa as i've presented it just won't scale to the kind of big ambitious
[00:09:32] won't scale to the kind of big ambitious problems that we're trying to tackle in
[00:09:34] problems that we're trying to tackle in this class
[00:09:35] this class the next screencast is going to attempt
[00:09:37] the next screencast is going to attempt to address all of these limitations by
[00:09:39] to address all of these limitations by bringing rsa into large scale machine
[00:09:41] bringing rsa into large scale machine learning models
Lecture 032
Neural RSA | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=aTEX9C2JBsE
---
Transcript
[00:00:04] welcome back everyone this is part six
[00:00:04] welcome back everyone this is part six
[00:00:06] welcome back everyone this is part six in our series on grounded language
[00:00:08] in our series on grounded language understanding we're going to be talking
[00:00:09] understanding we're going to be talking about neural rsa which is our
[00:00:11] about neural rsa which is our combination
[00:00:12] combination of the rational speech acts model with the
[00:00:15] of the rational speech acts model with the kind of machine learning models that
[00:00:16] kind of machine learning models that we've been focused on for this unit and
[00:00:18] we've been focused on for this unit and i'm hoping that this draws together a
[00:00:20] i'm hoping that this draws together a bunch of themes from earlier screencasts
[00:00:22] bunch of themes from earlier screencasts and also sets you up if you choose to to
[00:00:24] and also sets you up if you choose to to apply these ideas in the context of a
[00:00:26] apply these ideas in the context of a original system or a final project i'm
[00:00:30] original system or a final project i'm going to be talking in a general way
[00:00:31] going to be talking in a general way about these ideas they really emerge
[00:00:33] about these ideas they really emerge from the papers that are listed on this
[00:00:34] from the papers that are listed on this slide and the full references are given
[00:00:36] slide and the full references are given at the end of the slideshow
[00:00:39] at the end of the slideshow now what's our motivation here recall
[00:00:41] now what's our motivation here recall that in screencast 4 i presented a bunch
[00:00:43] that in screencast 4 i presented a bunch of tasks that i claim would benefit from
[00:00:46] of tasks that i claim would benefit from the back and forth reasoning that rsa
[00:00:48] the back and forth reasoning that rsa offers grounded in specific context and
[00:00:51] offers grounded in specific context and those tasks included discriminative
[00:00:53] those tasks included discriminative image labeling
[00:00:54] image labeling image captioning machine translation
[00:00:57] image captioning machine translation collaborative problem solving
[00:00:59] collaborative problem solving interpreting complex descriptions
[00:01:01] interpreting complex descriptions especially navigational instructions and
[00:01:03] especially navigational instructions and maybe even optical character recognition
[00:01:05] maybe even optical character recognition and i think we can think of other tasks
[00:01:07] and i think we can think of other tasks that we could put into the mold of like
[00:01:09] that we could put into the mold of like the colors in context task and really
[00:01:12] the colors in context task and really benefit from the mechanisms that rsa
[00:01:14] benefit from the mechanisms that rsa offers
[00:01:15] offers however as we saw at the end of the rsa
[00:01:17] however as we saw at the end of the rsa screencast there are some obstacles to
[00:01:19] screencast there are some obstacles to doing this rsa is standardly presented
[00:01:21] doing this rsa is standardly presented as not especially scalable
[00:01:23] as not especially scalable it's also not especially sensitive to
[00:01:25] it's also not especially sensitive to the kind of variation that we're likely
[00:01:27] the kind of variation that we're likely to see in actual usage data in the
[00:01:29] to see in actual usage data in the large-scale corporate that would support
[00:01:31] large-scale corporate that would support these tasks and relatedly it just
[00:01:33] these tasks and relatedly it just doesn't have any notion of bounded
[00:01:35] doesn't have any notion of bounded rationality even though of course once
[00:01:37] rationality even though of course once humans interact they're not perfectly
[00:01:39] humans interact they're not perfectly rational even in the pragmatic sense
[00:01:41] rational even in the pragmatic sense that rsa offers
[00:01:43] that rsa offers and there's another dimension to this
[00:01:44] and there's another dimension to this problem for motivation here which is
[00:01:46] problem for motivation here which is that rsa harbors a really powerful
[00:01:48] that rsa harbors a really powerful insight and we might hope that we can
[00:01:51] insight and we might hope that we can achieve more impact for that model by
[00:01:53] achieve more impact for that model by bringing in new kinds of assessment for
[00:01:55] bringing in new kinds of assessment for it you know taking it out of the
[00:01:56] it you know taking it out of the psychology and linguistics lab and into
[00:01:58] psychology and linguistics lab and into the world of ai and in turn achieve more
[00:02:01] the world of ai and in turn achieve more impact for rsa and maybe show more of
[00:02:03] impact for rsa and maybe show more of the scientific world that rsa has a
[00:02:05] the scientific world that rsa has a really powerful insight behind it but of
[00:02:08] really powerful insight behind it but of course to realize all of this potential
[00:02:10] course to realize all of this potential we're going to have to overcome some of
[00:02:12] we're going to have to overcome some of those core issues of scalability
[00:02:15] those core issues of scalability and that's what i'll show you here i
[00:02:16] and that's what i'll show you here i think i can offer a simple recipe for
[00:02:18] think i can offer a simple recipe for doing that and testing out a lot of
[00:02:19] doing that and testing out a lot of these ideas
[00:02:21] these ideas to make this concrete let's continue to
[00:02:23] to make this concrete let's continue to ground our discussion in our core task
[00:02:25] ground our discussion in our core task which is this colors in context task
[00:02:27] which is this colors in context task just recall that if you're playing the
[00:02:29] just recall that if you're playing the speaker role you're presented with three
[00:02:31] speaker role you're presented with three color patches one of them privately
[00:02:33] color patches one of them privately designated as your target and your task
[00:02:35] designated as your target and your task is to describe that target in that
[00:02:37] is to describe that target in that context for a listener and then in turn
[00:02:40] context for a listener and then in turn the listener task is given the three
[00:02:42] the listener task is given the three patches and no idea which one is the
[00:02:44] patches and no idea which one is the target and a speaker utterance use that
[00:02:47] target and a speaker utterance use that utterance to figure out which was the
[00:02:49] utterance to figure out which was the speaker's target so you can hear in that
[00:02:51] speaker's target so you can hear in that description that this is potentially a
[00:02:52] description that this is potentially a kind of communication game that would
[00:02:54] kind of communication game that would support the back and forth fourth
[00:02:56] support the back and forth fourth reasoning that is the hallmark of the
[00:02:58] reasoning that is the hallmark of the rational speech acts model
[00:03:00] rational speech x model so how are we going to take this task
[00:03:02] so how are we going to take this task and rsa and combine them well the first
[00:03:06] and rsa and combine them well the first step is straightforward we're going to
[00:03:07] step is straightforward we're going to start with a literal neural speaker i've
[00:03:10] start with a literal neural speaker i've given that as s theta up here with
[00:03:12] given that as s theta up here with literal literal indicating that it's a
[00:03:14] literal literal indicating that it's a base agent
[00:03:15] base agent and for this it's just going to be
[00:03:16] and for this it's just going to be exactly the natural language generation
[00:03:19] exactly the natural language generation system that we explored in the earliest
[00:03:21] system that we explored in the earliest parts of this screencast right except
[00:03:23] parts of this screencast right except now we're going to consume three color
[00:03:25] now we're going to consume three color patches with the target always given in
[00:03:27] patches with the target always given in the final position
[00:03:29] the final position and then the decoding task is to offer a
[00:03:31] and then the decoding task is to offer a description we can make a lot of
[00:03:32] description we can make a lot of different model choices here but the
[00:03:34] different model choices here but the fundamental insight is that we can now
[00:03:36] fundamental insight is that we can now treat this agent as a kind of black box
[00:03:39] treat this agent as a kind of black box based listener instead of having to hand
[00:03:42] based listener instead of having to hand specify a semantic grammar which would
[00:03:44] specify a semantic grammar which would be impossible even for the
[00:03:46] be impossible even for the task the size of the colors and context
[00:03:48] task the size of the colors and context data set we now just train an agent and
[00:03:51] data set we now just train an agent and use it to play the role of the base
[00:03:53] use it to play the role of the base agent and we can of course do the same
[00:03:55] agent and we can of course do the same thing for the neural literal listener
[00:03:57] thing for the neural literal listener we'll again have some parameters theta
[00:03:59] we'll again have some parameters theta which will be represent this entire
[00:04:01] which will be represent this entire encoder decoder architecture this neural
[00:04:04] encoder decoder architecture this neural literal listener will process incoming
[00:04:06] literal listener will process incoming messages as a sequence
[00:04:08] messages as a sequence and then given some context of colors
[00:04:10] and then given some context of colors and a scoring function make a guess
[00:04:12] and a scoring function make a guess about which one of those three colors
[00:04:15] about which one of those three colors the message that it had as input was
[00:04:17] the message that it had as input was being referred to and again instead of
[00:04:19] being referred to and again instead of hand specifying the lexicon we just
[00:04:21] hand specifying the lexicon we just treat this agent as a black box it
[00:04:24] treat this agent as a black box it serves the role of the literal listener
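The listener just described can be sketched in a few lines. This is a toy stand-in, not the lecture's trained architecture: the mean-of-embeddings message encoder, the bilinear scoring matrix `W`, and all the shapes are illustrative assumptions; only the overall shape (encode the message, score each color in the context, softmax) mirrors the description above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def literal_listener(message_ids, colors, emb, W):
    # Toy message encoder: a mean of word embeddings stands in for the
    # trained sequence model described in the lecture.
    msg = emb[message_ids].mean(axis=0)
    # Scoring function: compare the message representation to each of the
    # three colors in the context.
    scores = np.array([msg @ (W @ c) for c in colors])
    # The listener's guess is a distribution over the context colors.
    return softmax(scores)
```

A trained version of an agent with this shape is what plays the black-box literal listener role in the recursion that follows.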
[00:04:27] and from there the RSA recursion, so to speak, is very easy to apply. Let's consider the base case of a pragmatic speaker. So you can see over here, we're going to use our trained literal listener, and this is the most basic form that the speaker can have. We've now just got a pragmatic agent that is reasoning about states of the world as inputs and making message choices on that basis, and it's doing that not in terms of the raw data but rather in terms of how the literal listener would reason about the raw data. That's the core RSA insight: we're essentially just using L0 here as the mechanism to derive the speaker distribution.
[00:05:08] now there is one catch. As we discussed, in principle for RSA this would be a summation over all messages, which would be completely intractable for any realistically sized language model. What we can do to overcome that obstacle is simply use our trained literal speaker, which I presented before, and sample utterances from it; that small sample will serve as the basis for this normalization down here. So it's an approximation, but it's an easy one given that we have this trained agent down here, and in practice we've seen that it does quite well in serving as the normalization constant.
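The sampled-normalization trick can be written down directly. A minimal sketch, assuming `literal_listener(msg, colors)` is any trained function returning a distribution over the context colors, and `sampled_msgs` is the small utterance sample drawn from the trained literal speaker; the `alpha` rationality parameter is a common RSA addition, not something fixed by the lecture.

```python
import numpy as np

def pragmatic_speaker(target_idx, colors, sampled_msgs, literal_listener, alpha=1.0):
    # S1(u | target) is proportional to L0(target | u, colors) ** alpha.
    scores = np.array(
        [literal_listener(u, colors)[target_idx] ** alpha for u in sampled_msgs]
    )
    # Normalize over the sampled utterances only: the tractable stand-in for
    # the intractable sum over all possible messages.
    return scores / scores.sum()
```

The returned vector is a distribution over `sampled_msgs`, so the quality of the approximation depends on how representative that sample is.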
[00:05:43] and then the neural pragmatic listeners are even more straightforward. Having defined that pragmatic speaker, putting a listener on top of it is really easy: again, you essentially just apply Bayes' rule and you get a listener out. And in the Monroe et al. paper, as you've seen, we actually found that weighted combinations of the literal listener and the pragmatic listener were the best at the Colors in Context task.
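Both steps can be sketched as follows. The Bayes'-rule listener follows the description above; the linear mixture in `blended_listener` is one simple way to combine the two listeners, and is an assumption on my part rather than the exact weighting scheme used in the Monroe et al. paper.

```python
import numpy as np

def pragmatic_listener(msg_idx, n_colors, speaker, prior=None):
    # Bayes' rule: L1(t | u) is proportional to S1(u | t) * P(t).
    prior = np.ones(n_colors) / n_colors if prior is None else np.asarray(prior)
    scores = np.array([speaker(t)[msg_idx] for t in range(n_colors)]) * prior
    return scores / scores.sum()

def blended_listener(l0_probs, l1_probs, w=0.5):
    # A simple weighted combination of the literal and pragmatic listeners;
    # w would be tuned on held-out data.
    return w * np.asarray(l0_probs) + (1 - w) * np.asarray(l1_probs)
```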
[00:06:07] let me just close by mentioning a few other related strands of work that you might think about bringing in. What I just showed you is the most basic form of this, but many extensions have been explored. Golland et al. 2010 is a really early paper in the history of these ideas that is quite forward-thinking: they explore recursive speaker-listener reasoning as part of interpreting complex utterances compositionally, with grounding in a simple visual world, and I love the connection with semantic composition. This Wang et al. 2016 paper does even more of that: pragmatic reasoning helps in the online learning of semantic parsers.
[00:06:43] I mentioned before work by Stefanie Tellex and colleagues on what they call inverse semantics, which is a simple RSA mechanism applied in the context of human-robot interaction to help humans and robots collaborate more efficiently. Khani et al. extend this to more free-form social interaction by showing that RSA has a role to play in collaborative games.
[00:07:04] I mentioned before this work by Reuben Cohn-Gordon and Noah Goodman on RSA for translation. Cohn-Gordon did a lot of innovative work on RSA as part of his PhD. He also explored applying RSA at the word and character level, removing the approximation where we sample from the S0 speaker to create the denominator; instead, he applies RSA at every single time step of left-to-right sequential decoding, and that time step could be at the word level or, surprisingly effectively, at the character level.
[00:07:37] and then these final two papers here just show that we could move out of the mode of pre-training the base agents and applying RSA on top, and instead have a mechanism for end-to-end RSA learning, which is more ambitious in terms of learning and model setup but provides more chances for us to be responsive to the nature of actual usage data, while still making good on the central insights of RSA and, with luck, seeing some empirical benefits from doing that.
Lecture 033
Natural Language Inference | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=6-NV9lzm8qw
---
Transcript
[00:00:04] welcome everyone. This is the first screencast in our series on natural language inference, or NLI. This is one of my favorite problems. What I'd like to do is give you a sense for how the task is formulated and then situate the task within the broader landscape of ideas for NLU.
[00:00:20] as usual, we have a bunch of materials that would allow you to get hands-on with this problem. The core module is nli.py, and then there are two notebooks. The first introduces the three datasets that we'll be exploring in detail: SNLI, MultiNLI, and Adversarial NLI. The second notebook offers a bunch of different modeling approaches. It really covers core approaches that people have taken to NLI in the past, and I hope it points to some avenues for modifying those architectures, possibly in the service of developing an original system for a final project.
[00:00:52] there's also an associated homework and bake-off. I emphasize that this is not required for us this quarter; I'm mentioning it because I think the problem is an interesting one. It's a word entailment task, an interesting small-scale problem that I believe could be used to stress-test an NLI system in interesting ways.
[00:01:11] the core readings cover the three datasets that will be in focus for us, and then the final reading listed here was, I think, the paper that introduced attention mechanisms into the study of NLI, and that had an impact that went well beyond this task. And then for additional readings I'm suggesting a range of things. Some of these readings cover core fundamentals for deep learning that I think will be useful in the context of studying NLI; some of them help you toward a foundational understanding of the NLI task and how you might think about it; and some of them are meant to push us to stress-test our systems, think adversarially, and maybe find artifacts in our datasets. Those are going to be themes of later screencasts in this series.
[00:01:55] to begin getting a sense for how the task is formulated, let's start with some simple examples. In NLI we have as our inputs a premise sentence and a hypothesis sentence, and the task is a classification one. In this simple example here, the premise sentence is "a turtle danced" and the hypothesis sentence is "a turtle moved". Both of those are system inputs, and our task is to assign one of three labels; in this case the correct label would be "entails".
[00:02:22] label would be in tails the second example looks simple but it
[00:02:23] the second example looks simple but it actually begins to suggest how the task
[00:02:25] actually begins to suggest how the task is actually formulated we have as our
[00:02:27] is actually formulated we have as our premise turtle and as our hypothesis
[00:02:30] premise turtle and as our hypothesis linguist and what we would like to do in
[00:02:32] linguist and what we would like to do in the context of nli is assign that the
[00:02:34] the context of nli is assign that the contradicts label now you might pause
[00:02:36] contradicts label now you might pause there and think it's not a logical fact
[00:02:39] there and think it's not a logical fact that turtles can't be linguists so
[00:02:41] that turtles can't be linguists so surely contradiction is too strong but
[00:02:43] surely contradiction is too strong but it is a common sense fact a kind of
[00:02:45] it is a common sense fact a kind of natural inference about the world we
[00:02:47] natural inference about the world we live in that no turtles are linguists
[00:02:49] live in that no turtles are linguists and it's for that reason that we would
[00:02:51] and it's for that reason that we would choose the contradicts label and that
[00:02:52] choose the contradicts label and that begins to key into the fact that
[00:02:54] begins to key into the fact that fundamentally nli is not a logical
[00:02:57] fundamentally nli is not a logical reasoning task but a more general common
[00:02:59] reasoning task but a more general common sense reasoning task
[00:03:03] "every reptile danced" is neutral with respect to "a turtle ate", which is just to say that these two sentences can be true or false independently of each other. And now, with entails, contradicts, and neutral, we have the three labels that are standardly used for NLI datasets at this point.
[00:03:20] let's look at some additional examples. "Some turtles walk" contradicts "no turtles move"; I think that's straightforward. Here's one that shows how intricate this can get: the premise "James Byron Dean refused to move without blue jeans" entails "James Dean didn't dance without pants". This highlights two aspects of the problem. First, you might have to do some complex named entity recognition on "James Byron Dean" and "James Dean" to figure out that these are co-referring expressions. And you also might encounter real linguistic complexity, in this case involving how negations interact with each other.
[00:03:55] negations interact with each other this next example begins to show how
[00:03:57] this next example begins to show how much common sense reasoning could could
[00:03:59] much common sense reasoning could could be brought into the task so the premise
[00:04:02] be brought into the task so the premise is mitsubishi's new vehicle sales in the
[00:04:04] is mitsubishi's new vehicle sales in the u.s fell 46 in june and the hypothesis
[00:04:08] u.s fell 46 in june and the hypothesis is mitsubishi sales rose 46
[00:04:11] is mitsubishi sales rose 46 we would standardly say that that is in
[00:04:13] we would standardly say that that is in the contradiction relation now again you
[00:04:15] the contradiction relation now again you might pause and think it is certainly
[00:04:17] might pause and think it is certainly possible even in our world that
[00:04:19] possible even in our world that mitsubishi could see a 46 rise and fall
[00:04:22] mitsubishi could see a 46 rise and fall in the same month so surely these should
[00:04:24] in the same month so surely these should be labeled neutral
[00:04:26] be labeled neutral but i think what you'll find in nli data
[00:04:28] but i think what you'll find in nli data sets is that these are called
[00:04:30] sets is that these are called contradiction on the informal assumption
[00:04:32] contradiction on the informal assumption that the premise and hypothesis are
[00:04:34] that the premise and hypothesis are talking about the same event
[00:04:36] talking about the same event and in that context we would say that
[00:04:38] and in that context we would say that these are common sense contradictions
[00:04:41] these are common sense contradictions here's another example that highlights
[00:04:43] here's another example that highlights how much pragmatics could be brought
[00:04:44] how much pragmatics could be brought into the problem the premise is acme
[00:04:46] into the problem the premise is acme reported that its ceo resigned and the
[00:04:49] reported that its ceo resigned and the hypothesis is that acme ceo resigned we
[00:04:52] hypothesis is that acme ceo resigned we would probably say entailment there even
[00:04:54] would probably say entailment there even though in a strict logical sense the
[00:04:56] though in a strict logical sense the premise does not entail the hypothesis
[00:04:58] premise does not entail the hypothesis because of course the company could be
[00:05:00] because of course the company could be reporting things that are false but here
[00:05:02] reporting things that are false but here we kind of make an assumption that the
[00:05:04] we kind of make an assumption that the company is of an authority and will
[00:05:07] company is of an authority and will likely report true things about facts
[00:05:09] likely report true things about facts like this and therefore we allow that
[00:05:11] like this and therefore we allow that this would be in the entailment relation
[00:05:13] this would be in the entailment relation again not logical but much more like
[00:05:15] again not logical but much more like common sense
[00:05:17] common sense so just to emphasize this here's kind of
[00:05:19] so just to emphasize this here's kind of the fundamental question that we
[00:05:21] the fundamental question that we confront does the premise justify an
[00:05:23] confront does the premise justify an inference to the hypothesis
[00:05:26] inference to the hypothesis common sense reasoning rather than
[00:05:28] common sense reasoning rather than strict logic
[00:05:29] strict logic two other characteristics of this task
[00:05:31] two other characteristics of this task in the modern era are first there's a
[00:05:33] in the modern era are first there's a focus on local inference steps that is
[00:05:35] focus on local inference steps that is just one premise and one hypothesis
[00:05:38] just one premise and one hypothesis rather than long deductive chains
[00:05:41] rather than long deductive chains and the second is that the emphasis is
[00:05:42] and the second is that the emphasis is really on the variability of linguistic
[00:05:44] really on the variability of linguistic expressions when people have created the
[00:05:46] expressions when people have created the large benchmark tasks in this space they
[00:05:49] large benchmark tasks in this space they have largely focused on just collecting
[00:05:51] have largely focused on just collecting a lot of data
[00:05:52] a lot of data and not placed any special emphasis on
[00:05:54] and not placed any special emphasis on collecting examples that have a lot of
[00:05:56] collecting examples that have a lot of negations or quantifiers or something
[00:05:59] negations or quantifiers or something that would really shine a spotlight on
[00:06:01] that would really shine a spotlight on linguistic and semantic complexity so
[00:06:03] linguistic and semantic complexity so that's worth keeping in mind about how
[00:06:05] that's worth keeping in mind about how we're thinking about the task
[00:06:07] we're thinking about the task in the present day
[00:06:09] in the present day if you would like additional
[00:06:10] if you would like additional perspectives on this including some
[00:06:12] perspectives on this including some disputes about exactly how to think
[00:06:14] disputes about exactly how to think about the problem and what would be the
[00:06:15] about the problem and what would be the most productive i would encourage you to
[00:06:17] most productive i would encourage you to check out these three papers by a lot of
[00:06:19] check out these three papers by a lot of stanford researchers
[00:06:21] stanford researchers i think the fundamental outcome of this
[00:06:23] i think the fundamental outcome of this is that we do want to focus on common
[00:06:24] is that we do want to focus on common sense reasoning even though that's a
[00:06:26] sense reasoning even though that's a kind of amorphous and difficult to
[00:06:29] kind of amorphous and difficult to define concept it's nonetheless arguably
[00:06:32] define concept it's nonetheless arguably the useful one for us when we think
[00:06:34] the useful one for us when we think about developing practical systems
[00:06:38] now, in a visionary paper that really set the agenda for NLI, Dagan et al. 2006 make a lot of connections between NLI and the broader landscape of NLU. Let me just read this opening statement: "it seems that major inferences, as needed by multiple applications, can indeed be cast in terms of textual entailment. Consequently, we hypothesize that textual entailment recognition is a suitable generic task for evaluating and comparing applied semantic inference models. Eventually, such efforts can promote the development of entailment recognition engines, which may provide useful generic modules across applications."
[00:07:16] it's a wonderful vision, and the spin we might put on it in the present day is this: since reasoning about entailment and contradiction is truly fundamental to our use of language, we might hope that pre-training on the NLI task would give us representations that are useful in lots of different contexts.
[00:07:36] and Dagan et al. actually continue by showing that we can formulate a lot of traditional tasks as NLI tasks, and here are just a few examples of that. If our task is paraphrase, we might say that in the NLI context that means we want equality, or mutual entailment, between the text and the paraphrase, that is, the premise and hypothesis. For summarization we would do something weaker: we would hope just that the original text entailed the summary, allowing that the summary might be weaker or more general. For information retrieval we do the reverse: here we want to find documents that entail the query. And for question answering it's similar: we could formulate it as an entailment test by saying that we want to find answers that entail the question, and the way we might think about entailment for questions is illustrated here, where we would informally convert a question like "who left?" into "someone left" to give us a statement. Then we could say that "Sandy left" is an answer to "who left?" in the sense that it entails "someone left".
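These task reductions can be stated compactly. Here `entails(premise, hypothesis)` is a hypothetical stand-in for any NLI system's yes/no entailment judgment, and the function names are my own labels for the mappings just described.

```python
def is_paraphrase(entails, text, candidate):
    # Paraphrase: mutual entailment between the two texts.
    return entails(text, candidate) and entails(candidate, text)

def is_summary(entails, text, summary):
    # Summarization: the original text entails the (possibly weaker) summary.
    return entails(text, summary)

def is_relevant(entails, document, query):
    # Information retrieval: find documents that entail the query.
    return entails(document, query)

def is_answer(entails, candidate_answer, question_as_statement):
    # QA: "who left?" is converted to "someone left"; an answer entails it.
    return entails(candidate_answer, question_as_statement)
```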
[00:08:40] the sense that it entails someone left and i think there are many other tasks
[00:08:41] and i think there are many other tasks that we could formulate in this way and
[00:08:43] that we could formulate in this way and it does show you just how fundamental
[00:08:45] it does show you just how fundamental entailment and contradiction are
[00:08:48] entailment and contradiction are to reasoning and language
[00:08:51] to reasoning and language and finally let me give you a sense for
[00:08:52] and finally let me give you a sense for the model landscape and how it has
[00:08:54] the model landscape and how it has changed nli is a pretty old problem in
[00:08:57] changed nli is a pretty old problem in the field and as a result we've seen a
[00:08:59] the field and as a result we've seen a wide spectrum of different approaches in
[00:09:02] wide spectrum of different approaches in the earliest days you had a lot of
[00:09:03] the earliest days you had a lot of systems that were kind of focused on
[00:09:05] systems that were kind of focused on logic and theorem proving i've
[00:09:07] logic and theorem proving i've characterized those systems here as
[00:09:09] characterized those systems here as offering really deep representations but
[00:09:12] offering really deep representations but they weren't especially effective in the
[00:09:14] they weren't especially effective in the sense that they worked only for the
[00:09:15] sense that they worked only for the domains and examples that the system
[00:09:18] domains and examples that the system designers had been able to anticipate so
[00:09:20] designers had been able to anticipate so they're kind of brittle
[00:09:22] they're kind of brittle uh following that you have a kind of
[00:09:24] uh following that you have a kind of exploration of what bill maccartney
[00:09:26] exploration of what bill maccartney called natural logic approaches bill was
[00:09:28] called natural logic approaches bill was one of the
[00:09:30] one of the early innovators in this space i think
[00:09:31] early innovators in this space i think he actually coined the term natural
[00:09:33] he actually coined the term natural language inference and he explored
[00:09:35] language inference and he explored natural logic which has some of the
[00:09:36] natural logic which has some of the aspects of logic and theorem proving but
[00:09:39] aspects of logic and theorem proving but it's kind of more open and easily
[00:09:41] it's kind of more open and easily amenable to tackling a lot of data
[00:09:44] amenable to tackling a lot of data and so those systems were consequently a
[00:09:46] and so those systems were consequently a little less deep but also more effective
[00:09:48] little less deep but also more effective and a similar thing happened with the
[00:09:49] and a similar thing happened with the semantic graphs which provide
[00:09:51] semantic graphs which provide rich conceptual representations of the
[00:09:54] rich conceptual representations of the underlying domain that we want to reason
[00:09:55] underlying domain that we want to reason about
[00:09:57] about another interesting thing here is that
[00:09:59] another interesting thing here is that you know
[00:10:00] you know until recently
[00:10:02] until recently it was the case that clever hand-built
[00:10:04] it was the case that clever hand-built features which i'll show you some a bit
[00:10:06] features which i'll show you some a bit later in this screencast series
[00:10:08] later in this screencast series they were really in the lead and simple
[00:10:10] they were really in the lead and simple engram variations you know traditional
[00:10:12] n-gram variations you know traditional models with hand-built features uh they
[00:10:14] models with hand-built features uh they were the best models
[00:10:16] were the best models there was a kind of faith early on in
[00:10:18] there was a kind of faith early on in the deep learning revolution that
[00:10:20] the deep learning revolution that eventually those models would prove to
[00:10:22] eventually those models would prove to be the best at this task but at the time
[00:10:24] be the best at this task but at the time we just didn't have the data sets that
[00:10:26] we just didn't have the data sets that would support that would provide
[00:10:28] would support that would provide evidence for that kind of claim
[00:10:31] evidence for that kind of claim and so as a result for a while deep
[00:10:32] and so as a result for a while deep learning systems really lagged behind
[00:10:34] learning systems really lagged behind these more traditional approaches and i
[00:10:36] these more traditional approaches and i would say that it was really in about
[00:10:37] would say that it was really in about 2017 that deep learning pulled ahead and
[00:10:40] 2017 that deep learning pulled ahead and that's a result of modeling innovations
[00:10:42] that's a result of modeling innovations and also the arrival of some really
[00:10:44] and also the arrival of some really large benchmark data sets that would
[00:10:46] large benchmark data sets that would allow us to train systems
[00:10:49] allow us to train systems that were effective for the task and
[00:10:51] that were effective for the task and it's at that point you see that deep
[00:10:52] it's at that point you see that deep learning kind of took over
[00:10:54] learning kind of took over and as a result in subsequent
[00:10:55] and as a result in subsequent screencasts we too will be focused on
[00:10:57] screencasts we too will be focused on deep learning architectures for the nli
[00:11:00] deep learning architectures for the nli problem
Lecture 034
SNLI, MultiNLI, and Adversarial NLI | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=NAMNv4M2j3g
---
Transcript
[00:00:04] welcome everyone to part two in our
[00:00:06] welcome everyone to part two in our series on natural language inference uh
[00:00:08] series on natural language inference uh we're going to focus on the three data
[00:00:10] we're going to focus on the three data sets that we'll be concentrating on this
[00:00:11] sets that we'll be concentrating on this unit which are snli the stanford natural
[00:00:14] unit which are snli the stanford natural language inference corpus multi-nli and
[00:00:17] language inference corpus multi-nli and adversarial nli i think they're
[00:00:18] adversarial nli i think they're interestingly different and they're all
[00:00:20] interestingly different and they're all big benchmark tasks that can support the
[00:00:22] big benchmark tasks that can support the training of lots of diverse kinds of
[00:00:24] training of lots of diverse kinds of systems
[00:00:27] so let's begin with snli which is the
[00:00:29] so let's begin with snli which is the first to appear of these three the
[00:00:31] first to appear of these three the associated paper is bowman et al 2015.
[00:00:34] associated paper is bowman et al 2015. sam bowman was a student in the nlp
[00:00:35] sam bowman was a student in the nlp group and i was his advisor along with
[00:00:37] group and i was his advisor along with chris manning
[00:00:38] chris manning and a bunch of us contributed to that
[00:00:40] and a bunch of us contributed to that paper
[00:00:41] paper uh an important thing to know about snli
[00:00:42] uh an important thing to know about snli is that the premises are all image
[00:00:44] is that the premises are all image captions from the flickr30k data
[00:00:47] captions from the flickr30k data set so that's an important genre
[00:00:49] set so that's an important genre restriction that you should be aware of
[00:00:50] restriction that you should be aware of when you think about training systems on
[00:00:52] when you think about training systems on this data
[00:00:54] this data all the hypotheses were written by crowd
[00:00:56] all the hypotheses were written by crowd workers i'll show you the prompt in a
[00:00:58] workers i'll show you the prompt in a little bit but the idea is they were
[00:00:59] little bit but the idea is they were given this premise which was an image
[00:01:01] given this premise which was an image caption and then they wrote three
[00:01:03] caption and then they wrote three different texts corresponding to the
[00:01:05] different texts corresponding to the three nli labels
[00:01:08] three nli labels unfortunately as is common with crowd
[00:01:10] unfortunately as is common with crowd source data sets you should be aware
[00:01:11] source data sets you should be aware that some of the sentences do reflect
[00:01:13] that some of the sentences do reflect stereotypes i think this traces to the
[00:01:14] stereotypes i think this traces to the fact that crowd workers trying to do a
[00:01:16] fact that crowd workers trying to do a lot of work
[00:01:18] lot of work are faced with a creative block and the
[00:01:20] are faced with a creative block and the way they overcome that is by falling
[00:01:21] way they overcome that is by falling back on easy tricks and some of those
[00:01:23] back on easy tricks and some of those involved stereotypes completely
[00:01:25] involved stereotypes completely understandable and this is something
[00:01:26] understandable and this is something that the field is trying to come to
[00:01:28] that the field is trying to come to grips with as we think about data set
[00:01:30] grips with as we think about data set creation
[00:01:31] creation it's a big data set it has over 550 000
[00:01:34] it's a big data set it has over 550 000 training examples and it has dev and
[00:01:36] training examples and it has dev and test sets each of 10 000 examples
[00:01:40] test sets each of 10 000 examples balanced across the three classes here's
[00:01:42] balanced across the three classes here's a look at the mean token lengths it's just
[00:01:43] a look at the mean token lengths it's just sort of noteworthy that premises are a
[00:01:45] sort of noteworthy that premises are a little bit longer than hypotheses i
[00:01:47] little bit longer than hypotheses i guess that comes down to the fact that
[00:01:48] guess that comes down to the fact that crowd workers were writing these
[00:01:50] crowd workers were writing these sentences
[00:01:51] sentences in terms of clause types mostly we
[00:01:54] in terms of clause types mostly we talk about nli as a sentence task but in
[00:01:56] talk about nli as a sentence task but in fact um only 74 percent of the examples are
[00:01:59] fact um only 74 percent of the examples are sentences that is s-rooted in their
[00:02:01] sentences that is s rooted in their syntactic parses it has a large
[00:02:03] syntactic parses it has a large vocabulary but maybe modest relative to
[00:02:05] vocabulary but maybe modest relative to the size of the data set and that might
[00:02:07] the size of the data set and that might come back to the fact that the genre is
[00:02:09] come back to the fact that the genre is kind of restricted
[00:02:11] kind of restricted we had about 60 000 examples that were
[00:02:13] we had about 60 000 examples that were additionally validated by four other
[00:02:16] additionally validated by four other annotators and i'll show you the
[00:02:18] annotators and i'll show you the response distributions which suggest
[00:02:19] response distributions which suggest some sources of variation
[00:02:22] some sources of variation they had high inter-annotator agreements
[00:02:24] they had high inter-annotator agreements so given that validation
[00:02:27] so given that validation about 60 percent of examples had a
[00:02:28] about 60 percent of examples had a unanimous gold label and we rate the
[00:02:31] unanimous gold label and we rate the overall human level of agreement at
[00:02:33] overall human level of agreement at about 91.2
[00:02:35] about 91.2 for the gold labels and that's kind of
[00:02:36] for the gold labels and that's kind of the measure of human performance that's
[00:02:38] the measure of human performance that's commonly used for snli
[00:02:40] commonly used for snli uh the overall fleiss kappa measure of inter
[00:02:43] uh the overall fleiss kappa measure of inter annotator agreement is 0.7 which is
[00:02:45] annotator agreement is 0.7 which is a high rate of agreement and
[00:02:47] which is a high rate of agreement and then for the leaderboard you can check
[00:02:48] then for the leaderboard you can check out this link here sam has been good
[00:02:50] out this link here sam has been good about curating all the systems that
[00:02:51] about curating all the systems that enter and you can get a sense for which
[00:02:53] enter and you can get a sense for which approaches are best it's clear at this
[00:02:55] approaches are best it's clear at this point for example that ensembles of deep
[00:02:57] point for example that ensembles of deep learning methods are the best for this
[00:02:59] learning methods are the best for this problem
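For reference, the fleiss kappa figure cited above (0.7 over the validated subset) can be computed directly from per-example label counts. This is a generic sketch of the statistic, not the exact evaluation script used for snli.

```python
from collections import Counter

def fleiss_kappa(ratings, categories=("entailment", "neutral", "contradiction")):
    """Fleiss' kappa for N items, each labeled by the same number of raters.

    `ratings` is a list of per-item label lists, e.g. five labels per
    example as in SNLI's validated subset."""
    n = len(ratings[0])   # raters per item
    N = len(ratings)      # number of items
    counts = [Counter(item) for item in ratings]
    # Observed agreement: average pairwise agreement within each item.
    P_bar = sum((sum(c * c for c in cnt.values()) - n) / (n * (n - 1))
                for cnt in counts) / N
    # Chance agreement from the overall category proportions.
    p_j = [sum(cnt[cat] for cnt in counts) / (N * n) for cat in categories]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement on three items gives kappa = 1.0.
labels = [["entailment"] * 5, ["neutral"] * 5, ["contradiction"] * 5]
print(fleiss_kappa(labels))  # 1.0
```

kappa corrects raw agreement for the agreement you would expect by chance, which is why it is a more conservative number than the 91.2 human-performance figure.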
[00:03:01] i mentioned before the crowdsourcing
[00:03:03] i mentioned before the crowdsourcing methods i think it's worth thinking
[00:03:04] methods i think it's worth thinking about precisely what happened here so
[00:03:05] about precisely what happened here so here's the crowdsourcing interface
[00:03:08] here's the crowdsourcing interface there's some instructions up here here's
[00:03:10] there's some instructions up here here's the caption that is the premise sentence
[00:03:12] the caption that is the premise sentence in our terms a little boy in an apron
[00:03:14] in our terms a little boy in an apron helps his mother and then the crowd
[00:03:16] helps his mother and then the crowd worker had to come up with three
[00:03:17] worker had to come up with three sentences one definitely correct that's
[00:03:19] sentences one definitely correct that's an entailment case
[00:03:20] an entailment case one may be correct that is our gloss on
[00:03:23] one may be correct that is our gloss on neutral and one definitely incorrect
[00:03:25] neutral and one definitely incorrect which is our gloss on contradiction so
[00:03:27] which is our gloss on contradiction so you can see here that there's an attempt
[00:03:29] you can see here that there's an attempt to use informal language connecting with
[00:03:31] to use informal language connecting with informal reasoning common sense
[00:03:32] informal reasoning common sense reasoning
[00:03:34] reasoning in the prompt here and then those get
[00:03:35] in the prompt here and then those get translated into our three labels for the
[00:03:37] translated into our three labels for the task
[00:03:39] task and here are some examples from the
[00:03:40] and here are some examples from the validated set and i think they're sort
[00:03:42] validated set and i think they're sort of interesting because you get high
[00:03:44] of interesting because you get high rates of agreement but you do find some
[00:03:46] rates of agreement but you do find some examples that have a lot of uncertainty
[00:03:48] examples that have a lot of uncertainty about them like this last one here
[00:03:51] about them like this last one here and i think that might be a hallmark
[00:03:52] and i think that might be a hallmark actually of nli problems
[00:03:55] now one really fundamental thing that i
[00:03:57] now one really fundamental thing that i mentioned in the overview screencast is
[00:03:59] mentioned in the overview screencast is definitely worth being aware of relates
[00:04:01] definitely worth being aware of relates specifically to the contradiction
[00:04:03] specifically to the contradiction relation and there's discussion of this
[00:04:04] relation and there's discussion of this in the paper it's a tricky point
[00:04:07] in the paper it's a tricky point what we say for snli using these simple
[00:04:09] what we say for snli using these simple examples here is that both of them are
[00:04:11] examples here is that both of them are in the contradiction relation the first
[00:04:13] in the contradiction relation the first one is a boat sank in the pacific ocean
[00:04:16] one is a boat sank in the pacific ocean as premise and hypothesis a boat
[00:04:18] as premise and hypothesis a boat sank in the atlantic ocean you might ask
[00:04:20] sank in the atlantic ocean you might ask of course those could be true together
[00:04:22] of course those could be true together they should be neutral not contradiction
[00:04:25] they should be neutral not contradiction the reason we call them contradiction is
[00:04:26] the reason we call them contradiction is because we make an assumption of event
[00:04:28] because we make an assumption of event co-reference that we're talking about
[00:04:30] co-reference that we're talking about the same boat in the same event and
[00:04:32] the same boat in the same event and therefore the locations contradict each
[00:04:34] therefore the locations contradict each other in a common sense way
[00:04:37] other in a common sense way and the second example is an even more
[00:04:39] and the second example is an even more extreme case of this ruth bader ginsburg
[00:04:41] extreme case of this ruth bader ginsburg was appointed to the supreme court and i
[00:04:42] was appointed to the supreme court and i had a sandwich for lunch today we say
[00:04:45] had a sandwich for lunch today we say those are in the contradiction relation
[00:04:47] those are in the contradiction relation of course they could be true together
[00:04:49] of course they could be true together but they couldn't in our terms be true
[00:04:51] but they couldn't in our terms be true of the same event
[00:04:53] of the same event they're describing very different events
[00:04:55] they're describing very different events and for that reason they get the
[00:04:56] and for that reason they get the contradiction label
[00:05:00] if a premise and hypothesis probably
[00:05:02] if a premise and hypothesis probably describe a different photo then the
[00:05:04] describe a different photo then the label is contradiction that's the kind
[00:05:06] label is contradiction that's the kind of anchoring back into our underlying
[00:05:08] of anchoring back into our underlying domain that you might have in mind
[00:05:11] domain that you might have in mind we can mark progress on snli because sam
[00:05:14] we can mark progress on snli because sam has been curating that leaderboard as i
[00:05:15] has been curating that leaderboard as i mentioned before we estimate human
[00:05:17] mentioned before we estimate human performance up here at almost 92 and
[00:05:19] performance up here at almost 92 and along this x-axis here i've got time
[00:05:24] and you can see that very quickly the
[00:05:26] and you can see that very quickly the community has hill climbed toward
[00:05:28] community has hill climbed toward systems that are super human according
[00:05:30] systems that are super human according to our estimate right down here at 78 is
[00:05:33] to our estimate right down here at 78 is the original paper that was from an era
[00:05:35] the original paper that was from an era when deep learning systems were really
[00:05:36] when deep learning systems were really not clearly the winners in this kind of
[00:05:38] not clearly the winners in this kind of competition but snli helped change that
[00:05:41] competition but snli helped change that by introducing a lot of new data so you
[00:05:43] by introducing a lot of new data so you see a very very rapid rise in system
[00:05:45] see a very very rapid rise in system performance and then basically monotonic
[00:05:48] performance and then basically monotonic increase until 2019 when we saw the
[00:05:50] increase until 2019 when we saw the first systems that were in these
[00:05:52] first systems that were in these restrictive terms better than humans at
[00:05:55] restrictive terms better than humans at the snli task
[00:05:58] the snli task let's move to multi-nli which was a kind
[00:06:00] let's move to multi-nli which was a kind of um
[00:06:02] of um successor to snli this was collected by
[00:06:05] successor to snli this was collected by adina williams and colleagues including
[00:06:06] adina williams and colleagues including sam bowman
[00:06:08] sam bowman um the trained premises in this case are
[00:06:10] um the trained premises in this case are going to be much more diverse they're
[00:06:11] going to be much more diverse they're drawn from five genres fiction
[00:06:14] drawn from five genres fiction government reports and letters and
[00:06:16] government reports and letters and things the slate website
[00:06:19] things the slate website the switchboard corpus which is people
[00:06:21] the switchboard corpus which is people interacting over the phone
[00:06:23] interacting over the phone and berlitz travel guides
[00:06:26] and berlitz travel guides and then interestingly they have
[00:06:27] and then interestingly they have additional genres just for dev and test
[00:06:29] additional genres just for dev and test and this is what they call the
[00:06:30] and this is what they call the mismatched condition and those are the
[00:06:32] mismatched condition and those are the 9/11 report
[00:06:34] 9/11 report face-to-face conversations
[00:06:36] face-to-face conversations fundraising letters and non-fiction from
[00:06:39] fundraising letters and non-fiction from oxford university press as well as
[00:06:41] oxford university press as well as articles about linguistics
[00:06:44] articles about linguistics so this is noteworthy because in the
[00:06:46] so this is noteworthy because in the mismatched condition that
[00:06:47] mismatched condition that multi-nli sets up you are forced to
[00:06:50] multi-nli sets up you are forced to train on those train examples and then
[00:06:52] train on those train examples and then test on entirely new genres you can just
[00:06:54] test on entirely new genres you can just see how different for example berlin's
[00:06:56] see how different for example berlin's travel guides might be from the 911
[00:06:58] travel guides might be from the 911 report i think this is an interesting
[00:07:00] report i think this is an interesting early example of being adversarial and
[00:07:03] early example of being adversarial and of forcing our systems to grapple with
[00:07:05] of forcing our systems to grapple with new domains and new genres i think
[00:07:08] new domains and new genres i think that's a really productive step in
[00:07:10] that's a really productive step in testing these systems for robustness
[00:07:13] testing these systems for robustness it's another large data set slightly
[00:07:15] it's another large data set slightly smaller than snli but actually the
[00:07:17] smaller than snli but actually the example lengths tend to be longer they
[00:07:20] example lengths tend to be longer they did the same kind of a validation and
[00:07:22] did the same kind of a validation and that gives us our estimates of human
[00:07:23] that gives us our estimates of human performance and once again i would say
[00:07:26] performance and once again i would say that we can have a lot of confidence
[00:07:27] that we can have a lot of confidence there was a high rate of agreement
[00:07:30] there was a high rate of agreement 92.6 is the traditional measure of human
[00:07:32] 92.6 is the traditional measure of human performance here
[00:07:34] performance here for multi-nli the test set is
[00:07:36] for multi-nli the test set is available only as a kaggle competition
[00:07:38] available only as a kaggle competition and you can check out the product
[00:07:40] and you can check out the product project page here
[00:07:43] project page here i love the fact that multi-nli was
[00:07:45] i love the fact that multi-nli was distributed with annotations that could
[00:07:47] distributed with annotations that could help someone kind of do out-of-the-box
[00:07:49] help someone kind of do out-of-the-box error analysis what they did is have
[00:07:51] error analysis what they did is have linguists go through and label specific
[00:07:53] linguists go through and label specific examples for whether or not they
[00:07:55] examples for whether or not they manifested specific linguistic phenomena
[00:07:57] manifested specific linguistic phenomena like do the premise and hypothesis
[00:07:59] like do the premise and hypothesis involve variation in active passive
[00:08:01] involve variation in active passive morphology which might be a clue that
[00:08:04] morphology which might be a clue that the sentences are synonymous or an
[00:08:06] the sentences are synonymous or an entailment relation
[00:08:08] entailment relation but nonetheless hard for systems to
[00:08:09] but nonetheless hard for systems to predict because of the change in word
[00:08:11] predict because of the change in word order also things like whether there are
[00:08:13] order also things like whether there are belief statements conditionals whether
[00:08:15] belief statements conditionals whether co-references involved in a non-trivial
[00:08:17] co-references involved in a non-trivial way modality negation quantifiers things
[00:08:21] way modality negation quantifiers things that you might think would be good
[00:08:22] that you might think would be good probes for the true systematicity of the
[00:08:25] probes for the true systematicity of the model you've trained and you can use
[00:08:26] model you've trained and you can use these annotations to kind of benchmark
[00:08:28] these annotations to kind of benchmark yourself there i think that's incredibly
[00:08:30] yourself there i think that's incredibly productive
[00:08:32] productive how are we doing on multi-nli so again we're
[00:08:35] how are we doing on multi-nli so again we're going to have our score over here and on
[00:08:37] going to have our score over here and on the x-axis time
[00:08:39] the x-axis time we have that human estimate at 92.6
[00:08:42] we have that human estimate at 92.6 and since it's on kaggle we can look at
[00:08:45] and since it's on kaggle we can look at lots more systems for snli we just have
[00:08:47] lots more systems for snli we just have the published papers but on kaggle lots
[00:08:49] the published papers but on kaggle lots of people enter and they try lots of
[00:08:50] of people enter and they try lots of different things as a result you get
[00:08:52] different things as a result you get much more variance across this it's much
[00:08:54] much more variance across this it's much less monotonic but nonetheless you can
[00:08:56] less monotonic but nonetheless you can see that the community is rapidly hill
[00:08:59] see that the community is rapidly hill climbing toward superhuman performance
[00:09:01] climbing toward superhuman performance on this task as well and again i would
[00:09:03] on this task as well and again i would just want to reiterate recalling themes
[00:09:05] just want to reiterate recalling themes from our introductory lecture this does
[00:09:07] from our introductory lecture this does not necessarily mean that we have
[00:09:08] not necessarily mean that we have systems that are super human at the task
[00:09:10] systems that are super human at the task of common sense reasoning which is a
[00:09:12] of common sense reasoning which is a very human and complex thing but rather
[00:09:15] very human and complex thing but rather systems that are just narrowly
[00:09:16] systems that are just narrowly outperforming humans on this one
[00:09:18] outperforming humans on this one particular very machine-like metric
[00:09:21] particular very machine-like metric which gives us our estimate of human
[00:09:23] which gives us our estimate of human performance here
[00:09:25] performance here still stripe startling progress
[00:09:28] still stripe startling progress and then finally adversarial nli is kind
[00:09:30] and then finally adversarial nli is kind of a response to that dynamic that looks
[00:09:32] of a response to that dynamic that looks like we're making lots of progress but
[00:09:34] like we're making lots of progress but we might worry that our systems are
[00:09:36] we might worry that our systems are benefiting from idiosyncrasies and
[00:09:38] benefiting from idiosyncrasies and artifacts and the data sets and that
[00:09:40] artifacts and the data sets and that they're not actually good at the human
[00:09:41] they're not actually good at the human kind the kind of human reasoning that
[00:09:43] kind the kind of human reasoning that we're truly trying to capture and that
[00:09:45] we're truly trying to capture and that gave rise to the adversarial nli project
[00:09:48] gave rise to the adversarial nli project the paper is nie et al which also
[00:09:50] the paper is nie et al which also involves some authors from earlier data
[00:09:52] involves some authors from earlier data sets snli and multi-nli
[00:09:54] sets snli and multi-nli it's another large data set a little bit
[00:09:56] it's another large data set a little bit smaller but um you'll see why it's
[00:09:58] smaller but um you'll see why it's special in some respects
[00:10:00] special in some respects the premises come from very diverse
[00:10:01] the premises come from very diverse sources we don't have the genre
[00:10:03] sources we don't have the genre overfitting you might get from snli
[00:10:05] overfitting you might get from snli and the hypotheses were again written by
[00:10:08] and the hypotheses were again written by crowd workers but here crucially they
[00:10:10] crowd workers but here crucially they were written not in the abstract but
[00:10:12] were written not in the abstract but rather with the goal of fooling
[00:10:14] rather with the goal of fooling state-of-the-art models that's the
[00:10:16] state-of-the-art models that's the adversarial part of this project
[00:10:19] adversarial part of this project this is a direct response to this
[00:10:21] this is a direct response to this feeling that results in findings for
[00:10:23] feeling that results in findings for snli and multi-nli while impressive
[00:10:25] snli and multi-nli while impressive might be overstating the extent to which
[00:10:27] might be overstating the extent to which we've made progress on the underlying
[00:10:29] we've made progress on the underlying task of common sense reasoning
[00:10:32] task of common sense reasoning so here's how the data set collection
[00:10:34] so here's how the data set collection worked in a little more detail this is a
[00:10:35] worked in a little more detail this is a fascinating dynamic the annotator was
[00:10:38] fascinating dynamic the annotator was presented with a premise sentence and
[00:10:40] presented with a premise sentence and one condition which would just
[00:10:41] one condition which would just correspond to the label that they want
[00:10:43] correspond to the label that they want to create
[00:10:44] to create they write a hypothesis
[00:10:46] they write a hypothesis and a state-of-the-art model makes a
[00:10:48] and a state-of-the-art model makes a prediction about the premise hypothesis
[00:10:50] prediction about the premise hypothesis pair basically predicting one of these
[00:10:52] pair basically predicting one of these three condition labels
[00:10:54] three condition labels if the model's prediction matches the
[00:10:55] if the model's prediction matches the condition the annotator returns to step
[00:10:58] condition the annotator returns to step two to try again with a new sentence
[00:11:00] two to try again with a new sentence if the model was fooled though the
[00:11:02] if the model was fooled though the premise hypothesis pair is independently
[00:11:04] premise hypothesis pair is independently validated so in this way we're kind of
[00:11:06] validated so in this way we're kind of guaranteed to get a lot of examples that
[00:11:08] guaranteed to get a lot of examples that are very hard for whatever model we have
[00:11:10] are very hard for whatever model we have in the loop in this process
[00:11:13] in the loop in this process here are some more details so it has three
[00:11:15] here are some more details so it has three rounds this data set for its first
[00:11:16] rounds this data set for its first release
[00:11:17] release um
[00:11:18] um overall that results in that large data
[00:11:20] overall that results in that large data set
[00:11:21] set and you can see that in subsequent
[00:11:23] and you can see that in subsequent rounds the model is going to be expanded
[00:11:26] rounds the model is going to be expanded to include previous rounds of data in
[00:11:28] to include previous rounds of data in addition possibly to other data
[00:11:29] addition possibly to other data resources
[00:11:31] resources and so what we're hoping is that as we
[00:11:33] and so what we're hoping is that as we progress through these rounds these
[00:11:34] progress through these rounds these examples are going to get harder and
[00:11:36] examples are going to get harder and harder in virtue of the fact that the
[00:11:38] harder in virtue of the fact that the model is trained on more data and is
[00:11:40] model is trained on more data and is getting better as a result of seeing all
[00:11:43] getting better as a result of seeing all these adversarial examples
[00:11:45] these adversarial examples in terms of the splits the train set is
[00:11:47] in terms of the splits the train set is a mix of cases where the model's
[00:11:49] a mix of cases where the model's predictions were correct and where it
[00:11:50] predictions were correct and where it was incorrect because sometimes in that
[00:11:52] was incorrect because sometimes in that loop the annotator was unable to fool
[00:11:54] loop the annotator was unable to fool the model after some specified number of
[00:11:56] the model after some specified number of attempts and we keep those examples
[00:11:58] attempts and we keep those examples because they're nonetheless interesting
[00:11:59] because they're nonetheless interesting training data however
[00:12:01] training data however um
[00:12:02] um in the dev and test sets we have only
[00:12:05] in the dev and test sets we have only examples that fooled the model so
[00:12:07] examples that fooled the model so with respect to the best model for each
[00:12:09] with respect to the best model for each round the test set is as adversarial as
[00:12:12] round the test set is as adversarial as it could possibly get the model has
[00:12:14] it could possibly get the model has gotten every single example wrong
[00:12:17] gotten every single example wrong and adversarial nli is exciting because
[00:12:19] and adversarial nli is exciting because it's given rise to a whole movement
[00:12:21] it's given rise to a whole movement around creating adversarial data sets
[00:12:23] around creating adversarial data sets and that's represented by this open
[00:12:24] and that's represented by this open source project dynabench
[00:12:27] source project dynabench we just recently published a paper
[00:12:29] we just recently published a paper that's on the dynabench effort reporting
[00:12:31] that's on the dynabench effort reporting on a bunch of tasks that are going to
[00:12:33] on a bunch of tasks that are going to use approximately adversarial nli-like
[00:12:35] use approximately adversarial nli-like techniques to develop data sets that are
[00:12:38] techniques to develop data sets that are adversarial in lots of domains and
[00:12:39] adversarial in lots of domains and you've actually seen one of these in the
[00:12:41] you've actually seen one of these in the dynasent
[00:12:42] dynasent data set from our previous unit on
[00:12:44] data set from our previous unit on sentiment analysis
[00:12:47] sentiment analysis and here's the dynabench interface and i
[00:12:49] and here's the dynabench interface and i guess i'm just exhorting you if you
[00:12:50] guess i'm just exhorting you if you would like to get involved in this
[00:12:51] would like to get involved in this effort it's a community-wide thing to
[00:12:53] effort it's a community-wide thing to develop better benchmarks that are going
[00:12:55] develop better benchmarks that are going to get us closer to assessing how much
[00:12:57] to get us closer to assessing how much progress we're actually making
[00:13:00] progress we're actually making and then finally there are a lot of
[00:13:01] and then finally there are a lot of other nli data sets that i didn't
[00:13:03] other nli data sets that i didn't mention so let me just run through
[00:13:05] mention so let me just run through these the glue benchmark has a lot of
[00:13:07] these the glue benchmark has a lot of nli tasks in it
[00:13:09] nli tasks in it as does super glue which is its
[00:13:11] as does super glue which is its successor
[00:13:12] successor i mentioned before in the context of
[00:13:13] i mentioned before in the context of anli this nli style fever data set fever
[00:13:16] anli this nli style fever data set fever is fact verification and they've just
[00:13:18] is fact verification and they've just translated the examples into nli ones
[00:13:21] translated the examples into nli ones here's an nli corpus for
[00:13:24] here's an nli corpus for chinese and here's one for turkish the
[00:13:26] chinese and here's one for turkish the chinese examples are all
[00:13:28] the chinese examples are all original and the turkish one is a
[00:13:30] original and the turkish one is a translation with validation
[00:13:32] translation with validation of snli and multinli into turkish
[00:13:36] of snli and multinli into turkish xnli is a bunch of assessment data sets
[00:13:38] xnli is a bunch of assessment data sets that is dev test splits for more than a
[00:13:41] that is dev test splits for more than a dozen languages
[00:13:42] dozen languages drawing on the multinli examples
[00:13:44] drawing on the multinli examples those are human-created translations
[00:13:46] those are human-created translations that could be used to benchmark
[00:13:48] that could be used to benchmark multilingual nli systems
[00:13:50] multilingual nli systems and then there are a few others down
[00:13:52] and then there are a few others down here kind of aimed at getting
[00:13:53] here kind of aimed at getting genre diversity and then nli for
[00:13:56] genre diversity and then nli for specialized domains uh here's medicine
[00:13:58] specialized domains uh here's medicine and science and those could be
[00:14:00] and science and those could be interesting for seeing how well a model
[00:14:01] interesting for seeing how well a model can grapple with variation that comes in
[00:14:04] can grapple with variation that comes in very specific and maybe technical
[00:14:05] very specific and maybe technical domains
[00:14:06] domains so there's a wide world of tasks you can
[00:14:08] so there's a wide world of tasks you can explore and i think that makes nli a
[00:14:10] explore and i think that makes nli a really exciting space in which to
[00:14:12] really exciting space in which to develop original systems and projects
[00:14:14] develop original systems and projects and so forth
Lecture 035
Adversarial Testing | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=qLuAeFdbass
---
Transcript
[00:00:04] welcome back everyone this is part three
[00:00:06] welcome back everyone this is part three in our series on nli this is our chance
[00:00:08] in our series on nli this is our chance to start getting a little introspective
[00:00:10] to start getting a little introspective and think more about developing truly
[00:00:12] and think more about developing truly robust systems so our topic is going to
[00:00:14] robust systems so our topic is going to be
[00:00:15] be data set artifacts and adversarial testing
[00:00:18] data set artifacts and adversarial testing i'm pleased to report that a lot of the
[00:00:20] i'm pleased to report that a lot of the discussion in this area actually traces
[00:00:22] discussion in this area actually traces in the nli context to a course project
[00:00:25] in the nli context to a course project for this class in 2016 landon kesselman
[00:00:28] for this class in 2016 landon kesselman observed that for nli hypothesis only
[00:00:31] observed that for nli hypothesis only models were surprisingly strong and what
[00:00:33] models were surprisingly strong and what i mean by hypothesis only model is that
[00:00:35] i mean by hypothesis only model is that these are models that literally throw
[00:00:37] these are models that literally throw away the premise text and reason
[00:00:39] away the premise text and reason entirely in terms of the hypothesis so
[00:00:42] entirely in terms of the hypothesis so to the extent that models can succeed
[00:00:44] to the extent that models can succeed despite having no information about the
[00:00:46] despite having no information about the premise we might really worry about
[00:00:48] premise we might really worry about whether we're solving the task we think
[00:00:50] whether we're solving the task we think we're solving because after all nli is
[00:00:52] we're solving because after all nli is supposed to be about the reasoning
[00:00:54] supposed to be about the reasoning relationship between premise and
[00:00:55] relationship between premise and hypothesis and you would not expect it
[00:00:58] hypothesis and you would not expect it to be successful if you were given only
[00:01:00] to be successful if you were given only the hypothesis but what landon had observed
[00:01:03] the hypothesis but what landon had observed is that these models were surprisingly
[00:01:05] is that these models were surprisingly strong
[00:01:06] strong subsequently and i think partly
[00:01:07] subsequently and i think partly independently a number of other groups
[00:01:09] independently a number of other groups made the same sort of observation about
[00:01:12] made the same sort of observation about a variety of nli benchmarks
[00:01:14] a variety of nli benchmarks and that kind of leads us to the
[00:01:15] and that kind of leads us to the conclusion that at least for snli
[00:01:18] conclusion that at least for snli averaging across a whole lot of
[00:01:19] averaging across a whole lot of different systems
[00:01:20] different systems hypothesis only baselines are typically
[00:01:23] hypothesis only baselines are typically operating in the range of about 65 to 70%
[00:01:26] operating in the range of about 65 to 70% accuracy again that's eye opening
[00:01:28] accuracy again that's eye opening because chance performance would be 33%
[00:01:31] because chance performance would be 33% and this is showing us that there is
[00:01:33] and this is showing us that there is some unusual bias in the hypothesis that
[00:01:36] some unusual bias in the hypothesis that is allowing us to neglect the premise
[00:01:38] is allowing us to neglect the premise and still have a lot of predictive
[00:01:40] and still have a lot of predictive capacity
[00:01:42] capacity the reason for this is likely due to
[00:01:45] the reason for this is likely due to what we're going to call artifacts in
[00:01:46] what we're going to call artifacts in these data sets and just for a few
[00:01:48] these data sets and just for a few examples here we can observe that
[00:01:50] examples here we can observe that specific claims are likely to be
[00:01:52] specific claims are likely to be premises in entailment cases and
[00:01:54] premises in entailment cases and correspondingly general claims are
[00:01:55] correspondingly general claims are likely to be hypotheses in entailment
[00:01:58] likely to be hypotheses in entailment cases just think at the lexical level if
[00:02:00] cases just think at the lexical level if you have a pair like turtle and animal
[00:02:03] you have a pair like turtle and animal you know you have a very specific term
[00:02:05] you know you have a very specific term in the premise and a very general term
[00:02:06] in the premise and a very general term in the hypothesis that's an entailment
[00:02:09] in the hypothesis that's an entailment relation and if you did just drop off
[00:02:11] relation and if you did just drop off the premise and looked only at the
[00:02:12] the premise and looked only at the hypothesis animal you might still have a
[00:02:15] hypothesis animal you might still have a pretty good guess that that's going to
[00:02:16] pretty good guess that that's going to be an entailment case in virtue of the
[00:02:18] be an entailment case in virtue of the generality of the second term
[00:02:21] generality of the second term and relatedly specific claims are likely
[00:02:23] and relatedly specific claims are likely to lead to contradiction a common
[00:02:25] to lead to contradiction a common strategy for creating a contradiction
[00:02:27] strategy for creating a contradiction pair is to just make sure you have two
[00:02:29] pair is to just make sure you have two sentences that exclude each other in
[00:02:31] sentences that exclude each other in virtue of being very specific
[00:02:34] virtue of being very specific so it's in virtue of patterns like this
[00:02:36] so it's in virtue of patterns like this that a system denied the premise can
[00:02:38] that a system denied the premise can nonetheless succeed at about 65 to 70%
[00:02:41] nonetheless succeed at about 65 to 70% accuracy
[00:02:44] let's get a little more precise about
[00:02:46] let's get a little more precise about what we mean by an artifact because i
[00:02:47] what we mean by an artifact because i think we need to think about this in a
[00:02:49] think we need to think about this in a nuanced way so my definition
[00:02:52] nuanced way so my definition is that a data set artifact is a bias
[00:02:55] is that a data set artifact is a bias that would make a system susceptible to
[00:02:57] that would make a system susceptible to adversarial attack even if the bias is
[00:03:00] adversarial attack even if the bias is linguistically motivated and let me give
[00:03:02] linguistically motivated and let me give you an example that brings out the
[00:03:03] you an example that brings out the nuance there
[00:03:05] nuance there consider negated hypotheses tending to
[00:03:07] consider negated hypotheses tending to signal contradiction
[00:03:09] signal contradiction this is a very natural thing
[00:03:11] this is a very natural thing linguistically if you give me a sentence
[00:03:13] linguistically if you give me a sentence like the dog barked and you asked me to
[00:03:15] like the dog barked and you asked me to construct a contradictory sentence it's
[00:03:17] construct a contradictory sentence it's very natural for me to say the dog
[00:03:19] very natural for me to say the dog didn't bark by simply inserting a
[00:03:20] didn't bark by simply inserting a negation
[00:03:21] negation so it's not a surprise that this happens
[00:03:23] so it's not a surprise that this happens it's linguistically motivated negation
[00:03:25] it's linguistically motivated negation is our best way of establishing the
[00:03:27] is our best way of establishing the relevant kind of connection
[00:03:29] relevant kind of connection however here's the reason that this is
[00:03:31] however here's the reason that this is an artifact we could easily curate a
[00:03:33] an artifact we could easily curate a data set
[00:03:34] data set in which negation correlated with the
[00:03:36] in which negation correlated with the other labels
[00:03:38] other labels but nonetheless this led to no human
[00:03:39] but nonetheless this led to no human confusion because humans are not really
[00:03:41] confusion because humans are not really operating at this general level of a
[00:03:43] operating at this general level of a data set bias they're thinking about
[00:03:45] data set bias they're thinking about individual examples and what they mean
[00:03:47] individual examples and what they mean but we know that our systems are going
[00:03:49] but we know that our systems are going to be very sensitive to the
[00:03:51] to be very sensitive to the distributions of things in their
[00:03:52] distributions of things in their training data and that's the sense in
[00:03:54] training data and that's the sense in which we can be adversarial in this way
[00:03:56] which we can be adversarial in this way and expose that a system has overfit to
[00:03:59] and expose that a system has overfit to a certain kind of regularity
[00:04:02] a certain kind of regularity here are some known artifacts this is by
[00:04:04] here are some known artifacts this is by no means exhaustive but i think this
[00:04:05] no means exhaustive but i think this will give you a sense for the kind of
[00:04:07] will give you a sense for the kind of things that you want to look out for in
[00:04:09] things that you want to look out for in nli context but also generally in
[00:04:11] nli context but also generally in dealing with problems in nlu
[00:04:14] dealing with problems in nlu so it's been observed that these data
[00:04:15] so it's been observed that these data sets contain words whose appearance
[00:04:17] sets contain words whose appearance nearly perfectly correlates with
[00:04:18] nearly perfectly correlates with specific labels and what i mean here is
[00:04:20] specific labels and what i mean here is just randomly chosen words like cat and
[00:04:22] just randomly chosen words like cat and dog
[00:04:23] dog the reason for this is probably that
[00:04:25] the reason for this is probably that crowd workers were producing a lot of
[00:04:26] crowd workers were producing a lot of examples and they fell into a pattern of
[00:04:29] examples and they fell into a pattern of making specific lexical choices and that
[00:04:31] making specific lexical choices and that created a spurious bias for one label or
[00:04:34] created a spurious bias for one label or another
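The word-label correlations just mentioned can be surfaced with a very simple probe: for each hypothesis word, measure how skewed the label distribution of its examples is. This is a rough sketch of that idea under toy assumptions, not the probing methodology from any particular paper.

```python
from collections import Counter, defaultdict

# Simple artifact probe: for each hypothesis word, how concentrated is
# the label distribution of the examples it appears in? Words with
# near-1.0 skew are candidates for spurious lexical artifacts.

def label_skew_by_word(examples, min_count=2):
    """examples: iterable of (hypothesis, label) pairs.

    Returns {word: (top_label, fraction)} where fraction is the share of
    that word's examples carrying top_label. Rare words are dropped.
    """
    counts = defaultdict(Counter)
    for hypothesis, label in examples:
        for word in set(hypothesis.lower().split()):
            counts[word][label] += 1
    skew = {}
    for word, label_counts in counts.items():
        total = sum(label_counts.values())
        if total < min_count:
            continue
        top_label, top_n = label_counts.most_common(1)[0]
        skew[word] = (top_label, top_n / total)
    return skew
```

On a real data set one would also correct for chance co-occurrence (e.g. with PMI), but even raw skew exposes words like "cat" or "dog" pairing almost perfectly with one label.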
[00:04:35] another entailment hypotheses over represent
[00:04:38] entailment hypotheses over represent general and approximating words we've
[00:04:39] general and approximating words we've seen that that's systematic in terms of
[00:04:41] seen that that's systematic in terms of the linguistic patterns but it's an
[00:04:43] the linguistic patterns but it's an artifact because the world needn't be
[00:04:45] artifact because the world needn't be this way for humans to succeed in making
[00:04:48] this way for humans to succeed in making their predictions
[00:04:50] their predictions neutral hypotheses often introduce
[00:04:51] neutral hypotheses often introduce modifiers this was a way that workers
[00:04:54] modifiers this was a way that workers found to create statements that
[00:04:56] found to create statements that excluded each other with simple
[00:04:57] excluded each other with simple modifications
[00:04:59] modifications contradiction hypotheses over represent
[00:05:02] contradiction hypotheses over represent negation we've seen that and that makes
[00:05:03] negation we've seen that and that makes sense
[00:05:04] sense and neutral hypotheses tend to be longer
[00:05:06] and neutral hypotheses tend to be longer and that last one is yet again another
[00:05:08] and that last one is yet again another case where
[00:05:09] case where that's the sort of regularity that our
[00:05:11] that's the sort of regularity that our systems are going to be very good at
[00:05:12] systems are going to be very good at picking up on but that humans will
[00:05:14] picking up on but that humans will probably not make direct use of in
[00:05:16] probably not make direct use of in making predictions of their own and in
[00:05:18] making predictions of their own and in that way we can easily leverage this
[00:05:20] that way we can easily leverage this observation to create an adversarial
[00:05:23] observation to create an adversarial setting for one of our models where
[00:05:25] setting for one of our models where humans succeed at the task but our models
[00:05:27] humans succeed at the task but our models suffer because they're cued into the
[00:05:29] suffer because they're cued into the wrong aspects of the underlying problem
[00:05:33] wrong aspects of the underlying problem to close this out i just want to
[00:05:34] to close this out i just want to emphasize that artifacts are discussed a
[00:05:36] emphasize that artifacts are discussed a lot in the context of nli i think that's
[00:05:38] lot in the context of nli i think that's because a bunch of big benchmarks
[00:05:40] because a bunch of big benchmarks appeared and people did a lot of probing
[00:05:42] appeared and people did a lot of probing work to understand them but it does not
[00:05:44] work to understand them but it does not follow that nli is the only task where
[00:05:46] follow that nli is the only task where the data sets have artifacts in fact i
[00:05:48] the data sets have artifacts in fact i would venture that every task that we
[00:05:50] would venture that every task that we work on has associated data sets that
[00:05:52] work on has associated data sets that suffer from artifacts here i've given
[00:05:54] suffer from artifacts here i've given you a sample from prominent nlu problems
[00:05:57] you a sample from prominent nlu problems where people have found artifacts that
[00:05:59] where people have found artifacts that are similar to the ones that i just
[00:06:00] are similar to the ones that i just covered for nli and the overall lesson
[00:06:02] covered for nli and the overall lesson is clear whatever problem you're working
[00:06:05] is clear whatever problem you're working on you should think critically about
[00:06:06] on you should think critically about your data and how idiosyncrasies in that
[00:06:09] your data and how idiosyncrasies in that data might be affecting system
[00:06:11] data might be affecting system performance and creating a distance
[00:06:13] performance and creating a distance between what you think you're doing in
[00:06:14] between what you think you're doing in terms of problem solving and what your
[00:06:16] terms of problem solving and what your system is actually doing
[00:06:18] system is actually doing and one way we can expose this is via
[00:06:21] and one way we can expose this is via efforts involving adversarial testing
[00:06:23] efforts involving adversarial testing we're going to have a whole discussion
[00:06:24] we're going to have a whole discussion of this later in the quarter but i do
[00:06:26] of this later in the quarter but i do want to plant the idea here in
[00:06:28] want to plant the idea here in adversarial testing we're going to take
[00:06:30] adversarial testing we're going to take standard nli examples like this and
[00:06:32] standard nli examples like this and modify them in ways that expose that
[00:06:34] modify them in ways that expose that systems have surprising gaps
[00:06:37] systems have surprising gaps so a prominent early example of this is
[00:06:39] so a prominent early example of this is this wonderful paper by glockner at all
[00:06:41] this wonderful paper by glockner at all called breaking nli
[00:06:42] called breaking nli what they did is very simple we're going
[00:06:44] what they did is very simple we're going to operate with actual snli examples
[00:06:47] to operate with actual snli examples here we would start with the premise a
[00:06:48] here we would start with the premise a little girl is kneeling in the dirt
[00:06:49] little girl is kneeling in the dirt crying
[00:06:50] crying and a little girl is very sad which is
[00:06:52] and a little girl is very sad which is an actual snli example with the
[00:06:54] an actual snli example with the entailment relation
[00:06:56] entailment relation we'll fix the premise what they did to
[00:06:57] we'll fix the premise what they did to create their adversarial data set is
[00:06:59] create their adversarial data set is simply swap out sad for unhappy that is
[00:07:01] simply swap out sad for unhappy that is to create pairs that are differ only
[00:07:03] to create pairs that are differ only according to the synonyms that they use
[00:07:06] according to the synonyms that they use this is only mildly adversarial but you
[00:07:08] this is only mildly adversarial but you can see why this might be difficult
[00:07:10] can see why this might be difficult unhappy contains a negation and we might
[00:07:12] unhappy contains a negation and we might have a hypothesis that systems will then
[00:07:14] have a hypothesis that systems will then predict contradiction because they've
[00:07:16] predict contradiction because they've overfit to the association of negation
[00:07:18] overfit to the association of negation with contradiction and that's exactly
[00:07:20] with contradiction and that's exactly the kind of pattern that glockner et
[00:07:22] the kind of pattern that glockner et al found and the overall takeaway here
[00:07:24] all found and the overall takeaway here is that these systems that we've
[00:07:26] is that these systems that we've developed are not behaving
[00:07:28] developed are not behaving systematically in the sense of cognitive
[00:07:30] systematically in the sense of cognitive science they're not behaving according
[00:07:32] science they're not behaving according to the kinds of underlying intuitions
[00:07:34] to the kinds of underlying intuitions that human language users have their
[00:07:36] that human language users have their patterns of errors are more surprising
[00:07:38] patterns of errors are more surprising and idiosyncratic leading us to worry
[00:07:40] and idiosyncratic leading us to worry about their true capacity to generalize
[00:07:43] about their true capacity to generalize let me show you one other example this
[00:07:44] let me show you one other example this is from neon and it involves syntactic
[00:07:46] is from neon and it involves syntactic variations so in this case we'll take
[00:07:48] variations so in this case we'll take actual examples and fix the hypothesis
[00:07:51] actual examples and fix the hypothesis but now vary the premise so this is one
[00:07:53] but now vary the premise so this is one of their manipulations where they simply
[00:07:55] of their manipulations where they simply swap the subject and object our original
[00:07:58] swap the subject and object our original premise is a woman is pulling a child on
[00:08:00] premise is a woman is pulling a child on a sled in the snow that entails a child
[00:08:02] a sled in the snow that entails a child is sitting on a sled in the snow that's
[00:08:04] is sitting on a sled in the snow that's good we would expect that if we modify
[00:08:07] good we would expect that if we modify the premise so that the subject is a
[00:08:08] the premise so that the subject is a child and the object is a woman this
[00:08:11] child and the object is a woman this would create the neutral label with
[00:08:12] would create the neutral label with respect to our fixed hypothesis
[00:08:15] respect to our fixed hypothesis but perhaps unsurprisingly need all
[00:08:17] but perhaps unsurprisingly need all found that many systems were not
[00:08:18] found that many systems were not sensitive to this change in the premise
[00:08:21] sensitive to this change in the premise revealing that they were mainly making
[00:08:23] revealing that they were mainly making use of the bag of words so to speak and
[00:08:25] use of the bag of words so to speak and not truly attuned to the syntactic
[00:08:27] not truly attuned to the syntactic structure and that's so productive
[00:08:29] structure and that's so productive because that specific lesson might lead
[00:08:32] because that specific lesson might lead us to go looking for systems that are
[00:08:34] us to go looking for systems that are more sensitive to syntax and therefore
[00:08:36] more sensitive to syntax and therefore would not be susceptible to this kind of
[00:08:38] would not be susceptible to this kind of adversary and in that kind of productive
[00:08:40] adversary and in that kind of productive back and forth i think we can
[00:08:42] back and forth i think we can triangulate on even better systems for
[00:08:44] triangulate on even better systems for the next generation
Lecture 036
Modeling Strategies | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=T-ryhSTeXpM
---
Transcript
[00:00:04] welcome everyone this is part four in
[00:00:06] welcome everyone this is part four in our series on natural language inference
[00:00:07] our series on natural language inference we're going to be talking about
[00:00:08] we're going to be talking about different modeling strategies you might
[00:00:10] different modeling strategies you might think of this screencast as a companion
[00:00:12] think of this screencast as a companion to the associated notebook that explores
[00:00:14] to the associated notebook that explores a lot of different modeling ideas and
[00:00:16] a lot of different modeling ideas and suggests some architectural variations
[00:00:18] suggests some architectural variations that you might explore yourself
[00:00:20] that you might explore yourself i thought we would start by just
[00:00:21] i thought we would start by just considering hand-built feature functions
[00:00:23] considering hand-built feature functions because they can still be quite powerful
[00:00:25] because they can still be quite powerful for the nli problem
[00:00:27] for the nli problem so some standard hand built feature
[00:00:29] so some standard hand built feature ideas include as a kind of baseline word
[00:00:32] ideas include as a kind of baseline word overlap between the premise and
[00:00:33] overlap between the premise and hypothesis this is giving you a kind of
[00:00:36] hypothesis this is giving you a kind of pretty small feature space
[00:00:38] pretty small feature space essentially measuring how the premise
[00:00:40] essentially measuring how the premise and hypothesis are alike
[00:00:43] and hypothesis are alike if you want to step it up a level you
[00:00:44] if you want to step it up a level you could consider the word cross product
[00:00:46] could consider the word cross product features which you just consider as its
[00:00:48] features which you just consider as its feature space
[00:00:49] feature space every pairing of a word in the premise
[00:00:51] every pairing of a word in the premise and a word in the hypothesis this will
[00:00:53] and a word in the hypothesis this will give you a massive feature space very
[00:00:56] give you a massive feature space very large very sparse but the intuition
[00:00:59] large very sparse but the intuition behind it might be that you're allowing
[00:01:00] behind it might be that you're allowing your model a chance to discover
[00:01:02] your model a chance to discover points of alignment and disalignment
[00:01:04] points of alignment and disalignment between the premise and the hypothesis
[00:01:06] between the premise and the hypothesis and so that could be very powerful
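The word-overlap and word-cross-product feature spaces just described are easy to write down as sparse feature dictionaries. A minimal sketch (feature-name formats are my own):

```python
# Sketches of the two feature functions discussed above, emitting sparse
# {feature_name: value} dicts of the kind a linear classifier consumes.

def word_overlap_features(premise, hypothesis):
    """One feature per word shared by premise and hypothesis."""
    shared = set(premise.lower().split()) & set(hypothesis.lower().split())
    return {f"overlap={w}": 1.0 for w in shared}

def word_cross_product_features(premise, hypothesis):
    """Every (premise word, hypothesis word) pair: huge, sparse space
    that lets the model discover points of alignment and disalignment."""
    return {f"{p}__{h}": 1.0
            for p in premise.lower().split()
            for h in hypothesis.lower().split()}
```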
[00:01:09] and so that could be very powerful you might also consider additional word
[00:01:11] you might also consider additional word net relations bringing those in
[00:01:13] net relations bringing those in these would be things like entailment
[00:01:15] these would be things like entailment and contradiction and autonomy synonymy
[00:01:18] and contradiction and autonomy synonymy and those of course could be nicely
[00:01:19] and those of course could be nicely keyed into the underlying logic of the
[00:01:21] keyed into the underlying logic of the nli problem
[00:01:23] Edit distance is another common feature, just a raw float value between premise and hypothesis as a kind of high-level way of comparing those two texts.
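Here is a standard dynamic-programming sketch of Levenshtein edit distance, which works equally well over characters or over token lists:

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]
```

Applied to tokenized texts, `edit_distance(["every", "dog", "danced"], ["every", "poodle", "moved"])` counts word-level edits, which is the premise-to-hypothesis comparison suggested above.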
[00:01:33] Word differences might be a nice juxtaposition with word overlap: you could be considering ways in which the premise and hypothesis contrast with each other in that feature space.
[00:01:44] And we can also move to alignment-based features. I mentioned that the word cross product is kind of an attempt to have the model learn points of alignment between the premise and hypothesis, but of course we could also do some of that work even before we begin to learn feature weights, trying to figure out which pieces in the premise correspond to which pieces in the hypothesis.
[00:02:03] We could consider negation. We've seen, of course, that that's an important indicator in a lot of NLI datasets, and there's a powerful intuition behind that, especially as it pertains to contradiction. So maybe we would write some feature functions that are explicitly keyed into the presence or absence of negation in various spots in the premise and hypothesis.
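A sketch of what such negation feature functions might look like; the cue-word list and the feature names are illustrative, and a real system would need to handle tokenization of forms like "didn't":

```python
# Illustrative negation cues; real systems would want a richer inventory.
NEGATIONS = {"not", "no", "never", "n't", "nobody", "nothing"}

def negation_features(premise, hypothesis):
    p_neg = sum(1 for w in premise if w in NEGATIONS)
    h_neg = sum(1 for w in hypothesis if w in NEGATIONS)
    return {
        "premise_negations": p_neg,
        "hypothesis_negations": h_neg,
        # A parity mismatch is a crude signal for contradiction.
        "negation_mismatch": int(p_neg % 2 != h_neg % 2),
    }

f = negation_features(["a", "dog", "danced"], ["no", "dog", "danced"])
```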
[00:02:23] And we could also step that up a level and consider, more generally, all kinds of interesting quantifier relationships that would hold, possibly at the level of an alignment, as in item six here, between the premise and hypothesis. This is kind of keying into the underlying logic of the NLI problem.
[00:02:41] And then finally, named entity recognition. We've seen that these features might be important in figuring out which entities co-refer across the premise and hypothesis, and so having some devices for figuring that out could be useful as a kind of low-level grounding for your system.
[00:03:00] Now let's move into a mode that's more like the deep learning mode, because, as we saw earlier in the screencast series, these models have proven at this point to be the most powerful models for the NLI problem. So it's productive to think also about different deep learning architectures, and I'd like to start with what I've called here sentence-encoding models. The most basic form of that would return to the idea of distributed representations as features.
[00:03:25] So the idea here is that we have, in this diagram, the premise and the hypothesis: the premise is "every dog danced" and the hypothesis is "every poodle moved". Our approach using distributed representations, the simplest one, would be that we simply look up all of those words in some fixed embedding space, which could be a GloVe embedding space, for example. Then we separately encode the premise and hypothesis by, for example, taking the sum or average of the vectors in each of those two texts, and that gives us vectors xp and xh. Then we might concatenate those two, or do some other kind of comparison like difference or max or mean, to get a single fixed-dimensional representation x that is then the input to what could be a simple softmax classifier.
[00:04:14] So all we've done here is take our old approach, using distributed representations as features, and move it into the NLI problem, where we have both the premise and the hypothesis. I've called this a sentence-encoding model because we are separately encoding the two sentences, and then the model is going to learn, we hope, something about how those two representations interact.
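A minimal sketch of that pipeline, with random vectors standing in for a real GloVe lookup (the vocabulary, dimensionality, and variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a fixed embedding space like GloVe (50d here).
vocab = ["every", "dog", "danced", "poodle", "moved"]
lookup = {w: rng.normal(size=50) for w in vocab}

def encode(text):
    # Sum the fixed word vectors; mean is another common choice.
    return np.sum([lookup[w] for w in text], axis=0)

xp = encode(["every", "dog", "danced"])     # premise vector
xh = encode(["every", "poodle", "moved"])   # hypothesis vector
x = np.concatenate([xp, xh])                # input to a softmax classifier
```

The resulting `x` is the single fixed-dimensional representation that would be handed to the classifier; swapping concatenation for difference or elementwise max is a one-line change.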
[00:04:36] On this slide and the next, I've given a complete recipe for doing exactly what I just described. I'm not going to linger over it here, because it's also in the notebooks, and it just shows you how, using our course code, it can be relatively easy to set up models like this. Most of the code is devoted to the low-level processing of the words into their embedding space.
[00:04:58] Here's the rationale for sentence-encoding models, which I think is kind of interesting: we might want to encode the premise and hypothesis separately in order to give the model a chance to find rich, abstract relationships between them.
[00:05:13] The sentence-encoding approach might also facilitate transfer to other kinds of tasks. To the extent that we are separately encoding the two sentences, we might have sentence-level representations that are useful even for problems that don't fit into the specific NLI mode of having a single premise and a single hypothesis for the sake of classification. That could be an important part of the vision from Dagan et al. that NLI is a kind of source of effective pre-training for more general problems.
[00:05:42] Let's move to a more complex model. We'll follow the same narrative that we've used before. We just had that simple fixed model that was going to combine the premise and hypothesis via some fixed function like sum or average or max; here we're going to have functions that learn how those interactions should happen, but we're going to follow the sentence-encoding mode. So I have our same example, "every dog danced" and "every poodle moved", and the idea is that each one of those is processed by its own separate recurrent neural network. I've indicated in green that, although these two models have the same structure, these are different parameters for the premise and hypothesis, so they function separately. Then, in the simplest approach, we would take the final hidden representation from each of those and combine them somehow, probably via concatenation, and that would be the input to the final classifier layer or layers that actually learn the NLI problem.
[00:06:36] So it's a sentence-encoding approach in the sense that the two final hidden states are taken to be separate summary representations of the premise and hypothesis respectively, and we have a vision that those representations might be independently useful even outside of the NLI context.
[00:06:53] Now, in the associated notebook, nli_02_models, there are a bunch of different implementations, including a full PyTorch implementation, using our PyTorch-based classes, of the sentence-encoding RNN approach that I just described. I thought I would briefly give you a high-level overview of how that modeling approach works, because there are actually just a few moving pieces, and the rest is kind of low-level implementation detail.
[00:07:20] So the first thing that you need to do is modify the dataset class so that, conceptually, it creates lists of paired examples with their lengths and their associated labels. By default, the underlying code that we're using expects one sequence of tokens, one length, and one label, and here we just need to raise that up so that we have two of each. As you can see here, "every dog danced" and "every poodle moved" both happen to have length three, and their label is entailment. So we make some changes to the dataset class to accommodate that change in format, essentially.
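Conceptually, the "raised up" example format might look like the following; the exact structure in the course code may differ, so treat this as a sketch:

```python
# Each example: a pair of token sequences, a pair of lengths, one label.
examples = [
    ((["every", "dog", "danced"], ["every", "poodle", "moved"]),
     (3, 3),
     "entailment"),
]

(prem, hyp), (prem_len, hyp_len), label = examples[0]
```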
[00:07:53] Then the core model for this is conceptually just two RNNs, and the forward method essentially brings those two pieces together and feeds them to the subsequent classifier layers. That's conceptually very natural, and it just comes down to having two separate RNNs that you implement using the raw materials that are already there in the code.
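A minimal PyTorch sketch of that core model; the class name, layer choices, and dimensions are illustrative, not the course's exact code:

```python
import torch
import torch.nn as nn

class SentenceEncoderRNN(nn.Module):
    """Two separate RNNs (same structure, different parameters) whose
    final hidden states are concatenated and fed to a classifier."""
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=50, n_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn_premise = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.rnn_hypothesis = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim * 2, n_classes)

    def forward(self, premise_ids, hypothesis_ids):
        _, (hp, _) = self.rnn_premise(self.embedding(premise_ids))
        _, (hh, _) = self.rnn_hypothesis(self.embedding(hypothesis_ids))
        x = torch.cat([hp[-1], hh[-1]], dim=1)  # [batch, 2 * hidden]
        return self.classifier(x)

model = SentenceEncoderRNN(vocab_size=10)
logits = model(torch.tensor([[1, 2, 3]]), torch.tensor([[1, 4, 5]]))
```

The forward method is exactly the "bring the two pieces together" step described above: everything else is bookkeeping.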
[00:08:16] And then finally, for the actual interface, the TorchRNNSentenceEncoderClassifier, this is basically unchanged, with the one modification that you need to change the predict_proba method, the fundamental method for prediction, because it too needs to deal with the different dataset format that we established up here. That is again a kind of low-level change. So what I hope you're seeing is that the first and third steps are kind of managing the data, and the middle step is the one that actually modifies the computation graph. But that step is very intuitive, because we're basically just reflecting in code our idea that we have separate RNNs for the premise and hypothesis.
[00:08:55] And then finally, I just want to mention that a common approach you see, especially in the early literature, is a sentence-encoding tree NN that has exactly the same intuition behind it as the RNNs we just looked at, except that the premise and hypothesis are processed by tree-structured recursive neural networks. Since the underlying datasets often have full parse representations, this is an avenue that you could explore. It can be tricky to implement these efficiently, but conceptually it's a very natural thing: you just repeatedly apply a dense layer at every one of the constituent nodes, on up to a final representation for each sentence, which is then fed into the classifier layer in just the way we've done for the previous models.
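A minimal sketch of that recursive composition over a binary parse, with random parameters and a toy vocabulary; everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
lookup = {w: rng.normal(size=d) for w in ["every", "dog", "danced"]}
# One dense layer reused at every constituent node (a recursive NN).
W = rng.normal(size=(d, 2 * d)) * 0.01
b = np.zeros(d)

def compose(left, right):
    return np.tanh(W @ np.concatenate([left, right]) + b)

def encode_tree(tree):
    """tree is either a word or a (left, right) pair of subtrees."""
    if isinstance(tree, str):
        return lookup[tree]
    return compose(encode_tree(tree[0]), encode_tree(tree[1]))

# (every (dog danced)) as a binary parse of the premise
root = encode_tree(("every", ("dog", "danced")))
```

The root vector plays the same role as the RNN's final hidden state: one such vector per sentence, then concatenate and classify as before.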
[00:09:42] So those are the sentence-encoding RNNs. Now let's move to a different vision. I've called these chained models because they're going to just mush together the premise and hypothesis, as opposed to separately encoding the two.
[00:09:54] Of course, the simplest thing we could do in the chained mode would be to essentially ignore the fact that we have two texts, the premise and the hypothesis, and just feed them in as one long sequence into a standard recurrent neural network. Since that involves no changes to any of the code we've been using for RNN classifiers, it seems like a pretty natural baseline, and so that's depicted here. And actually, this can be surprisingly effective.
[00:10:20] The rationale for doing this: in this context, you could say that the premise is simply establishing context for processing the hypothesis, and that seems like a very natural notion of conditioning on one text as you process the second one. Correspondingly, at the level of human language processing, this might actually correspond to something that we do as we read through a premise-hypothesis text and figure out what the relationship is.
[00:10:46] And here's a simple recipe for doing this. The one change from the diagram that you might think about is that, when representing the examples, I did flatten them out, of course, but I also inserted this boundary marker, which would at least give the model a chance to learn that there is a separation happening, some kind of transition between the premise and hypothesis. But that's just at the level of featurization, and in terms of modeling you hardly need to make any changes in order to run this kind of experiment.
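That featurization step can be sketched as follows; the "[SEP]"-style marker string is an assumption here, and any reserved token would do:

```python
def chain_example(premise, hypothesis, boundary="[SEP]"):
    """Flatten a premise/hypothesis pair into one token sequence, with a
    boundary marker so the model can learn where the transition happens."""
    return premise + [boundary] + hypothesis

seq = chain_example(["every", "dog", "danced"], ["every", "poodle", "moved"])
```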
[00:11:14] We could also think about a modification that would bring together sentence encoding with chaining. This would be where we have two sets of RNN parameters, one for the premise and one for the hypothesis, but we nonetheless chain them together instead of separately encoding them. So, as before, I have a premise RNN in green and a hypothesis RNN in purple. They have the same structure but different learned parameters, and the handoff is essentially that the initial hidden state for the hypothesis is the final output state for the premise. In that way, you get a seamless transition between these two models, and this would allow the model some space to learn that premise tokens and premise sequences have a different status than those that appear in the hypothesis.
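A minimal PyTorch sketch of that handoff; the variable names and dimensions are illustrative, not the course's code:

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 50, 50
embedding = nn.Embedding(10, embed_dim)
# Same structure, different learned parameters:
rnn_premise = nn.RNN(embed_dim, hidden_dim, batch_first=True)
rnn_hypothesis = nn.RNN(embed_dim, hidden_dim, batch_first=True)

premise = torch.tensor([[1, 2, 3]])
hypothesis = torch.tensor([[1, 4, 5]])

# The handoff: the premise's final state initializes the hypothesis RNN.
_, h_premise = rnn_premise(embedding(premise))
outputs, h_final = rnn_hypothesis(embedding(hypothesis), h_premise)
# h_final would then feed the classifier layers.
```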
[00:11:59] And let me just close by mentioning a few other strategies, because this was by no means exhaustive, but it's kind of interesting, at the high level of architecture, to think about sentence-encoding versus these chained models. So first, the TorchRNNClassifier feeds its final hidden state directly to the classifier layer, but we have options like bidirectional=True, which would use as the summary representation both the final and the initial hidden states, essentially, and feed those into the classifier. So it's a different notion of sentence encoding, or of sequence encoding.
[00:12:33] Other ideas here: instead of restricting ourselves to just one or a few of the final states, we could do some kind of pooling, with max or mean, across all of the output states.
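These pooling options are one-liners over the RNN's output states; a sketch with dummy data (the shapes are illustrative):

```python
import torch

# Stand-in for RNN output states: [batch=1, seq_len=6, hidden=50]
outputs = torch.randn(1, 6, 50)

last_state = outputs[:, -1, :]           # the standard summary
max_pooled = outputs.max(dim=1).values   # elementwise max over time
mean_pooled = outputs.mean(dim=1)        # elementwise mean over time
```

Each of these yields one fixed-dimensional vector per example, so any of them can replace the final-state summary in either the sentence-encoding or the chained setup.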
[00:12:43] And different pooling options can be combined with different versions of these models, either sentence-encoding or chained. We could also, of course, have additional hidden layers between the embedding and the classifier layer. I've shown you just one for the sake of simplicity, but deeper might be better, especially for the very large NLI datasets that we have. And finally, an important source of innovation in this and many other spaces is the idea of adding attention mechanisms to these models, and that's such an important idea that I'm going to save it for the next screencast in this series.
Lecture 037
Attention | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=vJYhPL6U3h4
---
Transcript
[00:00:04] Welcome, everyone. This is part five in our series on natural language inference. We're going to be talking about attention mechanisms. Attention was an important source of innovation in the NLI literature, and of course it's only grown in prominence since then.
[00:00:16] Let's begin with some guiding ideas. In the context of the NLI problem, we might have an intuition that, for a lot of our architectures, we just need more connections between the premise and hypothesis. Possibly, in processing the hypothesis, we just need the model to have some reminders about what the premise actually contained, and whatever summary representation we have of that premise might just not be enough from the point of view of processing the hypothesis and feeding the representation into the classifier layer.
[00:00:46] Relatedly, there's a persistent intuition in the NLI literature that it's useful to softly align the premise and hypothesis, to find corresponding words and phrases between those two texts. It can be difficult to do that at a mechanical level, but attention mechanisms might allow us, via our data-driven learning process, to find soft connections between the premise and hypothesis in the weights for these attention layers, and achieve some of the effects that we would get from a real alignment process.
[00:01:16] So let's begin with global attention. This is the simplest attention mechanism that you see in the NLI literature, but it's already quite powerful, and as you'll see, it has deep connections with the attention mechanisms in the transformer.
[00:01:27] transformer so to make this concrete let's start
[00:01:29] so to make this concrete let's start with a simple example we have every dog
[00:01:31] with a simple example we have every dog danced as our premise some poodle danced
[00:01:34] danced as our premise some poodle danced as our hypothesis and they're fit
[00:01:36] as our hypothesis and they're fit together into this chained rnn model for
[00:01:38] together into this chained rnn model for nli
[00:01:39] nli now standardly what we would do is take
[00:01:41] now standardly what we would do is take this final representation hc
[00:01:44] this final representation hc as the summary representation for the
[00:01:45] as the summary representation for the entire sequence and feed that directly
[00:01:47] entire sequence and feed that directly into the classifier
[00:01:49] into the classifier what we're going to do when we add
[00:01:50] what we're going to do when we add attention mechanisms is instead offer
[00:01:53] attention mechanisms is instead offer some connections back from this state
[00:01:56] some connections back from this state into the premise states the way that
[00:01:58] into the premise states the way that process gets started is via a series of
[00:02:00] process gets started is via a series of dot products so we're going to take our
[00:02:01] dot products so we're going to take our target vector hc
[00:02:03] target vector hc and take its dot product with each one
[00:02:05] and take its dot product with each one of the hidden representations
[00:02:07] of the hidden representations corresponding to tokens in the premise
[00:02:10] corresponding to tokens in the premise and that gives us this vector of
[00:02:11] and that gives us this vector of unnormalized scores just the dot
[00:02:13] unnormalized scores just the dot products
[00:02:14] products and it's common then to soft max
[00:02:16] and it's common then to soft max normalize those scores into our
[00:02:17] normalize those scores into our attention weights alpha
[00:02:19] attention weights alpha what we do with alpha is then create our
[00:02:21] what we do with alpha is then create our context vector and the way that happens
[00:02:23] context vector and the way that happens is that we're going to get a weighted
[00:02:24] is that we're going to get a weighted view of all those premise states each
[00:02:27] view of all those premise states each one h1 h2 and h3 is weighted by its
[00:02:30] one h1 h2 and h3 is weighted by its corresponding attention weight which is
[00:02:32] corresponding attention weight which is capturing its kind of unnormalized
[00:02:34] capturing its kind of unnormalized notion of similarity with our target
[00:02:36] notion of similarity with our target vector hc
[00:02:38] vector hc and then to get a fixed dimensional
[00:02:39] and then to get a fixed dimensional version of that we take the mean or it
[00:02:41] version of that we take the mean or it could be the sum of all of those
[00:02:43] could be the sum of all of those weighted views of the premise
[00:02:45] weighted views of the premise next we get our attention combination
[00:02:47] next we get our attention combination layer and there are various ways to do
[00:02:49] layer and there are various ways to do this one simple one would be to simply
[00:02:51] this one simple one would be to simply concatenate our context vector with our
[00:02:54] concatenate our context vector with our original context uh target vector hc and
[00:02:57] original context uh target vector hc and feed those through a kind of dense layer
[00:02:58] feed those through a kind of dense layer of learned parameters
[00:03:00] of learned parameters another perspective kind of similar is
[00:03:02] another perspective kind of similar is to give the context vector and our
[00:03:04] to give the context vector and our target vector hc each one their own
[00:03:06] target vector hc each one their own weights and have an additive combination
[00:03:08] weights and have an additive combination of those two and again feed it through
[00:03:10] of those two and again feed it through some kind of non-linearity
[00:03:13] some kind of non-linearity and you could think of various other
[00:03:14] and you could think of various other designs for this and that gives us this
[00:03:16] designs for this and that gives us this attention combination h tilde
[00:03:19] attention combination h tilde and then finally the classifier layer is
[00:03:21] and then finally the classifier layer is a simple dense layer just as before
[00:03:23] a simple dense layer just as before except instead of using just hc we now
[00:03:26] except instead of using just hc we now use this h tilde representation which
[00:03:29] use this h tilde representation which incorporates both hc
[00:03:31] incorporates both hc and that kind of weighted mixture of
[00:03:33] and that kind of weighted mixture of premise states
[00:03:36] It might be useful to go through this with some specific numerical values. What I've done is imagine that we have two-dimensional representations for all of these vectors, and I've ensured that, proportionally, "every" is a lot like this final representation, with that similarity dropping off as we move through the premise states. You'll see what happens when we take the dot products: the first step gives us the unnormalized scores, and the highest unnormalized similarity is with the first token, followed by the second and then the third.

[00:04:09] The softmax normalization step flattens out those dot products a little bit, but we get the same proportional ranking with respect to hC. Here's the context vector: it's just a mean of the weighted values of all of these vectors, and that gives us k.

[00:04:26] That k is then fed into the attention combination layer. In orange here is the context vector (two dimensions); down here we have hC faithfully repeated; and this matrix of weights Wk is going to give us, in the end, after the non-linearity, h-tilde. Then the classifier is as before.

[00:04:46] So that's a simple worked example of how these attention mechanisms work. The idea is that we are fundamentally weighting this target representation hC by its similarity with the previous premise states, but all of them are mixed in, and the influence of each is proportional to that unnormalized similarity.
[00:05:07] unnormalized similarity there are other scoring functions that
[00:05:08] there are other scoring functions that you could use of course we've just done
[00:05:10] you could use of course we've just done simple dot product up here but you can
[00:05:12] simple dot product up here but you can also imagine having learned parameters
[00:05:14] also imagine having learned parameters in there
[00:05:15] in there or doing concatenation of the learn
[00:05:16] or doing concatenation of the learn parameters this is a kind of bilinear
[00:05:18] parameters this is a kind of bilinear form and this is just a concatenation of
[00:05:20] form and this is just a concatenation of those two states fed through these third
[00:05:22] those two states fed through these third weights and once you see this kind of
[00:05:24] weights and once you see this kind of design space you can imagine very a lot
[00:05:26] design space you can imagine very a lot of other ways in which you could mix in
[00:05:28] of other ways in which you could mix in parameters and have different views of
[00:05:30] parameters and have different views of this global attention mechanism
[00:05:34] this global attention mechanism we could go one step further here that
[00:05:35] we could go one step further here that was global attention in word by word
[00:05:38] was global attention in word by word attention we're going to have a lot more
[00:05:39] attention we're going to have a lot more learned parameters and a lot more
[00:05:41] learned parameters and a lot more connections between the hypothesis back
[00:05:44] connections between the hypothesis back into the premise
[00:05:45] into the premise so to make this kind of tractable i
[00:05:47] so to make this kind of tractable i picked one pretty simple view of how
[00:05:49] picked one pretty simple view of how this could work and the way we should
[00:05:51] this could work and the way we should track these computations is focus on
[00:05:54] track these computations is focus on this vector b here because we're going
[00:05:56] this vector b here because we're going to move through time but let's imagine
[00:05:58] to move through time but let's imagine that we've already processed the a state
[00:06:00] that we've already processed the a state and we will subsequently process the c
[00:06:02] and we will subsequently process the c state so we're focused on b
[00:06:05] state so we're focused on b and the way we establish these
[00:06:06] and the way we establish these connections is by taking the previous
[00:06:08] connections is by taking the previous context vector that we created that's ka
[00:06:11] context vector that we created that's ka here
[00:06:12] here we're going to multiply that by repeated
[00:06:13] we're going to multiply that by repeated copies of the b state and that's simply
[00:06:15] copies of the b state and that's simply so that we get the same dimensionality
[00:06:18] so that we get the same dimensionality as we have in the premise over here
[00:06:19] as we have in the premise over here where i've simply copied over into a
[00:06:21] where i've simply copied over into a matrix all three of those states
[00:06:24] matrix all three of those states and we have a matrix of learn parameters
[00:06:26] and we have a matrix of learn parameters here and an additive combination of the
[00:06:28] here and an additive combination of the two followed by a non-linearity that's
[00:06:30] two followed by a non-linearity that's going to give us this m here which kind
[00:06:32] going to give us this m here which kind of corresponds to the attention weights
[00:06:34] of corresponds to the attention weights in the previous global attention
[00:06:36] in the previous global attention mechanisms
[00:06:37] mechanisms we're going to soft max normalize those
[00:06:39] we're going to soft max normalize those and that literally gives us the weights
[00:06:41] and that literally gives us the weights and you can see that there's some
[00:06:42] and you can see that there's some additional parameters in here to create
[00:06:44] additional parameters in here to create the right dimensionalities
[00:06:47] the right dimensionalities and then finally we have the context at
[00:06:48] and then finally we have the context at b so that's going to be a repeated view
[00:06:50] b so that's going to be a repeated view of all these premises weighted by our
[00:06:52] of all these premises weighted by our context vector as before and then fed
[00:06:55] context vector as before and then fed through some additional parameters wa
[00:06:57] through some additional parameters wa here
[00:06:58] here and that gives us as you can see here
[00:07:00] and that gives us as you can see here the context representation for the state
[00:07:02] the context representation for the state b
[00:07:03] b when we move to state c of course that
[00:07:05] when we move to state c of course that will be used in the place of a here and
[00:07:08] will be used in the place of a here and c will go in for all these purple values
[00:07:10] c will go in for all these purple values and the computation will proceed as
[00:07:11] and the computation will proceed as before and in that way because we have
[00:07:14] before and in that way because we have all of these additional learn parameters
[00:07:15] all of these additional learn parameters we can meaningfully move through the
[00:07:17] we can meaningfully move through the entire sequence
[00:07:18] entire sequence updating our parameters and learning
[00:07:20] updating our parameters and learning connections from each hypothesis token
[00:07:22] connections from each hypothesis token back into the premise so it's much more
[00:07:25] back into the premise so it's much more powerful than the previous view where we
[00:07:26] powerful than the previous view where we had relatively few learned parameters in
[00:07:29] had relatively few learned parameters in our attention mechanisms and therefore
[00:07:31] our attention mechanisms and therefore we can only really meaningfully connect
[00:07:33] we can only really meaningfully connect that from the state that we're going to
[00:07:35] that from the state that we're going to feed into the classifier so this is much
[00:07:37] feed into the classifier so this is much more expressive
[00:07:40] more expressive right and then once we've done the
[00:07:41] right and then once we've done the entire sequence processing finally we
[00:07:44] entire sequence processing finally we get the representation for c here as fed
[00:07:46] get the representation for c here as fed through these mechanisms and that
[00:07:48] through these mechanisms and that becomes the input to the classifier that
[00:07:50] becomes the input to the classifier that we ultimately use
[00:07:53] The connection with the transformer should be apparent; this returns us to the global attention mechanism. Recall that for the transformer, we have sequences of tokens with their positional encodings, which gives us an embedding, and at that point we establish a lot of dot-product connections. I showed you in the lecture on the transformer that the mechanisms here are identical to the mechanisms that we used for dot-product attention; it's just that, in the context of the transformer, we do it from every state to every other state.
[00:08:25] And then, of course, the computations proceed through subsequent steps in the transformer layer, and potentially on through multiple transformer layers.

[00:08:34] There are some other variants; this is just the beginning of a very large design space for attention mechanisms. Let me mention a few. We could have local attention. This was actually an early contribution in the context of machine translation, and it would build connections between selected points in the premise and hypothesis based on some, possibly a priori, notion we have of which things are likely to be important for our problem.

[00:08:58] Word-by-word attention, as I've said, can be set up in many ways with many more learned parameters. The classic paper is the one that I'm recommending as the reading for this unit, Rocktäschel et al., where they offer a really pioneering view of this, using even more complex attention mechanisms than I presented under word-by-word attention, but following a lot of the same intuitions, I would say.

[00:09:20] The attention representation at time t could be appended to the hidden representation at time t+1. This would give us another way of moving sequentially through the sequence, having meaningful attention at each one of those points, as opposed to global attention, which would just be for that final state.

[00:09:36] And then there are other connections even further afield. For example, memory networks can be used to address similar issues, and they have similar intuitions behind them as attention mechanisms as applied to the NLI problem. That's more explicitly drawing on the idea that, at late states in processing, we might need a bit of a reminder about what was in the previous context that we processed.
Lecture 038
NLU and Information Retrieval | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=Bn6RNrwwiI0
---
Transcript
[00:00:06] Welcome everyone to the first screencast in our NLU and information retrieval series. The goal of this introductory screencast is twofold: I will first introduce the IR area, then I will discuss ways in which NLU and IR can interact productively, focusing on how retrieval can be an effective component in defining our NLU tasks and building our NLU systems.
[00:00:32] systems so what is information retrieval or ir
[00:00:36] so what is information retrieval or ir to a first approximation this is the
[00:00:38] to a first approximation this is the field concerned with search
[00:00:40] field concerned with search the first example that typically comes
[00:00:42] the first example that typically comes to mind is web search
[00:00:44] to mind is web search but as we'll see today this field
[00:00:46] but as we'll see today this field extends richly beyond web search and has
[00:00:49] extends richly beyond web search and has strong connections to our work with mlu
[00:00:54] Let's now attempt to define IR more formally. Here is a simplified version of the definition used by the Introduction to Information Retrieval book by Manning et al. They define IR as the process of finding material that fulfills an information need from within a large collection of unstructured documents. Let's unpack this definition.
[00:01:20] Starting on the left-hand side, the definition says that we're concerned with finding material from a large collection; in other words, large-scale search is at the essence of IR. On the right-hand side, the definition restricts this to unstructured documents: items like text, media, and products, ones that lack the clear-cut structure of things like database tables or graphs. Structure-based search, or structure-based traversal of graphs and databases, is not typically considered an IR problem for our purposes, although of course such problems are interesting in their own right.
[00:02:00] This leads us to the term that was at the center of our definition, namely the information need. It is difficult to think of IR without thinking of the user at the center of the system, and the information need is what the user has in mind to solve a task, or otherwise learn or reach the material that they're looking for. The goal of a search system is thus to identify and fulfill the user's information need, so whatever we retrieve is only going to be considered relevant to the extent that it advances this goal.

[00:02:37] In most IR tasks, the user will explicitly provide us with a query that summarizes and expresses that information need. It is very important to note that this query may contain ambiguity, may miss some important details, or might even sometimes ask the wrong question, and that's completely normal. The user may not even be sure what precisely they're looking for; that's why they're searching for something. So we must rely on our knowledge of the task, and whatever we know about the user within the constraints of our application, in order to solve IR problems.
[00:03:16] The second thing is that typical information needs vary by task. The typical information need that we have, and how best to interpret and deal with it, are factors that vary greatly by task and by collection type in IR. I'll take it you've already made the connection between IR and searching the web, searching your email, and also finding files on your desktop buried in deep folders.

[00:03:43] But there are plenty of other IR tasks where search is crucial. For instance, you might want to find recent papers related to the BERT paper by Devlin et al. Of course, this is not the best example, because there are many, many papers related to BERT these days, but in any case, your query might be the full text of the BERT paper, and the system might try to search the ACL Anthology and the computational linguistics section of arXiv for papers similar to BERT.
[00:04:13] Recommendation is another key IR topic. In recommendation, we still seek relevant material from a large collection of unstructured items, but in this case the user has no explicit query; instead, the previous interactions enable the recommendation system to suggest good matches.

[00:04:33] Patent search is yet another IR task, and unlike the others we've mentioned so far, it's often used by experts, not by average users, and it places a very strong emphasis on recall. So unlike the average web query, where you might be completely content with one very good match at the top, patent search may need to find every relevant patent for a query, or something that approximates that.

[00:05:01] Lastly, even buying a new laptop can be an IR problem, and in particular a conversational IR problem. Here the system may go, in a back-and-forth style, between searching for relevant products and asking the user for their preferences about cost, screen quality, storage, and other factors on online e-commerce platforms.
[00:05:27] having looked at all of these ir tasks
[00:05:30] having looked at all of these ir tasks it's important to keep in mind that each
[00:05:32] it's important to keep in mind that each of those tasks poses its own unique
[00:05:34] of those tasks poses its own unique challenges so even though we're always
[00:05:36] challenges so even though we're always interested in relevance and in finding
[00:05:38] interested in relevance and in finding relevant items
[00:05:39] relevant items each of those tasks
[00:05:42] each of those tasks has its own
[00:05:43] has its own challenges and its own
[00:05:45] challenges and its own components
[00:05:47] components to underscore this let's use web search
[00:05:50] to underscore this let's use web search as a frame of reference while standard
[00:05:52] as a frame of reference while standard web search might pose considerable
[00:05:54] web search might pose considerable challenge when it comes to the massive
[00:05:56] challenge when it comes to the massive scale involved in terms of documents and
[00:05:58] scale involved in terms of documents and also queries
[00:06:00] also queries even something as seemingly mundane as
[00:06:02] even something as seemingly mundane as searching for conversations on your
[00:06:04] searching for conversations on your slack workspace
[00:06:06] slack workspace often lacks key features that makes web
[00:06:09] often lacks key features that makes web search tractable in the first place and
[00:06:11] search tractable in the first place and makes it work the way it does in the
[00:06:13] makes it work the way it does in the first place
[00:06:15] for one
[00:06:17] for one so many web searches ask frequently
[00:06:19] so many web searches ask frequently searched or head queries
[00:06:21] searched or head queries the sheer popularity of headquarters
[00:06:24] the sheer popularity of headquarters makes them an easy target for
[00:06:26] makes them an easy target for large
[00:06:28] search engines
[00:06:29] of course there's always a long tail of
[00:06:31] of course there's always a long tail of rare search queries that still pose
[00:06:32] rare search queries that still pose a considerable challenge especially in
[00:06:34] considerable challenge especially in highly technical domains but it still
[00:06:36] highly technical domains but it still stands that in a domain like web
[00:06:39] search solving the head queries gets
[00:06:42] you
[00:06:43] a very big share
[00:06:45] of answering most user queries
[00:06:49] of answering uh most user queries as another factor web search enjoys
[00:06:51] as another factor web search enjoys highly redundant
[00:06:53] highly redundant documents out there that address
[00:06:55] documents out there that address common topics where each document is
[00:06:58] common topics where each document is written in a slightly different way
[00:07:00] written in a slightly different way this often shifts the search problem
[00:07:02] this often shifts the search problem into a precision one basically finding
[00:07:05] into a precision one basically finding some documents at least one that
[00:07:07] some documents at least one that definitely match the query as opposed to
[00:07:09] definitely match the query as opposed to at a cold one finding every document
[00:07:11] a recall one finding every document
[00:07:13] that matches the query because there's already too many of them
[00:07:16] already too many of them clearly this is not always the case if
[00:07:18] clearly this is not always the case if you're looking for a very specific item
[00:07:20] you're looking for a very specific item in your slack conversational history
[00:07:23] in your slack conversational history yet another factor in web search is the
[00:07:25] yet another factor in web search is the rich link structure that links between
[00:07:28] rich link structure that links between existing related web pages
[00:07:31] existing related web pages which again introduces you know more
[00:07:34] which again introduces you know more hierarchy and might make this task more
[00:07:36] hierarchy and might make this task more tractable in practice
[00:07:38] tractable in practice the idea here is definitely not
[00:07:41] the idea here is definitely not that web search is easy because it's not
[00:07:42] that web search is easy because it's not easy but that the different tasks pose
[00:07:45] easy but that the different tasks pose different challenges for our ir systems
[00:07:49] different challenges for our ir systems so that is ir
[00:07:51] so that is ir where does our work on nlu fit in ir
[00:07:55] where does our work on nlu fit in ir well of course queries and documents are
[00:07:58] well of course queries and documents are often expressed in natural language at
[00:08:00] often expressed in natural language at least in part
[00:08:02] least in part so we naturally want to understand
[00:08:04] so we naturally want to understand the query's meaning and its intent
[00:08:07] the query's meaning and its intent and understand the documents' contents
[00:08:09] and understand the documents' contents and their topics to be able to
[00:08:11] and their topics to be able to effectively match queries to documents
[00:08:14] effectively match queries to documents this form of understanding
[00:08:16] this form of understanding is critical
[00:08:18] is critical although you can go pretty far for many
[00:08:20] although you can go pretty far for many ir tasks with intelligently matching
[00:08:22] ir tasks with intelligently matching terms at a lexical level
[00:08:25] terms at a lexical level the vocabulary mismatch problem makes
[00:08:27] the vocabulary mismatch problem makes this quite unattractive in practice
[00:08:30] this quite unattractive in practice to explain
[00:08:31] to explain vocabulary mismatch happens when queries
[00:08:34] vocabulary mismatch happens when queries and documents use different terms to
[00:08:36] and documents use different terms to refer to the same thing
[00:08:39] refer to the same thing so i have here on the slide an example
[00:08:41] so i have here on the slide an example query that shows this happening um in
[00:08:43] query that shows this happening um in practice
[00:08:44] practice so the question is or the query is what
[00:08:47] so the question is or the query is what what what compounds protect the
[00:08:49] what what compounds protect the digestive system against viruses
[00:08:52] digestive system against viruses and the snippet that we are interested
[00:08:53] and the snippet that we are interested in finding says in the stomach gastric
[00:08:56] in finding says in the stomach gastric acid and proteases serve as powerful
[00:08:59] acid and proteases serve as powerful chemical defenses against ingested
[00:09:01] chemical defenses against ingested pathogens
[00:09:02] pathogens you can see that the
[00:09:04] you can see that the passage that we found here uses
[00:09:06] passage that we found here uses pathogens instead of viruses
[00:09:09] pathogens instead of viruses which is a bit more general
[00:09:11] which is a bit more general stomach instead of digestive system
[00:09:13] stomach instead of digestive system which is a bit more specific
[00:09:15] which is a bit more specific and chemical instead of compounds and
[00:09:17] and chemical instead of compounds and defenses instead of protect but it's
[00:09:19] defenses instead of protect but it's pretty clear that it still answers the
[00:09:20] pretty clear that it still answers the same question and answers it very well
[00:09:22] same question and answers it very well in fact
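As a toy illustration of this vocabulary mismatch (the query and passage are the ones from the slide, but the simple whitespace tokenization and the small stopword list are my own simplifying assumptions), exact term overlap between the two turns out to be empty once stopwords are removed, which is exactly why a purely lexical matcher would miss this passage:

```python
# Toy demonstration of the vocabulary mismatch problem: the passage answers
# the query, yet shares no content words with it under exact term matching.

def content_terms(text,
                  stopwords=frozenset({"the", "in", "as", "and", "of",
                                       "a", "what", "against"})):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    return {w.strip(".,?") for w in text.lower().split()} - stopwords

query = "what compounds protect the digestive system against viruses"
passage = ("in the stomach gastric acid and proteases serve as powerful "
           "chemical defenses against ingested pathogens")

overlap = content_terms(query) & content_terms(passage)
print(overlap)  # empty: no content-word overlap despite a perfect answer
```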
[00:09:24] in fact so where does nlu fit in ir
[00:09:27] so where does nlu fit in ir i guess a nice code here is jimmy lynn's
[00:09:31] i guess a nice code here is jimmy lynn's statement
[00:09:32] statement jamilyn is an ir researcher who says ir
[00:09:35] jamilyn is an ir researcher who says ir makes nlp useful and nlp makes ir
[00:09:38] makes nlp useful and nlp makes ir interesting
[00:09:40] interesting of course we do think nlp is useful
[00:09:42] of course we do think nlp is useful anyway and also ir is interesting anyway
[00:09:45] anyway and also ir is interesting anyway so i added more between brackets here
[00:09:47] so i added more between brackets here but we do get jimmy's point
[00:09:52] okay
[00:09:54] okay on to our more central question where
[00:09:56] on to our more central question where does ir fit into our study of nlu and
[00:09:59] does ir fit into our study of nlu and how can ir serve us
[00:10:02] how can i our service in thinking about this i believe it's
[00:10:04] in thinking about this i believe it's helpful to appreciate that as our models
[00:10:07] helpful to appreciate that as our models become more advanced in nlu
[00:10:09] become more advanced in nlu they too
[00:10:11] they too like humans start to have concrete
[00:10:13] like humans start to have concrete information needs in solving their tasks
[00:10:17] more concretely retrieval can contribute
[00:10:19] more concretely retrieval can contribute to our nlu tasks and systems in three
[00:10:22] to our nlu tasks and systems in three exciting ways
[00:10:24] exciting ways first
[00:10:25] first retrieval provides a rich source for
[00:10:27] retrieval provides a rich source for creating challenging and realistic
[00:10:29] creating challenging and realistic nlu tasks ones where finding
[00:10:32] nlu tasks ones where finding information from a large corpus is
[00:10:34] information from a large corpus is central
[00:10:35] central we will look closely at this bullet in
[00:10:37] we will look closely at this bullet in the remainder of these slides
[00:10:39] the remainder of these slides second
[00:10:40] second retrieval offers a powerful tool to make
[00:10:42] retrieval offers a powerful tool to make nlu models for existing tasks more
[00:10:45] nlu models for existing tasks more accurate and more effective we'll touch
[00:10:48] accurate and more effective we'll touch upon this today but we'll discuss it in
[00:10:49] upon this today but we'll discuss it in more depth later
[00:10:51] more depth later third
[00:10:52] third retrieval can often lend us a nice
[00:10:54] retrieval can often lend us a nice framework for evaluating nlu systems
[00:10:56] framework for evaluating nlu systems whenever the output domain is large just
[00:10:58] whenever the output domain is large just just like in search
[00:11:00] just like in search or whenever low latency is important
[00:11:03] or whenever low latency is important which are
[00:11:04] which are key characteristics in ir we will expand
[00:11:07] key characteristics in ir we will expand on this in a later screencast as well
[00:11:10] on this in a later screencast as well in the remainder of this screencast
[00:11:12] in the remainder of this screencast we'll explore how retrieval allows us to
[00:11:14] we'll explore how retrieval allows us to pose very challenging and very realistic
[00:11:16] pose very challenging and very realistic open domain in a new tasks
[00:11:19] open domain in a new tasks chris has briefly introduced squad
[00:11:22] chris has briefly introduced squad before in the overview lecture
[00:11:24] before in the overview lecture to remind you of this question answering
[00:11:26] to remind you of this question answering task the input that we are given in
[00:11:28] task the input that we are given in squad is a context passage which was
[00:11:31] squad is a context passage which was obtained from wikipedia
[00:11:33] obtained from wikipedia and a question that tests our model's
[00:11:35] and a question that tests our model's understanding of this one passage
[00:11:38] understanding of this one passage this is an interesting task on its own
[00:11:40] this is an interesting task on its own right one that has enjoyed tons of work
[00:11:42] right one that has enjoyed tons of work and lots of recent progress
[00:11:44] and lots of recent progress due to pre-trained language models
[00:11:49] but with retrieval in mind we can move
[00:11:52] but with retrieval in mind we can move from standard qa like squad to open
[00:11:55] from standard qa like squad to open domain question answering
[00:11:57] domain question answering specifically in open domain question and
[00:11:59] specifically in open domain question and setting we can ask what if we want to
[00:12:02] setting we can ask what if we want to answer the same kinds of factoid
[00:12:04] answer the same kinds of factoid questions as quest as squad or other
[00:12:06] questions as quest as squad or other types of questions
[00:12:08] types of questions but without the perhaps unrealistic hint
[00:12:11] but without the perhaps unrealistic hint of receiving the particular passage in
[00:12:13] of receiving the particular passage in wikipedia that already contains the
[00:12:16] wikipedia that already contains the answer
[00:12:18] answer in this case we can take all of the
[00:12:20] in this case we can take all of the english wikipedia just as an example as
[00:12:22] english wikipedia just as an example as our context and then again again pose
[00:12:24] our context and then again again pose the same question as squad over all of
[00:12:27] the same question as squad over all of wikipedia
[00:12:28] wikipedia and build models that can answer these
[00:12:30] and build models that can answer these open questions over large corpora
[00:12:34] open questions over large corpora so
[00:12:36] so how would we answer such questions
[00:12:38] how would we answer such questions the literature in particular a nice
[00:12:40] the literature in particular a nice emnlp 2020 paper by robert setel
[00:12:43] emnlp 2020 paper by robert setel introduces a nice analogy for how we
[00:12:45] introduces a nice analogy for how we might attempt to tackle this task and
[00:12:47] might attempt to tackle this task and how we could think about it
[00:12:48] how we could think about it the first
[00:12:50] the first perhaps more familiar and perhaps
[00:12:52] perhaps more familiar and perhaps simpler solution is to pose the question
[00:12:54] simpler solution is to pose the question to one of our usual transformers
[00:12:57] to one of our usual transformers and specifically to a generative
[00:13:00] and specifically to a generative sequence to sequence model something
[00:13:01] sequence to sequence model something like t5 gpt2 or gpt3
[00:13:06] like t5 gpt2 or gpt3 in this case we're relying on the
[00:13:08] in this case we're relying on the knowledge stored internally and
[00:13:10] knowledge stored internally and implicitly in the model parameters
[00:13:12] implicitly in the model parameters so the model memorizes these facts just
[00:13:14] so the model memorizes these facts just like you would do when you enter a
[00:13:16] like you would do when you enter a closed book exam
[00:13:18] closed book exam often this knowledge is memorized the
[00:13:21] often this knowledge is memorized the same way language is learned as a result
[00:13:24] same way language is learned as a result of language model pre-training or other
[00:13:27] of language model pre-training or other similar tasks
[00:13:30] closed book approaches to these
[00:13:32] closed book approaches to these characteristically open domain problems
[00:13:35] characteristically open domain problems offer a particularly consistent way
[00:13:37] offer a particularly consistent way of improving quality and coverage well
[00:13:40] of improving quality and coverage well just take your model train a larger
[00:13:42] just take your model train a larger version of it on more data and hope that
[00:13:45] version of it on more data and hope that that encodes more knowledge and gives
[00:13:47] that encodes more knowledge and gives you more accurate
[00:13:49] you more accurate results
[00:13:51] results as an alternative to this
[00:13:53] as an alternative to this we could think about open book
[00:13:54] we could think about open book approaches to open domain question
[00:13:56] approaches to open domain question answering
[00:13:57] answering so there's the analogy of doing an open
[00:13:59] so there's the analogy of doing an open book exam
[00:14:00] book exam which tests not really your memory but
[00:14:03] which tests not really your memory but your awareness of where to look for
[00:14:05] your awareness of where to look for answers and how to use them quickly and
[00:14:08] answers and how to use them quickly and productively
[00:14:10] productively in this case we would build what are
[00:14:12] in this case we would build what are typically called retrieve and read
[00:14:15] typically called retrieve and read architectures
[00:14:16] architectures as shown at the bottom of the slide we
[00:14:19] as shown at the bottom of the slide we take the question and first feed it to a
[00:14:21] take the question and first feed it to a retriever model
[00:14:23] retriever model the retriever searches our collection of
[00:14:25] the retriever searches our collection of facts in this case wikipedia as an
[00:14:27] facts in this case wikipedia as an example
[00:14:28] example and extracts a bunch of passages or
[00:14:30] and extracts a bunch of passages or other contexts that seem useful in
[00:14:32] other contexts that seem useful in trying to answer the original question
[00:14:35] trying to answer the original question these passages are then fed to a
[00:14:37] these passages are then fed to a downstream reader
[00:14:38] downstream reader so that could just be a small birth-like
[00:14:40] so that could just be a small birth-like model which studies these passages to
[00:14:43] model which studies these passages to answer the original question
[00:14:46] answer the original question in this pipeline we've essentially
[00:14:48] in this pipeline we've essentially relied on this new retriever component
[00:14:50] relied on this new retriever component to reduce the original open domain
[00:14:52] to reduce the original open domain question answering problem to a much
[00:14:54] question answering problem to a much smaller scale standard question
[00:14:57] smaller scale standard question answering task
[00:14:58] answering task where the downstream model sees a
[00:15:00] where the downstream model sees a question and the relevant passage for a
[00:15:02] question and the relevant passage for a few messages before extracting
[00:15:04] few messages before extracting a short answer
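A minimal sketch of this retrieve-and-read pipeline, with a toy term-overlap retriever and a stub reader standing in for the real neural components (the collection, question, and all function names here are hypothetical illustrations, not the lecture's system):

```python
# Retrieve-and-read in miniature: the retriever narrows the full collection
# to a handful of passages, reducing open-domain QA to small-scale reading.

def retrieve(query, collection, k=2):
    """Stub retriever: rank passages by number of shared lowercase terms."""
    q_terms = set(query.lower().split())
    ranked = sorted(collection,
                    key=lambda p: len(q_terms & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def read(query, passages):
    """Stub reader: return the passage sharing the most query terms.
    A real reader would be a BERT-like model extracting a short span."""
    q_terms = set(query.lower().split())
    return max(passages, key=lambda p: len(q_terms & set(p.lower().split())))

collection = [
    "the eiffel tower is in paris",
    "the capital of france is paris",
    "bert is a pretrained language model",
]
question = "what is the capital of france"
top_passages = retrieve(question, collection)  # open domain -> few passages
answer_context = read(question, top_passages)  # standard reading comprehension
print(answer_context)
```

The key point of the sketch is the division of labor: everything downstream of `retrieve` only ever sees `k` passages, no matter how large the collection grows.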
[00:15:10] importantly we could say that the reader
[00:15:12] importantly we could say that the reader in this architecture is a user that has
[00:15:14] in this architecture is a user that has an information need and it's the
[00:15:16] an information need and it's the retriever's task to start to satisfy
[00:15:19] retriever's task to start to satisfy this need accurately and efficiently
[00:15:23] we will study various methods for
[00:15:25] we will study various methods for building retrievals in the subsequent
[00:15:26] building retrievals in the subsequent screencasts and look at how these
[00:15:28] screencasts and look at how these retrievers interact with downstream
[00:16:31] retrievers interact with downstream readers but for now let us just explore
[00:16:33] readers but for now let us just explore some of the higher level differences
[00:16:35] some of the higher level differences between open book and closed book
[00:16:37] between open book and closed book solutions to open domain problems
[00:15:41] solutions to open domain problems our open book solutions often get to be
[00:15:44] our open book solutions often get to be much smaller while being very accurate
[00:15:46] much smaller while being very accurate still
[00:15:47] still the reason is that we've decoupled
[00:15:49] the reason is that we've decoupled knowledge from reasoning
[00:15:51] knowledge from reasoning and stored the knowledge outside the
[00:15:53] and stored the knowledge outside the model thus the model itself does not
[00:15:55] model thus the model itself does not need to store all of these facts inside
[00:15:58] need to store all of these facts inside its parameters and it gets to be much
[00:15:59] its parameters and it gets to be much smaller as a result
[00:16:01] smaller as a result as we will see later this has great
[00:16:03] as we will see later this has great implications for efficiency
[00:16:06] implications for efficiency moreover the knowledge can be easily
[00:16:09] moreover the knowledge can be easily updated by modifying the collection
[00:16:12] updated by modifying the collection as the facts in wikipedia for example
[00:16:14] as the facts in wikipedia for example evolve over time
[00:16:15] evolve over time or alternatively suppose that you want
[00:16:17] or alternatively suppose that you want to switch from answering questions over
[00:16:19] to switch from answering questions over wikipedia to posing questions
[00:16:22] wikipedia to posing questions over the nlp literature or perhaps
[00:16:24] over the nlp literature or perhaps posing questions over the documentation
[00:16:26] posing question over the documentation of your favorite software library
[00:16:28] of your favorite software library you can often
[00:16:30] you can often do that by simply swapping the
[00:16:32] do that by simply swapping the collection with a new one and keeping
[00:16:33] collection with a new one and keeping the question answering model as is
[00:16:37] the question answering model as is in order to answer questions in this new
[00:16:38] in order to answer questions in this new domain
[00:16:41] domain lastly because we can see the actual
[00:16:43] lastly because we can see the actual documents that are retrieved and the
[00:16:45] documents that are retrieved and the documents that are read by the reader to
[00:16:47] documents that are read by the reader to extract answers we're often better
[00:16:49] extract answers we're often better positioned to explain how these models
[00:16:52] positioned to explain how these models know some facts or why they make
[00:16:54] know some facts or why they make particular mistakes
[00:16:56] particular mistakes on the downside though
[00:16:58] on the downside though all of a sudden we now need to worry
[00:16:59] all of a sudden we now need to worry about the interactions between two
[00:17:01] about the interactions between two components the retriever and the reader
[00:17:04] components the retriever and the reader but i hope that the subsequent set of
[00:17:07] but i hope that the subsequent set of screencasts will convince you that
[00:17:09] screencasts will convince you that working with retrievers in nlu is very
[00:17:11] working with retrievers in nlu is very rewarding
[00:17:14] all of this discussion so far has been
[00:17:16] all of this discussion so far has been in the context of open domain question
[00:17:18] in the context of open domain question answering
[00:17:19] answering but there are many other interview tasks
[00:17:21] but there are many other interview tasks that either inherently subsume retrieval
[00:17:23] that either inherently subsume retrieval or at least can directly benefit from
[00:17:25] or at least can directly benefit from interacting with a large collection of
[00:17:27] interacting with a large collection of relevant facts
[00:17:30] one of those is claim verification or
[00:17:34] one of those is claim verification or fact checking
[00:17:35] fact checking here the model receives as input a
[00:17:37] here the model receives as input a disputed claim and its goal is to verify
[00:17:40] disputed claim and its goal is to verify or refute this claim and to return
[00:17:43] or refute this claim and to return documents that justify
[00:17:45] documents that justify its decision
[00:17:47] its decision two other tasks are query focused
[00:17:49] two other tasks are query focused summarization and informative dialogue
[00:17:52] summarization and informative dialogue where we might also work with a large
[00:17:54] where we might also work with a large collection of facts and given a topic or
[00:17:56] collection of facts and given a topic or in the context of a conversation
[00:17:59] in the context of a conversation generate a useful summary of the
[00:18:01] generate a useful summary of the resources about that topic perhaps as
[00:18:03] resources about that topic perhaps as part of a conversation with a user
[00:18:05] part of a conversation with a user interested to learn about a new topic
[00:18:08] interested to learn about a new topic lastly entity linking is a task that can
[00:18:11] lastly entity linking is a task that can be posed over a large textual knowledge
[00:18:13] be posed over a large textual knowledge base as well
[00:18:15] base as well given an utterance that refers to any
[00:18:16] given an utterance that refers to any number of ambiguous entities or events
[00:18:20] number of ambiguous entities or events we should resolve this ambiguity and map
[00:18:22] we should resolve this ambiguity and map the mentions of these entities to their
[00:18:24] the mentions of these entities to their descriptions
[00:18:26] descriptions in a large knowledge base like wikipedia
[00:18:28] in a large knowledge base like wikipedia so that would be
[00:18:29] so that would be a form of entity linking
[00:18:32] a form of entity linking kilt or knowledge intensive language
[00:18:35] kilt or knowledge intensive language tasks is a recent effort
[00:18:37] tasks is a recent effort aimed at collecting a number of
[00:18:38] aimed at collecting a number of different data sets for retrieval-based
[00:18:41] different data sets for retrieval-based nlp incidentally all of these tasks in
[00:18:44] nlp incidentally all of these tasks in kilt explicitly have a knowledge
[00:18:46] kilt explicitly have a knowledge component like answering a question or
[00:18:49] component like answering a question or verifying a claim
[00:18:53] an open question in this exciting area
[00:18:55] an open question in this exciting area is whether retrieval can improve
[00:18:57] is whether retrieval can improve performance for standard nlu tasks
[00:18:59] performance for standard nlu tasks as well ones where the knowledge
[00:19:01] as well ones where the knowledge challenge is less explicit
[00:19:04] challenge is less explicit think for example sentiment analysis
[00:19:06] think for example sentiment analysis natural language inference or any of the
[00:19:09] natural language inference or any of the other tasks we study so far
[00:19:12] other tasks we study so far well this remains an open question but i
[00:19:14] well this remains an open question but i think that accurate knowledge matters
[00:19:16] think that accurate knowledge matters for most if not all of our language
[00:19:18] for most if not all of our language tasks
[00:19:19] tasks and that converting many of these tasks
[00:19:21] and that converting many of these tasks to an open book format or bring your own
[00:19:24] to an open book format or bring your own book approach
[00:19:26] book approach may be a promising way to tackle these
[00:19:28] may be a promising way to tackle these tasks in practice
[00:19:31] in the remainder of this unit we will
[00:19:33] in the remainder of this unit we will dig deeper into traditional methods and
[00:19:35] dig deeper into traditional methods and metrics for information retrieval
[00:19:37] metrics for information retrieval and then explore recent advances in url
[00:19:40] and then explore recent advances in url ir which will make a lot of use of our
[00:19:42] ir which will make a lot of use of our nlu models like bert but in new and
[00:19:45] nlu models like bert but in new and creative ways
[00:19:47] creative ways and then we will finally discuss open
[00:19:48] and then we will finally discuss open domain question answering in more depth
[00:19:51] domain question answering in more depth as one of the most mature applications
[00:19:53] as one of the most mature applications of nau plus ir
Lecture 039
Classical IR | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=e8zKKDMAze8
---
Transcript
[00:00:05] hello everyone welcome to part two of
[00:00:08] hello everyone welcome to part two of our series on nlu and ir
[00:00:11] our series on nlu and ir this screencast will be a crash course
[00:00:13] this screencast will be a crash course in classical ir as well as evaluation
[00:00:15] in classical ir as well as evaluation methods in information retrieval
[00:00:20] let us first define the simplest form of
[00:00:22] let us first define the simplest form of our task namely ranked retrieval
[00:00:26] our task namely ranked retrieval we will be given a large collection of
[00:00:27] we will be given a large collection of text documents
[00:00:29] text documents this could be all of the all of the
[00:00:30] this could be all of the all of the passages in wikipedia
[00:00:32] passages in wikipedia perhaps a crawl of parts of the web or
[00:00:35] perhaps a crawl of parts of the web or maybe all of the documentation of
[00:00:36] maybe all of the documentation of hugging face or other software libraries
[00:00:40] hugging face or other software libraries this corpus will be provided to us
[00:00:42] this corpus will be provided to us offline that is before we interact with
[00:00:45] offline that is before we interact with any users
[00:00:46] any users and we will be able to spend a one-time
[00:00:48] and we will be able to spend a one-time effort at organizing or otherwise
[00:00:50] effort at organizing or otherwise understanding the content of these
[00:00:52] understanding the content of these documents in the corpus before we start
[00:00:54] documents in the corpus before we start searching
[00:00:57] online though
[00:00:58] online though we will receive a query from the users
[00:01:01] we will receive a query from the users which could be a natural language
[00:01:03] which could be a natural language question written in english for example
[00:01:06] question written in english for example the goal of our ranked retrieval system
[00:01:08] the goal of our ranked retrieval system will be to output a top k list of
[00:01:10] will be to output a top k list of documents sorted in decreasing order of
[00:01:12] documents sorted in decreasing order of relevance to the information need
[00:01:15] relevance to the information need that the user expressed in the query
[00:01:19] so this might be the top 10 or the top
[00:01:21] so this might be the top 10 or the top 100 results
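The offline/online split just described can be sketched as follows: a one-time offline pass organizes a toy corpus into an inverted index, and each online query is then answered as a top-k ranked list (the corpus, the `search` function, and the simple count-based scoring are illustrative assumptions, not the lecture's method):

```python
# Offline: one-time effort to organize the corpus. Online: answer queries
# by scoring only documents that actually contain the query's terms.
from collections import defaultdict, Counter

docs = {
    "d1": "hugging face hosts pretrained transformer models",
    "d2": "wikipedia passages cover many topics",
    "d3": "transformer models power modern nlp",
}

# Offline: inverted index mapping each term to per-document counts.
index = defaultdict(Counter)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term][doc_id] += 1

def search(query, k=2):
    """Online: sum per-term counts per document, return top-k doc ids
    in decreasing order of score (a stand-in for relevance)."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, tf in index[term].items():
            scores[doc_id] += tf
    return [doc_id for doc_id, _ in scores.most_common(k)]

print(search("transformer models"))
```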
[00:01:24] so how do we conduct this task of ranked
[00:01:27] so how do we conduct this task of ranked retrieval
[00:01:29] retrieval as it turns out we've already looked at
[00:01:31] as it turns out we've already looked at a way for doing this before when
[00:01:33] a way for doing this before when discussing matrix designs
[00:01:35] discussing matrix designs in particular we know that we can build
[00:01:38] in particular we know that we can build term document occurrence matrices
[00:01:40] term document occurrence matrices and in such a matrix like the one shown
[00:01:43] and in such a matrix like the one shown each term document pair
[00:01:45] each term document pair has a corresponding cell in which the
[00:01:47] has a corresponding cell in which the matrix will store the number of times
[00:01:49] matrix will store the number of times that the term appears in the document in
[00:01:52] that the term appears in the document in our corpus
[00:01:55] our corpus of course we will probably want to apply
[00:01:57] of course we will probably want to apply some sort of re-weighting here because
[00:01:59] some sort of re-weighting here because we don't want to work with these row
[00:02:00] we don't want to work with these row counts but once we've done that we can
[00:02:03] counts but once we've done that we can already answer queries that contain just
[00:02:05] already answer queries that contain just a single term pretty well and to do that
[00:02:08] a single term pretty well and to do that we would basically just return the k
[00:02:10] we would basically just return the k documents with the largest weight after
[00:02:12] documents with the largest weight after normalization or other processes for
[00:02:14] normalization or other processes for this single term
[00:02:17] this single term and again as it turns out this is
[00:02:18] and again as it turns out this is precisely what is done in classical ir
[00:02:21] precisely what is done in classical ir if we have just a single query just a
[00:02:23] if we have just a single query just a single term in our query
[00:02:26] single term in our query when we have multiple terms in the same
[00:02:28] when we have multiple terms in the same query
[00:02:29] query classical ir tends to treat them
[00:02:31] classical ir tends to treat them independently
[00:02:32] independently so we would basically add the weights up
[00:02:34] so we would basically add the weights up across all of the terms in the query
[00:02:37] across all of the terms in the query per document and then that's the score
[00:02:39] per document and then that's the score for the document this is precisely the
[00:02:41] for the document this is precisely the computation that's shown here where we
[00:02:43] computation that's shown here where we compute the relevance score
[00:02:46] compute the relevance score between a query and a document
[00:02:48] between a query and a document we would go over all the terms in the
[00:02:50] we would go over all the terms in the query and simply add the corresponding
[00:02:53] query and simply add the corresponding document term weights for all of these
[00:02:55] document term weights for all of these terms
[00:02:56] terms for that document
[00:02:57] for that document this gives us a score for the document
[00:03:00] this gives us a score for the document and we can then return the k documents
[00:03:02] and we can then return the k documents with the largest total scores
[00:03:04] with the largest total scores interestingly this reduces much of
[00:03:06] interestingly this reduces much of classical ir though of course not all of
[00:03:09] classical ir though of course not all of it to thinking about how do we best
[00:03:11] it to thinking about how do we best weigh each term document pair
[00:03:15] weigh each term document pair which has an undeniable similarity to
[00:03:17] which has an undeniable similarity to our first task this quarter in homework
[00:03:19] our first task this quarter in homework one
[00:03:20] one except of course here in ir we look at
[00:03:22] except of course here in ir we look at term to document relevance and not word
[00:03:25] term to document relevance and not word to word relatedness
[00:03:30] so thinking about term document
[00:03:32] so thinking about term document weighting
[00:03:33] weighting here are some intuitions that might be
[00:03:35] here are some intuitions that might be useful as we think about what makes a
[00:03:37] useful as we think about what makes a strong term waiting model in ir
[00:03:40] strong term waiting model in ir of course later in the next screencast
[00:03:43] of course later in the next screencast in particular we'll be looking at neural
[00:03:44] in particular we'll be looking at neural models
[00:03:46] models that go beyond this
[00:03:49] that go beyond this but for now perhaps the two most
[00:03:51] but for now perhaps the two most prominent
[00:03:52] prominent intuitions for term document waiting are
[00:03:54] intuitions for term document waiting are connected to our first unit's discussion
[00:03:57] connected to our first unit's discussion of frequency and
[00:03:58] of frequency and normalization in particular
[00:04:01] normalization in particular if a term t occurs frequently in
[00:04:03] if a term t occurs frequently in document d
[00:04:05] document d the document is more likely to be
[00:04:06] the document is more likely to be relevant for queries that include the
[00:04:08] relevant for queries that include the term t or so is one of our intuitions
[00:04:13] term t or so is one of our intuitions and um you know in terms of
[00:04:15] and um you know in terms of normalization if that term t is quite
[00:04:17] normalization if that term t is quite rare
[00:04:18] rare so if it occurs in only a few documents
[00:04:20] so if it occurs in only a few documents overall we take that as a stronger
[00:04:23] overall we take that as a stronger signal that document d is even more
[00:04:25] signal that document d is even more likely to be relevant for queries
[00:04:27] likely to be relevant for queries including t
[00:04:30] lastly if document d is rather short we
[00:04:32] lastly if document d is rather short we take that as also yet another signal
[00:04:35] take that as also yet another signal that might increase our confidence that
[00:04:37] that might increase our confidence that the term t was included in that rather
[00:04:40] the term t was included in that rather short document for a reason
[00:04:43] short document for a reason taking a step back and thinking more
[00:04:44] taking a step back and thinking more broadly we're still functioning under
[00:04:46] broadly we're still functioning under the same statement from the first unit
[00:04:49] the same statement from the first unit our goal is ultimately to amplify the
[00:04:51] our goal is ultimately to amplify the important signals the trustworthy and
[00:04:53] important signals the trustworthy and the unusual and to de-emphasize the
[00:04:56] the unusual and to de-emphasize the mundane and the quirky
[00:05:01] there are so many different term
[00:05:02] there are so many different term weighting functions in ir
[00:05:04] weighting functions in ir but most of them are directly inspired
[00:05:06] but most of them are directly inspired by tf-idf and take a very similar
[00:05:09] by tf-idf and take a very similar computational form
[00:05:11] computational form for tf-idf this is a slightly different
[00:05:14] for tf-idf this is a slightly different version from the one used in unit one what
[00:05:16] version from the one used in unit one what i have here is slightly different
[00:05:18] i have here is slightly different this is the more popular version in the
[00:05:19] this is the more popular version in the context of ir applications but tf-idf is
[00:05:22] context of ir applications but tf-idf is overloaded frequently and you will see
[00:05:24] overloaded frequently and you will see multiple
[00:05:25] multiple implementations
[00:05:27] implementations if you go look for them
[00:05:29] if you go look for them so we'll define n to be the size of the
[00:05:30] so we'll define n to be the size of the collection
[00:05:32] collection and df or document frequency of a term
[00:05:36] and df or document frequency of a term
[00:05:39] to be the number of documents that
[00:05:40] to be the number of documents that contain that term in the collection
[00:05:43] contain that term in the collection then tf or term frequency of a term
[00:05:46] then tf or term frequency of a term document pair
[00:05:48] document pair will be defined as the logarithm of the
[00:05:50] will be defined as the logarithm of the frequency of this term in this document
[00:05:53] frequency of this term in this document plus one
[00:05:54] plus one just for mathematical reasons
[00:05:58] idf or inverse document frequency is
[00:06:01] idf or inverse document frequency is defined as the logarithm
[00:06:02] defined as the logarithm of n divided by the document frequency
[00:06:04] of n divided by the document frequency of the term
[00:06:07] of the term tf-idf is then nothing but the product
[00:06:10] tf-idf is then nothing but the product of these two values for each query term
[00:06:13] of these two values for each query term summed up
[00:06:14] summed up at the end to assign a single overall
[00:06:16] at the end to assign a single overall score to each document by summing up
[00:06:18] score to each document by summing up across all query terms as we've
[00:06:19] across all query terms as we've discussed before
[00:06:21] discussed before of course higher scores are better and
[00:06:23] of course higher scores are better and the top k scoring documents are those
[00:06:25] the top k scoring documents are those that we would return to the searcher if
[00:06:26] that we would return to the searcher if we were to use tf-idf
[00:06:30] we were to use tf-idf notice how
[00:06:31] notice how both tf and idf grow sublinearly and in
[00:06:34] both tf and idf grow sublinearly and in particular logarithmically with
[00:06:37] particular logarithmically with frequency
[00:06:38] frequency and one over df respectively
[00:06:45] a much stronger term weighting model in
[00:06:47] a much stronger term weighting model in practice is bm25 or best match number 25
[00:06:51] practice is bm25 or best match number 25 and as you might imagine it took many
[00:06:53] and as you might imagine it took many attempts until bm25 was developed
[00:06:58] for our purposes unlike tf-idf
[00:07:00] for our purposes unlike tf-idf term frequency in bm25
[00:07:03] term frequency in bm25 saturates towards a constant value for
[00:07:06] saturates towards a constant value for each term
[00:07:07] each term and also it penalizes longer documents
[00:07:09] and also it penalizes longer documents when counting frequencies
[00:07:11] when counting frequencies since a longer document will naturally
[00:07:13] since a longer document will naturally contain more occurrences of its terms
[00:07:17] contain more occurrences of its terms these are the main differences and it
[00:07:19] these are the main differences and it really helps bm25 in practice be a much
[00:07:22] really helps bm25 in practice be a much stronger term weighting model
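the saturation and length penalty just described can be seen in one common bm25 formulation (the exact idf variant differs across implementations); the corpus and parameter defaults below are illustrative, not from the lecture.

```python
import math

# One common BM25 formulation; k1 controls tf saturation and b controls
# document-length normalization. Corpus and defaults are illustrative.
docs = {
    "d1": ["neural", "retrieval", "neural", "models"],
    "d2": ["classical", "retrieval"],
    "d3": ["neural", "networks", "neural", "neural", "neural"],
}
N = len(docs)
avgdl = sum(len(toks) for toks in docs.values()) / N
df = {}
for toks in docs.values():
    for t in set(toks):
        df[t] = df.get(t, 0) + 1

def bm25(query, doc_id, k1=1.5, b=0.75):
    toks = docs[doc_id]
    total = 0.0
    for t in query:
        if t not in df:
            continue
        freq = toks.count(t)
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        # tf saturates toward k1 + 1 as freq grows; the length term
        # penalizes documents longer than the collection average.
        total += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(toks) / avgdl))
    return total
```

with these toy documents, four occurrences of "neural" in d3 beat two in d1, but by much less than the raw counts would suggest: that is the saturation at work.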
[00:07:26] now that we've decided the behavior of
[00:07:28] now that we've decided the behavior of these
[00:07:29] these you know weighting functions or at least a
[00:07:31] you know weighting functions or at least a couple of them how would we actually
[00:07:33] couple of them how would we actually implement this as an actual system that
[00:07:36] implement this as an actual system that we could use for search
[00:07:39] we could use for search um so let's think about this whereas the
[00:07:41] um so let's think about this whereas the raw collection the actual text supports
[00:07:44] raw collection the actual text supports fast access from documents to terms
[00:07:46] fast access from documents to terms so basically tokenization gives us the
[00:07:49] so basically tokenization gives us the terms of each document
[00:07:51] terms of each document the term document matrix that we've
[00:07:53] the term document matrix that we've studied so far allows fast access from a
[00:07:56] studied so far allows fast access from a term to the documents so it's a bit of a
[00:07:58] term to the documents so it's a bit of a reverse process
[00:07:59] reverse process unfortunately the term document matrix
[00:08:02] unfortunately the term document matrix is way too sparse and contains too many
[00:08:05] is way too sparse and contains too many zeros to be useful since the average
[00:08:07] zeros to be useful since the average term does not occur in the vast majority
[00:08:10] term does not occur in the vast majority of documents if you think about it
[00:08:13] of documents if you think about it so the inverted index that's where it
[00:08:15] so the inverted index that's where it comes in this is a data structure that
[00:08:17] comes in this is a data structure that solves this problem it is essentially
[00:08:19] solves this problem it is essentially just a sparse representation of our
[00:08:21] just a sparse representation of our matrix here
[00:08:22] matrix here which maps each unique term in the
[00:08:24] which maps each unique term in the collection so each unique term in our
[00:08:26] collection so each unique term in our vocabulary
[00:08:27] vocabulary to what we call a posting list
[00:08:31] to what we call a posting list the posting list
[00:08:32] the posting list of a term t
[00:08:34] of a term t simply enumerates all of the actual
[00:08:36] simply enumerates all of the actual occurrences of the term t in the
[00:08:38] occurrences of the term t in the documents
[00:08:39] documents recording both the id of each document
[00:08:41] recording both the id of each document in which the term t appears and also its
[00:08:44] in which the term t appears and also its frequency in each of these documents
[00:08:50] so beyond term weighting models ir of
[00:08:53] so beyond term weighting models ir of course contains lots of models for other
[00:08:55] course contains lots of models for other things so there are models for expanding
[00:08:57] things so there are models for expanding queries and documents
[00:08:59] queries and documents this basically entails adding new terms
[00:09:01] this basically entails adding new terms to queries or to documents or to both to
[00:09:04] to queries or to documents or to both to help with the vocabulary mismatch
[00:09:06] help with the vocabulary mismatch problem that we discussed in the first
[00:09:07] problem that we discussed in the first screencast of the series
[00:09:09] screencast of the series basically when queries and documents use
[00:09:11] basically when queries and documents use different terms to express the same
[00:09:13] different terms to express the same thing
[00:09:14] thing there's also plenty of work on term
[00:09:16] there's also plenty of work on term dependence and phrase search notice that
[00:09:18] dependence and phrase search notice that so far we've assumed the terms in each
[00:09:20] so far we've assumed the terms in each query and in each document are
[00:09:22] query and in each document are independent and we function in a bag of
[00:09:24] independent and we function in a bag of words
[00:09:25] words fashion
[00:09:27] fashion but work on term dependence and phrase
[00:09:29] but work on term dependence and phrase search relaxes these assumptions
[00:09:31] search relaxes these assumptions that each query is a bag of independent
[00:09:33] that each query is a bag of independent terms
[00:09:34] terms lastly there's also lots of work on
[00:09:36] lastly there's also lots of work on learning to rank with various features
[00:09:39] learning to rank with various features like how to estimate relevance when
[00:09:40] like how to estimate relevance when documents have multiple fields like
[00:09:42] documents have multiple fields like maybe a title a body some headings a
[00:09:45] maybe a title a body some headings a footer and also anchor text which is a
[00:09:47] footer and also anchor text which is a very strong signal when you have it like
[00:09:49] very strong signal when you have it like in web search so this is basically the
[00:09:51] in web search so this is basically the text
[00:09:52] text from links in other pages to your page
[00:09:54] from links in other pages to your page the text in those links or around those
[00:09:56] the text in those links or around those links tends to be
[00:09:59] very useful
[00:10:01] very useful as a relevant signal
[00:10:04] as a relevant signal and of course also things like page rank
[00:10:05] and of course also things like page rank with link analysis and lots of other
[00:10:08] with link analysis and lots of other features for ir like recency and other
[00:10:10] features for ir like recency and other stuff
[00:10:12] stuff but i think it's worth mentioning that
[00:10:14] but i think it's worth mentioning that until recently if you just had a
[00:10:15] until recently if you just had a collection that you want to search and
[00:10:18] collection that you want to search and you didn't want to do a lot of tuning
[00:10:20] you didn't want to do a lot of tuning bm25 was a very strong baseline
[00:10:24] bm25 was a very strong baseline often the best that you could do ad hoc
[00:10:26] often the best that you could do ad hoc so without lots of tuning and
[00:10:28] so without lots of tuning and you know without lots of training data
[00:10:30] you know without lots of training data etc and this only changed a year or two
[00:10:33] etc and this only changed a year or two ago with the advent of bert-based
[00:10:35] ago with the advent of bert-based ranking which we'll discuss in detail in
[00:10:37] ranking which we'll discuss in detail in the next screencast of the series
[00:10:44] series okay
[00:10:47] series okay so we just built an ir system
[00:10:50] so we just built an ir system how do we evaluate our work what is
[00:10:52] how do we evaluate our work what is success like
[00:10:54] success like well a search system as you can imagine
[00:10:56] well a search system as you can imagine must be both efficient and effective
[00:10:58] must be both efficient and effective you know if we had infinite time and
[00:11:00] you know if we had infinite time and infinite resources we would just hire
[00:11:02] infinite resources we would just hire experts to look through all the
[00:11:04] experts to look through all the documents one by one to conduct the
[00:11:06] documents one by one to conduct the search but clearly we don't have that
[00:11:09] search but clearly we don't have that sort of ability
[00:11:11] sort of ability so efficiency in ir is paramount after
[00:11:14] so efficiency in ir is paramount after all we want our retrieval models to work
[00:11:16] all we want our retrieval models to work with sub-second latencies for
[00:11:18] with sub-second latencies for collections that may have hundreds of
[00:11:20] collections that may have hundreds of millions of documents if not even larger
[00:11:22] millions of documents if not even larger than that
[00:11:24] than that the most common measure of efficiency in
[00:11:26] the most common measure of efficiency in ir is latency which is simply the time
[00:11:28] ir is latency which is simply the time it takes to run one query through the
[00:11:30] it takes to run one query through the system say on average or perhaps at the
[00:11:33] system say on average or perhaps at the tail like the 95th percentile for
[00:11:35] tail like the 95th percentile for example
[00:11:37] example but you can also measure throughput
[00:11:39] but you can also measure throughput in queries per second
[00:11:41] in queries per second space how much you know maybe the
[00:11:43] space how much you know maybe the inverted index takes on disk versus you
[00:11:45] inverted index takes on disk versus you know say a term document matrix um
[00:11:48] know say a term document matrix um how well do you scale to different
[00:11:50] how well do you scale to different collection sizes in terms of the number
[00:11:51] collection sizes in terms of the number of documents or the size of the
[00:11:53] of documents or the size of the documents um
[00:11:54] documents um and how do you perform under different
[00:11:56] and how do you perform under different query loads many queries few queries
[00:11:58] query loads many queries few queries short queries long queries um and lastly
[00:12:01] short queries long queries um and lastly of course what sort of hardware do you
[00:12:02] of course what sort of hardware do you require is it just one cpu core many
[00:12:05] require is it just one cpu core many cores a bunch of gpus um but latency
[00:12:08] cores a bunch of gpus um but latency tends to be kind of once you've
[00:12:09] tends to be kind of once you've determined the other ones
[00:12:11] determined the other ones it's the go-to metric in most cases
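as a small illustration of these efficiency measures, here is how mean and tail latency might be summarized; the latency values are made up.

```python
import math
import statistics

# Made-up per-query latencies in milliseconds.
latencies_ms = [40, 55, 38, 300, 42, 47, 51, 44, 39, 62]

mean_ms = statistics.mean(latencies_ms)
# Nearest-rank 95th percentile: the value at rank ceil(0.95 * n).
p95_ms = sorted(latencies_ms)[math.ceil(0.95 * len(latencies_ms)) - 1]

# The single 300 ms outlier is visible in the mean but dominates the
# tail percentile, which is why tail latency is reported separately.
print(mean_ms, p95_ms)
```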
[00:12:16] more central to our discussion today and
[00:12:18] more central to our discussion today and we'll focus on this um for the rest of
[00:12:20] we'll focus on this um for the rest of the screencast is ir effectiveness or
[00:12:22] the screencast is ir effectiveness or the quality basically of an ir system
[00:12:27] the quality basically of an ir system and here we ask do our top k rankings
[00:12:30] and here we ask do our top k rankings for a query satisfy the user's
[00:12:32] for a query satisfy the user's information need
[00:12:35] information need answering this question tends to be
[00:12:36] answering this question tends to be harder than evaluation for typical
[00:12:39] harder than evaluation for typical machine learning tasks like
[00:12:40] machine learning tasks like classification or regression
[00:12:43] classification or regression because we're not really just taking you
[00:12:45] because we're not really just taking you know an item and assigning it a class
[00:12:48] know an item and assigning it a class we're trying to rank
[00:12:50] we're trying to rank all of the items in our corpus with
[00:12:51] all of the items in our corpus with respect to a query
[00:12:54] in practice if you have lots of users
[00:12:56] in practice if you have lots of users you could run online experiments
[00:12:59] you could run online experiments where you basically give different
[00:13:01] where you basically give different versions of your system to different
[00:13:02] versions of your system to different users
[00:13:03] users and compare
[00:13:05] and compare some metrics of satisfaction or
[00:13:08] some metrics of satisfaction or conversion basically in terms of you
[00:13:10] conversion basically in terms of you know purchases or otherwise but for
[00:13:12] know purchases or otherwise but for research purposes we're typically
[00:13:14] research purposes we're typically interested in reusable test collections
[00:13:17] interested in reusable test collections test collections that allow us to
[00:13:18] test collections that allow us to evaluate ir models offline and then
[00:13:21] evaluate ir models offline and then compare them against each other
[00:13:25] building a test collection entails three
[00:13:27] building a test collection entails three things
[00:13:28] things first we need to decide on a document
[00:13:30] first we need to decide on a document collection our corpus
[00:13:32] collection our corpus a set of test queries and we need to
[00:13:35] a set of test queries and we need to find or get or produce relevance
[00:13:37] find or get or produce relevance assessments for each query
[00:13:39] assessments for each query if resources permit a collection could
[00:13:41] if resources permit a collection could also include a train dev split of
[00:13:44] also include a train dev split of queries
[00:13:45] queries but given the high annotation cost it's
[00:13:47] but given the high annotation cost it's actually not uncommon in ir to find
[00:13:50] actually not uncommon in ir to find or create only a test set
[00:13:53] or create only a test set the key component of a test collection
[00:13:55] the key component of a test collection is the relevance assessments these are
[00:13:57] is the relevance assessments these are basically human annotated labels for
[00:13:59] basically human annotated labels for each query
[00:14:00] each query that enumerate for us either whether
[00:14:03] that enumerate for us either whether specific documents are relevant or not
[00:14:05] specific documents are relevant or not to that query
[00:14:06] to that query these query document assessments can
[00:14:09] these query document assessments can either be binary or they could take on a
[00:14:11] either be binary or they could take on a more fine-grained graded nature
[00:14:14] more fine-grained graded nature an example of that is grading a query
[00:14:16] an example of that is grading a query document pair as minus one zero one or
[00:14:19] document pair as minus one zero one or two with meanings of hey this is a junk
[00:14:22] two with meanings of hey this is a junk document minus one you should not
[00:14:24] document minus one you should not retrieve it for any query or this
[00:14:25] retrieve it for any query or this document is irrelevant um but it
[00:14:28] document is irrelevant um but it might be useful for other queries or
[00:14:30] might be useful for other queries or this document is quite relevant for this
[00:14:31] this document is quite relevant for this query but it's not a perfect match or
[00:14:34] query but it's not a perfect match or you know here is a really really good
[00:14:35] you know here is a really really good match um for our query which would be a
[00:14:38] match um for our query which would be a score of two or three depending on um
[00:14:41] score of two or three depending on um the grades that you're using
[00:14:44] the grades that you're using for the assessments
[00:14:47] for the assessments as you might imagine because we work
[00:14:49] as you might imagine because we work with you know potentially many millions
[00:14:51] with you know potentially many millions of documents it's usually infeasible to
[00:14:53] of documents it's usually infeasible to judge every single document for every
[00:14:55] judge every single document for every single query
[00:14:57] single query so instead we're often forced to make
[00:14:59] so instead we're often forced to make the assumption that unjudged documents
[00:15:02] the assumption that unjudged documents are not relevant
[00:15:03] are not relevant or at least to ignore them
[00:15:06] or at least to ignore them in some metrics of ir though for most
[00:15:08] in some metrics of ir though for most purposes they are treated as not
[00:15:09] purposes they are treated as not relevant
[00:15:11] relevant some test collections take this further
[00:15:13] some test collections take this further and only label one or two key documents
[00:15:15] and only label one or two key documents per query as relevant and assume
[00:15:17] per query as relevant and assume everything else is not relevant
[00:15:20] everything else is not relevant so this tends to be useful when you work
[00:15:22] so this tends to be useful when you work with particular data sets and you want
[00:15:23] with particular data sets and you want to keep it in mind as you do evaluation
[00:15:30] so many of the test collections out
[00:15:31] so many of the test collections out there in ir are annotated by trek or the
[00:15:34] there in ir are annotated by trek or the text retrieval conference which includes
[00:15:37] text retrieval conference which includes annual tracks for competing
[00:15:39] annual tracks for competing and comparing ir systems
[00:15:42] and comparing ir systems for instance the 2021 trek conference
[00:15:45] for instance the 2021 trek conference has tracks for search in the context of
[00:15:47] has tracks for search in the context of conversational assistance health
[00:15:49] conversational assistance health misinformation fair ranking and has a
[00:15:51] misinformation fair ranking and has a very popular deep learning
[00:15:54] very popular deep learning track as well which we'll discuss in
[00:15:55] track as well which we'll discuss in more detail
[00:15:57] more detail each trec campaign emphasizes careful
[00:16:00] each trec campaign emphasizes careful evaluation with a very small set of
[00:16:01] evaluation with a very small set of queries so just 50 queries is a very
[00:16:04] queries so just 50 queries is a very typical size actually
[00:16:06] typical size actually but trec extensively judges many many
[00:16:09] but trec extensively judges many many documents possibly hundreds of documents
[00:16:10] documents possibly hundreds of documents or even more
[00:16:12] or even more for each query here
[00:16:15] um so you can imagine an alternative
[00:16:17] um so you can imagine an alternative which we look at next where you have
[00:16:19] which we look at next where you have lots of queries but you only judge a
[00:16:20] lots of queries but you only judge a very small number of um of key documents
[00:16:23] very small number of um of key documents um for those queries with the intention
[00:16:26] um for those queries with the intention that the performance that you get will
[00:16:28] that the performance that you get will average out over a large enough set of
[00:16:30] average out over a large enough set of queries
[00:16:31] queries and this is exactly what happens in ms
[00:16:33] and this is exactly what happens in ms marco ranking tasks
[00:16:36] marco ranking tasks which is a collection of
[00:16:38] which is a collection of really popular
[00:16:40] really popular ir benchmarks
[00:16:42] ir benchmarks by microsoft ms marco contains more than
[00:16:45] by microsoft ms marco contains more than half a million bing search queries and
[00:16:47] half a million bing search queries and this is the largest public ir benchmark
[00:16:51] this is the largest public ir benchmark each query here is
[00:16:53] each query here is assessed with one or two relevant
[00:16:55] assessed with one or two relevant documents and we assume everything else
[00:16:57] documents and we assume everything else is not relevant
[00:16:59] is not relevant and
[00:17:00] and you know having this sparse annotation
[00:17:02] you know having this sparse annotation is often not a problem at all for
[00:17:04] is often not a problem at all for training
[00:17:05] training because we have so many training
[00:17:06] because we have so many training instances
[00:17:07] instances and so ms marco
[00:17:09] and so ms marco provides a tremendous resource for us
[00:17:11] provides a tremendous resource for us when it comes to building and training
[00:17:13] when it comes to building and training our models especially in the neural
[00:17:15] our models especially in the neural domain
[00:17:17] domain it also turns out that sparse labels are
[00:17:19] it also turns out that sparse labels are not too bad for evaluation either
[00:17:21] not too bad for evaluation either especially because of the size
[00:17:23] especially because of the size of the test queries we can use many
[00:17:24] of the test queries we can use many thousands of test queries and average
[00:17:26] thousands of test queries and average our results across all of them to get a
[00:17:28] our results across all of them to get a pretty reliable signal about how
[00:17:30] pretty reliable signal about how different systems compare
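with only one or two judged relevant documents per query, a common evaluation choice for such sparsely labeled collections is mean reciprocal rank cut off at depth 10 (mrr@10, the official ms marco metric); a minimal sketch, with made-up rankings and labels.

```python
# MRR@10 with sparse judgments: each query has one known relevant
# document; anything unjudged counts as non-relevant. Data is made up.
def mrr_at_10(rankings, relevant):
    total = 0.0
    for qid, ranked_docs in rankings.items():
        for rank, doc in enumerate(ranked_docs[:10], start=1):
            if doc == relevant[qid]:
                total += 1.0 / rank  # reciprocal rank of the first hit
                break
    return total / len(rankings)

rankings = {"q1": ["d3", "d7", "d1"], "q2": ["d2", "d9"], "q3": ["d5"]}
relevant = {"q1": "d7", "q2": "d2", "q3": "d8"}
print(mrr_at_10(rankings, relevant))  # (1/2 + 1 + 0) / 3 = 0.5
```

averaged over many thousands of test queries, this sparse per-query signal becomes a fairly reliable basis for comparing systems.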
[00:17:33] different systems compare there are multiple test collections out
[00:17:34] there are multiple test collections out there on top of ms marco um so there's
[00:17:37] there on top of ms marco um so there's the original passage ranking task
[00:17:39] the original passage ranking task a newer document ranking task where the
[00:17:41] a newer document ranking task where the documents are much longer
[00:17:44] documents are much longer but there's fewer of them um and then
[00:17:46] but there's fewer of them um and then there is also the trec deep
[00:17:47] there is also the trec deep learning track which we've mentioned
[00:17:49] learning track which we've mentioned before which is happening every year
[00:17:51] before which is happening every year since 2019
[00:17:52] since 2019 um and which uses the ms marco data
[00:17:55] um and which uses the ms marco data especially for for training mostly um
[00:17:57] especially for for training mostly um but has far fewer queries for testing
[00:18:00] but has far fewer queries for testing with lots more
[00:18:03] with lots more labels for evaluation a lot more
[00:18:05] labels for evaluation a lot more extensive assessments and judgments for
[00:18:07] extensive assessments and judgments for evaluation so these are much denser
[00:18:09] evaluation so these are much denser labels
[00:18:12] there are also plenty of other rather
[00:18:14] there are also plenty of other rather domain specific ir benchmarks
[00:18:17] domain specific ir benchmarks many of which are collected in this
[00:18:19] many of which are collected in this table
[00:18:20] table by Thakur et al.
[00:18:22] by Thakur et al. in a very recent preprint
[00:18:26] as you can see these benchmarks vary
[00:18:27] as you can see these benchmarks vary greatly in terms of the training size um
[00:18:30] greatly in terms of the training size um if there is any training at all
[00:18:32] if there is any training at all um the test set size the average query
[00:18:34] um the test set size the average query length the average document length and
[00:18:36] length the average document length and many other factors
[00:18:39] many other factors BEIR or benchmarking IR is a recent
[00:18:42] effort
[00:18:43] effort by Thakur et al. here
[00:18:45] by Thakur et al. here to use all of these different data sets
[00:18:47] to use all of these different data sets for zero shot or out of domain testing
[00:18:50] for zero shot or out of domain testing of ir models specifically in BEIR we
[00:18:53] of ir models specifically in BEIR we take already trained ir models that do
[00:18:55] take already trained ir models that do not have access to any validation or
[00:18:57] not have access to any validation or training data
[00:18:58] training data on these downstream ir tasks and test
[00:19:01] on these downstream ir tasks and test them out out of the box to observe their
[00:19:03] them out out of the box to observe their out of domain retrieval quality so that
[00:19:06] out of domain retrieval quality so that is without training
[00:19:07] is without training on these
[00:19:08] on these new domains
[00:19:16] okay
[00:19:17] okay so we now have a test collection with
[00:19:18] so we now have a test collection with queries documents and assessments
[00:19:22] queries documents and assessments how do we compare ir systems on this
[00:19:24] how do we compare ir systems on this collection
[00:19:25] collection first we will ask each ir system to
[00:19:28] first we will ask each ir system to produce its top k ranking say its top 10
[00:19:30] produce its top k ranking say its top 10 results and we'll use an ir metric to
[00:19:34] results and we'll use an ir metric to compare all of these systems at that
[00:19:35] compare all of these systems at that cutoff k
[00:19:37] cutoff k the choice of ir metric and the cutoff k
[00:19:39] the choice of ir metric and the cutoff k will depend entirely on the task so i
[00:19:42] will depend entirely on the task so i will briefly motivate each metric as we
[00:19:44] will briefly motivate each metric as we go through them
[00:19:46] go through them all of the metrics we will go through
[00:19:48] all of the metrics we will go through are simply averaged across all queries
[00:19:50] are simply averaged across all queries and so to keep things simple i will show
[00:19:53] and so to keep things simple i will show the computation of the metric for just
[00:19:54] the computation of the metric for just one query in each case but you want to
[00:19:56] one query in each case but you want to keep in mind that this is averaged
[00:19:58] keep in mind that this is averaged across queries
[00:20:01] let us start with two of the simplest ir
[00:20:03] let us start with two of the simplest ir metrics
[00:20:04] metrics which are success and mrr
[00:20:07] which are success and mrr for a given query let rank
[00:20:09] for a given query let rank be the position of the first relevant
[00:20:11] be the position of the first relevant document that we can see in the top k
[00:20:13] document that we can see in the top k list of results
[00:20:16] list of results success at k
[00:20:17] success at k will just be 1 if there is a relevant
[00:20:19] will just be 1 if there is a relevant result in the top k list and 0 otherwise
[00:20:22] result in the top k list and 0 otherwise this is a very simple metric as you can
[00:20:24] this is a very simple metric as you can see it can be useful in cases where we
[00:20:27] see it can be useful in cases where we assume that the user just needs one
[00:20:28] assume that the user just needs one relevant result anywhere in the top k
[00:20:31] relevant result anywhere in the top k and in particular it can be useful if if
[00:20:34] and in particular it can be useful if if our retrieval is fed to a downstream
[00:20:36] our retrieval is fed to a downstream model that looks at the top k results
[00:20:38] model that looks at the top k results and then does something with them so it
[00:20:39] and then does something with them so it can read all of them and it will read
[00:20:41] can read all of them and it will read all of them anyway
[00:20:42] all of them anyway so we're just interested in binary
[00:20:44] so we're just interested in binary relevance here
[00:20:46] relevance here mean reciprocal rank or mrr
[00:20:49] mean reciprocal rank or mrr also assumes that the user only
[00:20:51] also assumes that the user only needs one relevant document in the top k
[00:20:54] needs one relevant document in the top k but it assumes that the user does care
[00:20:57] but it assumes that the user does care about the position of that relevant
[00:20:58] about the position of that relevant document in the ranking
[00:21:00] document in the ranking so a relevant document at the second
[00:21:02] so a relevant document at the second position for example is only given half
[00:21:04] position for example is only given half of the weight of a relevant document in
[00:21:06] of the weight of a relevant document in the top position
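These two metrics can be sketched in a few lines of Python; the helper names and data structures below are illustrative, not code from the course (`ranking` is a list of document ids in ranked order, `relevant` the set of judged-relevant ids):

```python
def success_at_k(ranking, relevant, k):
    """1 if any relevant document appears in the top-k list, else 0."""
    return 1.0 if any(doc in relevant for doc in ranking[:k]) else 0.0

def rr_at_k(ranking, relevant, k):
    """Reciprocal rank: 1/position of the first relevant document in the
    top-k list, or 0 if none appears. MRR averages this over all queries."""
    for pos, doc in enumerate(ranking[:k], start=1):
        if doc in relevant:
            return 1.0 / pos
    return 0.0

ranking = ["d3", "d7", "d1", "d9"]   # toy top-4 list for one query
relevant = {"d7", "d9"}
print(success_at_k(ranking, relevant, 4))  # 1.0
print(rr_at_k(ranking, relevant, 4))       # 0.5 (first relevant doc at rank 2)
```

Note how a relevant document at rank 2 gets half the weight of one at rank 1, exactly as described above.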
[00:21:16] you're probably already familiar with
[00:21:18] you're probably already familiar with precision and recall but let's define
[00:21:20] precision and recall but let's define them here in the context of top k ranked
[00:21:22] them here in the context of top k ranked retrieval
[00:21:24] retrieval for a given query let Ret at k
[00:21:27] be the top k retrieved documents that
[00:21:29] be the top k retrieved documents that set of top k retrieved documents and let Rel be
[00:21:32] set of top k retrieved documents and let Rel be the set of all documents that we judged
[00:21:34] the set of all documents that we judged as relevant as part of our
[00:21:36] as relevant as part of our assessments
[00:21:39] as part of our assessments in this case precision at k is just a
[00:21:41] in this case precision at k is just a fraction of the retrieved items that are
[00:21:43] fraction of the retrieved items that are actually relevant
[00:21:44] actually relevant and recall at k is the fraction of all
[00:21:46] and recall at k is the fraction of all relevant items that are actually
[00:21:48] relevant items that are actually retrieved
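A short sketch of these two definitions, again with illustrative names (this assumes the ranking has at least k results, the usual convention for the precision denominator):

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def recall_at_k(ranking, relevant, k):
    """Fraction of all judged-relevant documents that appear in the top k."""
    return len(set(ranking[:k]) & relevant) / len(relevant)

ranking = ["d3", "d7", "d1", "d9"]
relevant = {"d7", "d9", "d5"}
print(precision_at_k(ranking, relevant, 4))  # 0.5   (2 of the 4 retrieved are relevant)
print(recall_at_k(ranking, relevant, 4))     # ~0.667 (2 of the 3 relevant are retrieved)
```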
[00:21:53] a pretty popular metric is also map or
[00:21:56] a pretty popular metric is also map or mean average precision or just average
[00:21:58] mean average precision or just average precision for one query
[00:22:00] precision for one query which essentially brings together
[00:22:01] which essentially brings together notions from both precision and recall
[00:22:04] notions from both precision and recall to compute average precision for one
[00:22:06] to compute average precision for one query we will add up the precision at i
[00:22:09] query we will add up the precision at i for every position i from 1 through k
[00:22:11] for every position i from 1 through k where the i-th document is relevant
[00:22:14] where the i-th document is relevant we will divide this whole quantity by
[00:22:16] we will divide this whole quantity by the total number of documents that were
[00:22:18] the total number of documents that were judged as relevant for this query
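Average precision for one query, as just defined, can be sketched as follows (illustrative code; MAP is this value averaged over all queries):

```python
def average_precision(ranking, relevant, k):
    """Sum precision@i over every rank i <= k whose document is relevant,
    then divide by the total number of judged-relevant documents."""
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranking[:k], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i   # precision at position i
    return precision_sum / len(relevant)

ranking = ["d3", "d7", "d1", "d9"]
relevant = {"d7", "d9"}
print(average_precision(ranking, relevant, 4))  # (1/2 + 2/4) / 2 = 0.5
```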
[00:22:23] all of the metrics that we've considered
[00:22:24] all of the metrics that we've considered so far only interact with binary
[00:22:26] so far only interact with binary relevance that is they just care whether
[00:22:29] relevance that is they just care whether each document that is retrieved is
[00:22:30] each document that is retrieved is considered relevant or not relevant
[00:22:33] considered relevant or not relevant dcg or discounted cumulative gain works
[00:22:36] dcg or discounted cumulative gain works with graded relevance so for instance 0
[00:22:38] with graded relevance so for instance 0 1 2 and 3.
[00:22:41] 1 2 and 3. for each position in the ranking from 1
[00:22:43] for each position in the ranking from 1 through k we will divide the graded
[00:22:45] through k we will divide the graded relevance of the retrieved document
[00:22:47] relevance of the retrieved document at that position
[00:22:49] at that position by the logarithm of the position which
[00:22:51] by the logarithm of the position which essentially discounts the value of a
[00:22:53] essentially discounts the value of a relevant document if it appears late in
[00:22:55] relevant document if it appears late in the ranking
[00:22:58] unlike the other metrics the maximum dcg
[00:23:00] unlike the other metrics the maximum dcg is often not equal to 1.
[00:23:03] is often not equal to 1. so we can also compute normalized dcg or
[00:23:06] so we can also compute normalized dcg or ndcg by dividing for each query by the
[00:23:09] ndcg by dividing for each query by the ideal dcg
[00:23:11] ideal dcg this is obtained um basically if all of
[00:23:14] this is obtained um basically if all of the relevant documents are at the top of
[00:23:16] the relevant documents are at the top of our top k ranking and that and they are
[00:23:18] our top k ranking and that and they are sorted by decreasing relevance
[00:23:20] sorted by decreasing relevance so all of the twos before all of the
[00:23:22] so all of the twos before all of the ones before all of the you know the
[00:23:24] ones before all of the you know the zeros in this case that are not relevant
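DCG and its normalized variant can be sketched like this (illustrative code; the discount used here is the common log2(rank + 1) form of the "logarithm of the position" mentioned above):

```python
import math

def dcg_at_k(gains, k):
    """gains[i] is the graded relevance (e.g. 0-3) of the document at rank
    i + 1; each gain is discounted by log2(rank + 1), so late hits count less."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, judged_gains, k):
    """Divide by the ideal DCG: all judged documents sorted by decreasing
    relevance (all the 3s, then the 2s, then the 1s, ...)."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

print(dcg_at_k([3, 0, 2], 3))               # 3/1 + 0 + 2/2 = 4.0
print(ndcg_at_k([3, 0, 2], [3, 2, 0], 3))   # below 1.0: the grade-2 doc appears too late
```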
[00:23:28] all right having discussed classical ir
[00:23:30] all right having discussed classical ir and evaluation in this screencast we
[00:23:32] and evaluation in this screencast we will focus on neural ir and in
[00:23:34] will focus on neural ir and in particular state-of-the-art ir models
[00:23:36] particular state-of-the-art ir models that use what we've learned so far in
[00:23:38] that use what we've learned so far in nlu in the next screencast
Lecture 040
Neural IR, part 1 | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=XfYNqwWpoGY
---
Transcript
[00:00:06] hello everyone welcome to part three of
[00:00:08] hello everyone welcome to part three of the series the screencast will be the
[00:00:11] the series the screencast will be the first of two or three on neural ir and
[00:00:14] first of two or three on neural ir and in it we'll be exploring the inputs
[00:00:16] in it we'll be exploring the inputs outputs training and inference in the
[00:00:19] outputs training and inference in the context of neural
[00:00:23] IR let's quickly start with a reminder
[00:00:26] IR let's quickly start with a reminder of our setup from the previous
[00:00:28] of our setup from the previous screencast offline we're given a large
[00:00:31] screencast offline we're given a large Corpus of text documents we will
[00:00:34] Corpus of text documents we will pre-process and Index this Corpus for
[00:00:37] pre-process and Index this Corpus for fast
[00:00:38] fast retrieval online we're given a query
[00:00:41] retrieval online we're given a query that we want to answer our output will
[00:00:44] that we want to answer our output will be a list of the top K most relevant
[00:00:46] be a list of the top K most relevant documents for this
[00:00:51] query in the classical IR screencast we
[00:00:54] query in the classical IR screencast we discussed bm25 as a strong term matching
[00:00:57] discussed bm25 as a strong term matching retrieval
[00:00:58] retrieval model so should we just use
[00:01:01] model so should we just use bm25 the short answer is that we
[00:01:04] bm25 the short answer is that we could but if our interest is getting the
[00:01:07] could but if our interest is getting the highest quality that we can then we
[00:01:10] highest quality that we can then we should probably be using neural IR as we
[00:01:13] should probably be using neural IR as we will see neural IR makes a lot of use of
[00:01:16] will see neural IR makes a lot of use of our nlu work in creative and interesting
[00:01:20] our nlu work in creative and interesting ways the long answer to whether we
[00:01:23] ways the long answer to whether we should be using bm25 is that it depends
[00:01:26] should be using bm25 is that it depends among other factors depends on our
[00:01:28] among other factors depends on our budget
[00:01:31] budget each IR model poses a different
[00:01:33] each IR model poses a different efficiency Effectiveness
[00:01:36] efficiency Effectiveness trade-off in many cases we're interested
[00:01:38] trade-off in many cases we're interested in maximizing Effectiveness maximizing
[00:01:41] in maximizing Effectiveness maximizing quality as long as efficiency is
[00:01:44] quality as long as efficiency is acceptable let's begin to explore this
[00:01:48] acceptable let's begin to explore this on the MS Marco collection that we
[00:01:50] on the MS Marco collection that we introduced in the previous
[00:01:53] introduced in the previous screencast here we'll be measuring
[00:01:56] screencast here we'll be measuring Effectiveness using the mean reciprocal
[00:01:58] Effectiveness using the mean reciprocal rank at cutoff 10 and we will measure
[00:02:01] rank at cutoff 10 and we will measure efficiency and in particular latency
[00:02:03] efficiency and in particular latency using
[00:02:07] milliseconds this figure here shows bm25
[00:02:10] milliseconds this figure here shows bm25 retrieval using a popular toolkit called
[00:02:13] retrieval using a popular toolkit called Anserini as one data point within a wide
[00:02:16] Anserini as one data point within a wide range of mrr values and latency
[00:02:23] possibilities just as a reminder lower
[00:02:26] possibilities just as a reminder lower latency is better and the latency here
[00:02:29] latency is better and the latency here is shown on a logarithmic
[00:02:32] is shown on a logarithmic scale
[00:02:34] scale and higher mrr um is also better the
[00:02:37] and higher mrr um is also better the higher our mrr is the better the model's
[00:02:41] higher our mrr is the better the model's quality so what else could exist in this
[00:02:44] quality so what else could exist in this large empty space for
[00:02:46] large empty space for now we're going to see um this space
[00:02:49] now we're going to see um this space fill up um with many different neural IR
[00:02:51] fill up um with many different neural IR models over the next couple of
[00:02:54] models over the next couple of screencasts and the central question now
[00:02:57] screencasts and the central question now and then will generally be how can we
[00:02:59] and then will generally be how can we improve our mrr at 10 or whatever
[00:03:02] improve our mrr at 10 or whatever Effectiveness metric we choose to work
[00:03:03] Effectiveness metric we choose to work with possibly at the expense of
[00:03:06] with possibly at the expense of increasing latency a
[00:03:10] bit okay so let's actually take a look
[00:03:14] bit okay so let's actually take a look at how neural IR models will work
[00:03:17] at how neural IR models will work specifically at their input and output
[00:03:20] specifically at their input and output behavior for the purposes of this short
[00:03:22] behavior for the purposes of this short screencast we'll treat the neural ranker
[00:03:25] screencast we'll treat the neural ranker as a
[00:03:26] as a blackbox we will consider various
[00:03:28] blackbox we will consider various implementations for this black box
[00:03:30] implementations for this black box function in the next
[00:03:33] screencast we will feed this neural IR
[00:03:37] screencast we will feed this neural IR blackbox a query and a document and the
[00:03:40] blackbox a query and a document and the model will do its thing and return to us
[00:03:42] model will do its thing and return to us a single score that estimates the
[00:03:45] a single score that estimates the relevance of this query to that
[00:03:49] relevance of this query to that document for the same query we will
[00:03:51] document for the same query we will repeat this process for every document
[00:03:53] repeat this process for every document that we want to
[00:03:54] that we want to score and we will finally sort all of
[00:03:57] score and we will finally sort all of these documents by decreasing relevance
[00:04:00] these documents by decreasing relevance score that will give us the topk list of
[00:04:05] results so far this sounds simple enough
[00:04:09] results so far this sounds simple enough but how should we train this neural
[00:04:11] but how should we train this neural model for
[00:04:13] model for ranking this might not be super obvious
[00:04:16] ranking this might not be super obvious but one pretty effective choice is
[00:04:18] but one pretty effective choice is simply two-way
[00:04:19] simply two-way classification pairwise
[00:04:21] classification pairwise classification here each training
[00:04:24] classification here each training example will be a triple specifically
[00:04:27] example will be a triple specifically each training instance will contain a
[00:04:29] each training instance will contain a query
[00:04:30] query a relevant or positive document and an
[00:04:33] a relevant or positive document and an irrelevant document or a
[00:04:35] irrelevant document or a negative in the forward pass during
[00:04:38] negative in the forward pass during training will feed the model the query
[00:04:40] training will feed the model the query and the positive
[00:04:41] and the positive document and separately will feed the
[00:04:44] document and separately will feed the query and the negative document to the
[00:04:45] query and the negative document to the neural
[00:04:48] neural ranker and we'll optimize the entire the
[00:04:50] ranker and we'll optimize the entire the entire neural network end to end with
[00:04:52] entire neural network end to end with gradient descent using simple
[00:04:54] gradient descent using simple classification
[00:04:56] classification loss in this case cross entropy loss
[00:04:59] loss in this case cross entropy loss with soft
[00:05:01] with soft Max the goal here is to maximize the
[00:05:03] Max the goal here is to maximize the score of the positive document and to
[00:05:06] score of the positive document and to minimize the score assigned to the
[00:05:08] minimize the score assigned to the negative
[00:05:11] negative document recall that we can get
[00:05:13] document recall that we can get positives for each query um from our
[00:05:15] positives for each query um from our relevance assessments and that and that
[00:05:17] relevance assessments and that and that every document that was not labeled as
[00:05:19] every document that was not labeled as positive can often be treated as an
[00:05:22] positive can often be treated as an implicit negative so we could use this
[00:05:24] implicit negative so we could use this um in generating triples for two-way
[00:05:27] um in generating triples for two-way classification training for our for our
[00:05:29] classification training for our for our neural
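The two-way classification loss over a (query, positive, negative) triple can be sketched as follows; in practice this is just a framework's softmax cross-entropy applied to the two ranker scores, and the stand-alone function below is only an illustration:

```python
import math

def pairwise_softmax_ce(pos_score, neg_score):
    """Cross-entropy over the two scores with the positive document as the
    target class: -log softmax(pos). Minimizing this pushes the positive
    score up and the negative score down."""
    m = max(pos_score, neg_score)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(pos_score - m) + math.exp(neg_score - m))
    return -(pos_score - log_z)

# The loss shrinks as the ranker separates the positive from the negative.
print(pairwise_softmax_ce(5.0, 5.0))   # ~0.693 (= log 2, no separation yet)
print(pairwise_softmax_ce(9.0, 5.0))   # much smaller
```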
[00:05:32] ranker once our neural ranker is trained
[00:05:35] ranker once our neural ranker is trained inference or actually conducting the
[00:05:37] inference or actually conducting the ranking is very easy given a query we
[00:05:41] ranking is very easy given a query we just pick each document pass the query
[00:05:43] just pick each document pass the query and the document through the neural
[00:05:45] and the document through the neural network get a score and then we'll sort
[00:05:47] network get a score and then we'll sort all of the documents by score this will
[00:05:50] all of the documents by score this will give us the top k list of documents
[00:05:52] give us the top k list of documents however there is just a small
[00:05:56] however there is there is just a a small um yet very major problem collections
[00:05:59] um yet very major problem collections often have many millions if not billions
[00:06:01] often have many millions if not billions of documents even if our model is so
[00:06:04] of documents even if our model is so fast that it processes each document in
[00:06:06] fast that it processes each document in one microsecond one millionth of a second it
[00:06:09] one microsecond one millionth of a second it would still require nine seconds per
[00:06:11] would still require nine seconds per query um for a data set like Ms Marco
[00:06:15] quid um um for a data set like Ms Marco with nine million Pages which is way too
[00:06:17] with nine million Pages which is way too slow for most practical
[00:06:21] applications to deal with this in
[00:06:23] applications to deal with this in practice neural IR models are often used
[00:06:25] practice neural IR models are often used as rankers models that rescore only the
[00:06:29] as rankers models that rescore only the top k documents obtained by another
[00:06:31] top k documents obtained by another model to improve the final
[00:06:34] model to improve the final ranking one of the most common pipeline
[00:06:36] ranking one of the most common pipeline designs is to rank the top thousand
[00:06:38] designs is to rank the top thousand documents obtained by
[00:06:41] documents obtained by bm25 this can be great because it cuts
[00:06:44] bm25 this can be great because it cuts down the work um for a collection with
[00:06:46] down the work um for a collection with 10 million passages by a factor of
[00:06:48] 10 million passages by a factor of 10,000 because we only need to rank a
[00:06:50] 10,000 because we only need to rank a thousand documents with the neural model
[00:06:53] thousand documents with the neural model but it also introduces an artificial
[00:06:55] but it also introduces an artificial ceiling on recall limits recall in an
[00:06:58] ceiling on recall limits recall in an artificial way since now all of the
[00:07:00] artificial way since now all of the relevant documents that bm25 our first
[00:07:03] relevant documents that bm25 our first stage ranker fails to retrieve cannot
[00:07:06] stage ranker fails to retrieve cannot possibly be reranked highly by our shiny
[00:07:09] possibly be reranked highly by our shiny new IR
[00:07:13] ranker so can we do better it turns out
[00:07:17] ranker so can we do better it turns out that the answer is yes we'll discuss the
[00:07:19] that the answer is yes we'll discuss the notion of end-to-end retrieval later where
[00:07:22] notion of end-to-end retrieval later where our neural model will be able to quickly
[00:07:24] our neural model will be able to quickly conduct the search by itself over the
[00:07:26] conduct the search by itself over the entire collection without a ranking
[00:07:28] entire collection without a ranking pipeline but first we'll discuss a
[00:07:31] pipeline but first we'll discuss a number of neural rankers in detail in
[00:07:33] number of neural rankers in detail in the next screencast
Lecture 041
Neural IR, part 2 | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=IWgjCIguAoA
---
Transcript
[00:00:05] hello everyone welcome to part four of
[00:00:08] hello everyone welcome to part four of our series on nlu and ir
[00:00:11] our series on nlu and ir the screencast will be the second among
[00:00:12] the screencast will be the second among three of our videos on neural
[00:00:14] three of our videos on neural information retrieval
[00:00:19] just to recap this is the functional
[00:00:21] just to recap this is the functional view of neural ir that we left in the
[00:00:23] view of neural ir that we left in the previous screencast
[00:00:25] previous screencast our model will take a query and a
[00:00:27] our model will take a query and a document and will then output a score
[00:00:30] document and will then output a score that will estimate the relevance of this
[00:00:32] that will estimate the relevance of this document to the query
[00:00:34] document to the query we will sort the documents by decreasing
[00:00:36] we will sort the documents by decreasing score
[00:00:37] score to get the top k results
[00:00:41] let's begin
[00:00:42] let's begin with a very effective paradigm for
[00:00:45] with a very effective paradigm for building neural ir models
[00:00:47] building neural ir models namely query document interaction
[00:00:53] so given a query and a document will
[00:00:55] so given a query and a document will tokenize them
[00:00:57] tokenize them then we'll embed the tokens of each into
[00:00:59] then we'll embed the tokens of each into a static vector representation
[00:01:02] a static vector representation so these could be glove vectors for
[00:01:04] so these could be glove vectors for example
[00:01:06] example or the initial
[00:01:07] or the initial representations of bert
[00:01:10] we'll then build what is called a query
[00:01:12] we'll then build what is called a query document interaction matrix this is
[00:01:15] document interaction matrix this is typically nothing but
[00:01:17] typically nothing but a matrix of cosine similarities between
[00:01:20] a matrix of cosine similarities between each pair of word
[00:01:21] each pair of word um each pair of words across the query
[00:01:24] um each pair of words across the query and the document
[00:01:28] now that we have this matrix we just
[00:01:30] now that we have this matrix we just need to reduce it to a single score that
[00:01:32] need to reduce it to a single score that estimates the relevance of our document
[00:01:35] estimates the relevance of our document to this query
[00:01:38] to do this we'll just learn a bunch of
[00:01:40] to do this we'll just learn a bunch of neural layers like convolution or linear
[00:01:42] neural layers like convolution or linear layers with pooling
[00:01:45] layers with pooling until we end up with a single score for
[00:01:48] until we end up with a single score for this query document pair
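A toy sketch of the query-document interaction matrix (the max-pool-then-sum reduction at the end is only a stand-in for the learned convolution/linear layers with pooling; real models would also use trained embeddings such as GloVe vectors):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def interaction_matrix(query_vecs, doc_vecs):
    """M[i][j] = cosine similarity between query token i and doc token j."""
    return [[cosine(q, d) for d in doc_vecs] for q in query_vecs]

def toy_score(query_vecs, doc_vecs):
    """Illustrative reduction: take each query term's best document match
    and sum; a real model learns this reduction instead."""
    return sum(max(row) for row in interaction_matrix(query_vecs, doc_vecs))

q = [[1.0, 0.0], [0.0, 1.0]]   # two query-token embeddings
d = [[1.0, 0.0], [0.7, 0.7]]   # two document-token embeddings
print(toy_score(q, d))
```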
[00:01:52] many ir models out there fall in this
[00:01:55] many ir models out there fall in this category especially ones that were
[00:01:57] category especially ones that were introduced between 2016 through 2018 or
[00:02:00] introduced between 2016 through 2018 or 2019.
[00:02:04] with enough training data
[00:02:06] with enough training data query document interaction models can
[00:02:09] query document interaction models can achieve considerably better quality than
[00:02:11] achieve considerably better quality than bag of words models like bm25
[00:02:14] bag of words models like bm25 and they can actually do that at a
[00:02:15] and they can actually do that at a reasonable increase
[00:02:17] reasonable increase moderate increase in computational cost
[00:02:23] so as discussed in the previous
[00:02:24] so as discussed in the previous screencasts
[00:02:26] screencasts these models are typically used um as
[00:02:28] these models are typically used um as the last stage of a re-ranking pipeline
[00:02:31] the last stage of a re-ranking pipeline and in particular
[00:02:33] and in particular in this
[00:02:34] in this figure here they're used to re-rank the
[00:02:35] figure here they're used to re-rank the top thousand passages retrieved by bm25
[00:02:39] top thousand passages retrieved by bm25 and this is done to make sure that their
[00:02:40] and this is done to make sure that their latency is acceptable
[00:02:43] latency is acceptable while still improving the mrr
[00:02:45] while still improving the mrr and the quality over bm25 retrieval
[00:02:55] more recently in 2019 the ir community
[00:02:59] more recently in 2019 the ir community discovered the power of bert for
[00:03:01] discovered the power of bert for ranking
[00:03:02] ranking functionally this is very similar to um
[00:03:06] functionally this is very similar to um the paradigm that we just saw with query
[00:03:07] the paradigm that we just saw with query document interactions
[00:03:09] document interactions so here we'll we're going to feed both
[00:03:11] so here we'll we're going to feed both the query and the document as one
[00:03:13] the query and the document as one sequence with two segments one segment
[00:03:15] sequence with two segments one segment for the query and one segment for the
[00:03:17] for the query and one segment for the document as shown
[00:03:19] document as shown we'll run this through all the layers of
[00:03:21] we'll run this through all the layers of bert
[00:03:22] bert and we'll finally extract the class
[00:03:24] and we'll finally extract the class token embedding from bert and reduce it
[00:03:27] token embedding from bert and reduce it to a single score through a final linear
[00:03:29] to a single score through a final linear head on top of
[00:03:30] head on top of bert as you can probably tell this is
[00:03:33] bert as you can probably tell this is nothing but a standard bert classifier
[00:03:36] nothing but a standard bert classifier where we're going to take the scores
[00:03:39] where we're going to take the scores or the confidence that's the output of
[00:03:40] or the confidence that's the output of the classifier and use it for ranking
[00:03:42] the classifier and use it for ranking our passages
[00:03:44] our passages and like any other task with bert we
[00:03:47] and like any other task with bert we should first fine-tune this bert model
[00:03:49] should first fine-tune this bert model with appropriate training data before we
[00:03:51] with appropriate training data before we use it for our task
[00:03:54] we've discussed how to train neural ir
[00:03:56] we've discussed how to train neural ir models in the previous screencast so
[00:03:58] models in the previous screencast so refer to that if you'd like
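The two-segment input packing described above can be sketched as follows (illustrative only; a real BERT tokenizer also maps tokens to ids, pads, and truncates more carefully):

```python
def pack_pair(query_tokens, doc_tokens, max_len=512):
    """Build the single sequence a BERT-style cross-encoder scores:
    [CLS] query [SEP] document [SEP], with segment ids 0 for the query
    part and 1 for the document part. The [CLS] output embedding is what
    the final linear head reduces to a single relevance score."""
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    segments = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens[:max_len], segments[:max_len]

tokens, segments = pack_pair(["neural", "ir"], ["bert", "rankers", "work"])
print(tokens)    # ['[CLS]', 'neural', 'ir', '[SEP]', 'bert', 'rankers', 'work', '[SEP]']
print(segments)  # [0, 0, 0, 0, 1, 1, 1, 1]
```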
[00:04:03] so this really simple model on top of
[00:04:05] so this really simple model on top of bert was the foundation for tremendous
[00:04:08] bert was the foundation for tremendous progress in search over the past two
[00:04:10] progress in search over the past two years
[00:04:12] years and in particular um
[00:04:14] and in particular um it's worth mentioning the first public
[00:04:16] it's worth mentioning the first public instance of this which was in january of
[00:04:18] instance of this which was in january of 2019 on the ms marco passage ranking
[00:04:21] 2019 on the ms marco passage ranking task
[00:04:22] task here nogueira and cho
[00:04:24] here nogueira and cho made a simple bert-based submission to
[00:04:26] made a simple bert-based submission to the leaderboard of ms marco
[00:04:29] the leaderboard of ms marco that demonstrated dramatic gains over
[00:04:31] that demonstrated dramatic gains over the previous state of the art submitted
[00:04:33] the previous state of the art submitted just a few days prior
[00:04:36] By October of 2019, almost exactly one year after BERT originally came out, Google had publicly discussed their use of BERT in search, with Bing following soon after, in November of the same year.
[00:04:54] But the story is actually a bit more complicated. These very large gains in quality came at a drastic increase in computational cost, which dictates latency, and latency is very important for user experience in search tasks, as we've discussed before.
[00:05:13] So, over simple query-document interaction models like Duet or Conv-KNRM, Nogueira and Cho's BERT models increased MRR by over eight points, but also increased latency to multiple seconds per query.
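As a reminder of the metric being quoted: MRR averages, over queries, the reciprocal rank of the first relevant result. A minimal sketch (the relevance judgments below are made up for illustration):

```python
# Mean Reciprocal Rank (MRR): for each query, take 1 / rank of the first
# relevant result in the ranking, then average over queries.

def mrr(ranked_relevance):
    """ranked_relevance: one list per query of 0/1 relevance judgments in
    ranked order. Queries with no relevant result contribute 0."""
    total = 0.0
    for judgments in ranked_relevance:
        for rank, rel in enumerate(judgments, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

# First query: relevant at rank 2 (RR = 0.5); second: rank 1 (RR = 1.0).
print(mrr([[0, 1, 0], [1, 0, 0]]))  # → 0.75
```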
[00:05:31] to multiple seconds per qui and so here it is natural for us to ask
[00:05:33] and so here it is natural for us to ask ourselves the question
[00:05:36] ourselves the question if we could achieve high mrr and low
[00:05:39] if we could achieve high mrr and low latency at once
[00:05:42] It turns out that the answer is yes, but it will take a lot of progress to get there, and we'll try to cover that in the rest of this screencast and in the next one. So let's get started.
[00:05:54] To seek better trade-offs between quality and latency, which is our goal, let's think about why BERT rankers are so slow. Our first observation here will be that BERT rankers are quite redundant in their computations.
[00:06:08] their computations if you think about what birth rankers do
[00:06:11] if you think about what birth rankers do um they need to compute a context
[00:06:13] um they need to compute a context a contextualized representation of the
[00:06:15] a contextualized representation of the query for each document that we rank so
[00:06:18] query for each document that we rank so that's a thousand times for 1 000
[00:06:20] that's a thousand times for 1 000 documents
[00:06:21] documents and they also must encode each document
[00:06:24] and they also must encode each document for every single query that comes along
[00:06:26] for every single query that comes along that needs a score for that document
[00:06:30] Of course, we have the documents in our collections in advance, and we can do as much pre-processing as we want on them offline, before we get any queries. So the question becomes: can we somehow pre-compute some form of document representations in advance, once and for all, using these powerful models that we have, like BERT, and store these representations, or cache them somewhere, so we can just use them quickly every time we have a query to answer?
[00:06:57] time we have a query to answer this will be our guiding question for
[00:06:59] this will be our guiding question for the remainder of this and the next
[00:07:00] the remainder of this and the next screencasts
[00:07:03] Of course, it is not actually obvious yet whether we can pre-compute such representations in advance without much loss in quality. For all we know so far, there might be a lot of empirical value in jointly representing queries and documents at once.
[00:07:22] But we'll put this hypothesis to the test. The first approach to taming the computational latency of BERT for IR is learning term weights.
[00:07:32] The key observation here is that bag-of-words models like BM25 decompose the score of every document into a summation of term-document weights. And maybe we can do the same: can we learn these term weights, with BERT in particular?
[00:07:48] A simple way to do this would be to tokenize the query and the document, feed BERT only the document, and use a linear layer to project each token in the document into a single numeric score.
[00:08:05] score the idea here is that we can save these
[00:08:07] the idea here is that we can save these document term weights to the inverted
[00:08:09] document term weights to the inverted index just like we did with bm25
[00:08:12] index just like we did with bm25 in classical ir and quickly look up
[00:08:15] in classical ir and quickly look up these term weights when answering a
[00:08:16] these term weights when answering a query
[00:08:18] query this makes sure we do not need to use
[00:08:20] this makes sure we do not need to use bert at all when answering
[00:08:23] bert at all when answering equity
[00:08:25] equity as we just shifted all of our birth work
[00:08:28] as we just shifted all of our birth work offline to the indexing stage
[00:08:33] So this can be really great: we now get to use BERT to learn much stronger term weights than BM25. DeepCT and docTTTTTquery are two major models under this efficient paradigm.
[00:08:47] paradigm as the figure shows at the bottom
[00:08:50] as the figure shows at the bottom they indeed greatly outperform bm-25 and
[00:08:52] they indeed greatly outperform bm-25 and mrr but actually they have comparable
[00:08:55] mrr but actually they have comparable latency because we're still using an
[00:08:56] latency because we're still using an inverted index to do the retrieval
[00:08:59] inverted index to do the retrieval however
[00:09:00] however the downside is that our query is back
[00:09:03] the downside is that our query is back to being a bag of words and we lose any
[00:09:05] to being a bag of words and we lose any deeper understanding of our queries
[00:09:07] deeper understanding of our queries beyond that
[00:09:11] So our central question remains: can we jointly achieve high MRR and low computational cost? As we said before, the answer is yes, and to do this, we'll discuss in the next screencast two very exciting paradigms of neural IR models that get us much closer to that goal.
Lecture 042
Neural IR, part 3 | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=KQMuiO59rGM
---
Transcript
[00:00:05] Hello everyone, welcome to part 5 of our series on NLU and IR. This screencast will be the third among our three videos on neural IR.
[00:00:17] In the previous screencast, we discussed learning term weights as a paradigm for building neural IR models that are both efficient and effective. We mentioned two such models from the IR literature, DeepCT and docTTTTTquery, both of which, despite outperforming BM25 in MRR, still left a very large margin to the quality that we see with BERT.
[00:00:41] to the quality that we see in with bert we asked ourselves can we achieve high
[00:00:43] we asked ourselves can we achieve high mrr and low computational cost at the
[00:00:46] mrr and low computational cost at the same time
[00:00:48] same time can we do better
[00:00:50] To answer this question, let us begin exploring more expressive paradigms for efficient neural IR. The next paradigm here is the representation similarity paradigm.
[00:01:03] In the representation similarity paradigm, we begin by tokenizing the query and the document, and we feed each of them independently through an encoder, like BERT for example. This encoder is then used to produce a single vector representation for the query and for the document, separately.
[00:01:22] So for BERT, we could take the [CLS] token's output embedding, for example, or we could average all the final-layer outputs.
[00:01:32] Once we have those, we finally calculate the relevance score of this document to our query as a single dot product between two vectors.
[00:01:44] This paradigm is very efficient for retrieval. First, each document can be represented as a vector offline, and this pre-computed representation can be stored on disk before we even start conducting search. Moreover, the similarity computation between a query and a document here is very cheap, and that's very efficient, as it's just a single dot product between two vectors.
[00:02:11] A very large number of IR models are representation similarity models. Many of those actually precede BERT, like DSSM and SNRM, but in the last year and a half we've seen numerous similarity models based on BERT for IR tasks, including Sentence-BERT, ORQA, DPR, and ANCE, among others. Many of these models were actually proposed concurrently with each other, and their primary differences lie in the specific tasks that each one targets and the supervision approach each one suggests.
[00:02:46] suggests so let us delve deeper into a
[00:02:48] so let us delve deeper into a representative and one of the earlier
[00:02:50] representative and one of the earlier and most popular
[00:02:51] and most popular models among those this is the dense
[00:02:53] models among those this is the dense passage retriever or dpr by carboquin
[00:02:56] passage retriever or dpr by carboquin ital which appeared at emlp just a few
[00:02:59] ital which appeared at emlp just a few months ago
[00:03:01] months ago dpr encodes each message or document as
[00:03:04] dpr encodes each message or document as a 768-dimensional vector and similarly
[00:03:07] a 768-dimensional vector and similarly for each query
[00:03:09] for each query during training dpr produces a
[00:03:12] during training dpr produces a similarity score between the query and a
[00:03:14] similarity score between the query and a positive passage so that's the relevant
[00:03:16] positive passage so that's the relevant passage we wanted to retrieve as well as
[00:03:18] passage we wanted to retrieve as well as between the query and a few negatives
[00:03:21] between the query and a few negatives some of them are sampled from the bm25
[00:03:23] some of them are sampled from the bm25 top 100 and others are in-batch
[00:03:25] top 100 and others are in-batch negatives which are actually positives
[00:03:27] negatives which are actually positives but for other queries in the same
[00:03:28] but for other queries in the same training batch
[00:03:31] Once DPR has all of those scores during training, it then optimizes a classification loss, namely an n-way classification loss: cross-entropy with a softmax over the scores of the one positive and all of these negatives, with the target of selecting the positive passage, of course.
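A minimal sketch of that objective, with made-up scores; in DPR the inputs would be learned query-passage similarity scores:

```python
# DPR-style training objective: softmax over the similarity scores of one
# positive passage and several negatives; minimize the negative
# log-likelihood of the positive.
import math

def nll_of_positive(pos_score, neg_scores):
    """Cross-entropy with softmax over [positive] + negatives;
    the target class is the positive passage (index 0)."""
    logits = [pos_score] + list(neg_scores)
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[0] / sum(exps))

# A confident model (positive scored far above the negatives) gets a low
# loss, while a confused model (all scores equal) gets a higher one.
low = nll_of_positive(10.0, [1.0, 0.5])
high = nll_of_positive(1.0, [1.0, 1.0])
print(low < high)  # → True
```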
[00:03:50] DPR was not tested on the MS MARCO dataset by the original authors, but subsequent work by Xiong et al. tests a DPR-like retriever on MS MARCO and achieves 31 MRR. They also then suggest more sophisticated approaches for supervision, which can increase this MRR by a couple of points. So both of these demonstrate considerable progress over the learned term weight models that we looked at before, like DeepCT or docTTTTTquery, but they still substantially trail behind BERT's much higher effectiveness.
[00:04:27] behind births much higher effectiveness so why is that
[00:04:30] so why is that as it turns out representation
[00:04:32] as it turns out representation similarity models suffer from two major
[00:04:34] similarity models suffer from two major downsides when it comes to ir tasks
[00:04:38] downsides when it comes to ir tasks first are there
[00:04:40] first are there single vector representations which
[00:04:42] single vector representations which involve cramming each query and each
[00:04:44] involve cramming each query and each document into one rather low dimensional
[00:04:47] document into one rather low dimensional vector
[00:04:49] Second is their lack of fine-grained interactions during matching. Representation similarity models estimate relevance as one dot product between two vectors, and so they lose the term-level interactions between the query terms and the document terms that we had in query-document interaction models like BERT. In fact, even simple term weighting models like BM25 or DeepCT had, by design, some element of term-level matching that we lose here.
[00:05:19] So our next natural question becomes: can we obtain the efficiency benefits of precomputation that we get from representation similarity models, while still keeping the fine-grained term-level interactions that we used to have before, with a model like BERT or DeepCT?
[00:05:42] Toward answering that question, I think it helps to review the neural IR paradigms we've seen so far. On the left-hand side, we looked at the learned term weights paradigm. These models offered independent encoding of queries and documents, which was great for efficiency, but they forced us to work with a bag-of-words query that loses all context, and thus were not as competitive as we wanted them to be.
[00:06:10] We then explored the representation similarity models, which also allowed us to compute independent encodings of queries and documents, which again was really useful for efficiency. But this time, we were forced to work with single-vector representations, and we lost our fine-grained term-level interactions, which we intuitively believe to be very useful for matching in IR tasks.
[00:06:37] On the right-hand side, we looked initially at the query-document interaction models, like standard BERT classifiers. These offered very high accuracy, but were extremely expensive to use, because the entire computation for one document depended on both the query and the document; we simply couldn't do any pre-computation offline, in advance, in this case.
[00:07:02] So can we somehow combine the advantages of all three of these paradigms at once? Before we answer that question, there's actually one final feature, one final capability, of the first two paradigms that we should discuss.
[00:07:17] paradigms that we should discuss so query document interaction models
[00:07:20] so query document interaction models which are quite expensive they forced us
[00:07:22] which are quite expensive they forced us to use a re-ranking pipeline
[00:07:24] to use a re-ranking pipeline this is a pipeline where we rescored the
[00:07:26] this is a pipeline where we rescored the top thousand documents that we already
[00:07:28] top thousand documents that we already retrieved by bm25
[00:07:32] Sometimes that's okay, but in many cases this can be a problem, because it ties our recall to the recall of BM25, which is ultimately a model that relies on finding terms that match exactly across queries and documents, and so it can be quite restrictive in many cases.
[00:07:50] many cases when recall is an important
[00:07:52] when recall is an important consideration
[00:07:54] consideration we often want our neural model that we
[00:07:56] we often want our neural model that we trained to do the to do end-to-end
[00:07:59] trained to do the to do end-to-end retrieval that is to search quickly over
[00:08:02] retrieval that is to search quickly over all the documents in our collection
[00:08:03] all the documents in our collection directly without any ranking pipeline
[00:08:08] Learned term weights and representation similarity models, which we looked at so far, alleviate this constraint, and this is a big advantage for them. Specifically, when we learn term weights, we can save these weights in the inverted index, just like with BM25, and that allows us to obtain fast retrieval.
[00:08:30] When we learn vector representations, it also turns out that we can index these vectors using libraries for fast vector similarity search, like FAISS. This relies on efficient data structures that support pruning, which is basically finding the top-k matches, say the top 10 or the top 100 matches, without having to exhaustively enumerate all possible candidates.
[00:08:57] The details of search with these pruning data structures are beyond our scope, but it's really useful to be aware of this important capability for end-to-end retrieval.
[00:09:09] Okay, so let's go back to our last main question: can we obtain the efficiency benefits of precomputation while still having the fine-grained term-level interactions that we used to have?
[00:09:22] The neural IR paradigm that will allow us to do this is called late interaction, and this is something that I've worked on here at Stanford. So let's build late interaction from the ground up.
[00:09:33] We'll start, as usual, with tokenization of the query and the document. We will seek to independently encode the query and the document, but into fine-grained representations this time.
[00:09:47] As you can see on the left-hand side, this is actually not hard: as shown, we can feed the query and the document separately through two copies of BERT, and keep all the output embeddings corresponding to all the tokens as our fine-grained representations for the query and for the document.
[00:10:06] for the query and for the document okay so we're only going to be done here
[00:10:09] okay so we're only going to be done here once we actually close this loop right
[00:10:12] once we actually close this loop right we still need to estimate relevance
[00:10:14] we still need to estimate relevance between this query and that document
[00:10:17] between this query and that document essentially we have two matrices and we
[00:10:19] essentially we have two matrices and we need a notion of similarity between
[00:10:21] need a notion of similarity between these two matrices or these two bags of
[00:10:24] these two matrices or these two bags of vectors
[00:10:28] However, not every approach will suffice. We insist on a scalable mechanism, one that allows us to use vector similarity search with pruning to conduct end-to-end retrieval, in a scalable fashion, across the entire collection.
[00:10:44] For doing this, it turns out that a very simple interaction mechanism offers both scaling and high quality.
[00:10:54] So here's what we'll do: for each query embedding, as I show here, we compute a maximum similarity score across all of the document embeddings.
[00:11:06] This is just going to be a cosine similarity, giving us a single partial score for this query term: the maximum cosine similarity across all of the blue embeddings, in this case.
[00:11:23] We'll repeat this for all the query embeddings, and we'll simply sum all of these maximum similarity scores to get our final score for this document.
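The scoring just described, the MaxSim operation, can be sketched directly; the embeddings below are tiny made-up vectors standing in for BERT token outputs:

```python
# Late interaction (MaxSim) scoring: for each query embedding, take the
# maximum cosine similarity over all document embeddings, then sum these
# per-term maxima to get the document's score.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def late_interaction_score(query_embs, doc_embs):
    """Sum over query terms of the max similarity to any document term."""
    return sum(max(cosine(q, d) for d in doc_embs) for q in query_embs)

# Two query-term embeddings, three document-term embeddings (made up).
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_embs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(late_interaction_score(query_embs, doc_embs))  # → 2.0
```

Each query term here finds an exact match among the document terms, so each contributes its maximum similarity of 1.0 to the total.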
[00:11:36] We will refer to this general paradigm as late interaction, and to this specific model, shown here on top of BERT, as ColBERT. The intuition is simple: for every term in the query, we're just trying to softly and contextually locate that term in the document, assigning a score to how successful this matching was.
[00:12:00] Let me illustrate this with a real example from the MS MARCO ranking development set, and I hope it will be quite intuitive once you see it. At the top is a query, and at the bottom is a portion of the correct passage that ColBERT retrieves at position one.
[00:12:15] colbert retrieves at position one because we have this simple late
[00:12:17] because we have this simple late interaction mechanism
[00:12:19] interaction mechanism we can actually explore the behavior and
[00:12:21] we can actually explore the behavior and we can see in this particular example
[00:12:23] we can see in this particular example that colbert matches
[00:12:25] that colbert matches through maximum similarity operators
[00:12:28] through maximum similarity operators the word when in the question with the
[00:12:31] the word when in the question with the word on in the phrase on august 8th
[00:12:33] word on in the phrase on august 8th which is a date as we might expect
[00:12:37] which is a date as we might expect it matches the word transformers with
[00:12:39] it matches the word transformers with the same word in the document
[00:12:40] the same word in the document it matches cartoon with animated
[00:12:43] it matches cartoon with animated and it matches the individual words come
[00:12:45] and it matches the individual words come and out with the term released um when
[00:12:48] and out with the term released um when in in in the phrase it was released on
[00:12:51] in in in the phrase it was released on august 8th
[00:12:53] august 8th in the document
[00:12:55] in the document as we might intuitively expect so we're
[00:12:57] as we might intuitively expect so we're basically just trying to contextually
[00:12:58] basically just trying to contextually match these query terms
[00:13:01] match these query terms in the document
[00:13:02] in the document and assign some matching score for each
[00:13:05] and assign some matching score for each of these terms
[00:13:06] of these terms so notice here and and remember that
[00:13:09] so notice here and and remember that covaria presents each document as a
[00:13:11] covaria presents each document as a dense matrix of many vectors and in
[00:13:13] dense matrix of many vectors and in particular one vector per token
[00:13:15] particular one vector per token and this differs from the representation
[00:13:17] and this differs from the representation similarity models we looked at before
[00:13:20] similarity models we looked at before which tried to cram each document into
[00:13:22] which tried to cram each document into one vector and what makes this possible
[00:13:24] one vector and what makes this possible is the maximum similarity
[00:13:26] is the maximum similarity operators that we have on top of these
[00:13:29] operators that we have on top of these matrix representations
[00:13:31] matrix representations so how well does colbert do
[00:13:35] so how well does colbert do and how does it do with this gap that we
[00:13:37] and how does it do with this gap that we have here between efficient models and
[00:13:39] have here between efficient models and highly effective ones
[00:13:42] highly effective ones well by redesigning the model
[00:13:44] well by redesigning the model architecture and offering a late
[00:13:46] architecture and offering a late interaction paradigm colbert allows us
[00:13:48] interaction paradigm colbert allows us to achieve quality competitive with
[00:13:50] to achieve quality competitive with bert at a small fraction of the cost
[00:13:54] perhaps more importantly colbert can
[00:13:56] perhaps more importantly colbert can scale to the entire collection due to
[00:13:59] scale to the entire collection due to pruning with end-to-end retrieval
[00:14:01] pruning with end-to-end retrieval all nine million passages here in this
[00:14:03] all nine million passages here in this case um while maintaining sub-second
[00:14:05] case um while maintaining sub-second latencies
[00:14:08] latencies um and thus it allows much higher recall
[00:14:10] um and thus it allows much higher recall than traditional re-ranking pipelines
[00:14:12] than traditional re-ranking pipelines permit
[00:14:18] all right so so far we've looked at in
[00:14:19] all right so so far we've looked at in domain effectiveness evaluations
[00:14:21] domain effectiveness evaluations basically cases where we had training
[00:14:24] basically cases where we had training and evaluation data for the ir task at
[00:14:26] and evaluation data for the ir task at hand which was ms marco so far but we
[00:14:30] hand which was ms marco so far but we often want to
[00:14:32] often want to use retrieval in new out of domain
[00:14:34] use retrieval in new out of domain settings we just want to throw our
[00:14:36] settings we just want to throw our search engine
[00:14:37] search engine at a difficult problem without training
[00:14:39] at a difficult problem without training data without validation data and see it
[00:14:42] data without validation data and see it perform well
[00:14:44] perform well we briefly discussed the air before
[00:14:47] we briefly discussed the air before which is
[00:14:48] which is a recent effort to test our models in a
[00:14:51] a recent effort to test our models in a zero shot
[00:14:53] zero shot setting where the models are trained on
[00:14:55] setting where the models are trained on one ir task and then they're
[00:14:57] one ir task and then they're fixed and then they are tested on a
[00:14:59] fixed and then they are tested on a completely different
[00:15:01] completely different set of tasks
[00:15:03] set of tasks beer includes 17 ir data sets and there
[00:15:06] beer includes 17 ir data sets and there are nine different
[00:15:08] are nine different ir tasks or scenarios and the authors
[00:15:10] ir tasks or scenarios and the authors nandan et al compared a lot of the ir
[00:15:13] nandan et al compared a lot of the ir models that we discussed today in a
[00:15:14] models that we discussed today in a zero-shot manner against each other
[00:15:17] zero-shot manner against each other across all of these tasks
[00:15:19] across all of these tasks so let's take a look
[00:15:22] so let's take a look um
[00:15:23] um here we have bm25
[00:15:25] here we have bm25 results for an interaction model
[00:15:28] results for an interaction model which is in this case electra which
[00:15:30] which is in this case electra which tends to perform slightly better than
[00:15:32] tends to perform slightly better than bert for ranking we have two
[00:15:34] birth for ranking we have two representation similarity models dpr and
[00:15:36] representation similarity models dpr and s-bert
[00:15:38] s-bert and we have a late interaction model
[00:15:39] and we have a late interaction model which is colbert
[00:15:41] which is colbert um
[00:15:42] um the best
[00:15:44] the best in each row in each ir task is shown in
[00:15:46] in each row in each ir task is shown in bold and we see that across all tasks
[00:15:49] bold and we see that across all tasks the strongest model at ndcg at 10
[00:15:53] the strongest model at ndcg at 10 is always one of the three models that
[00:15:55] is always one of the three models that involve term-level interactions which
[00:15:57] involve term-level interactions which are electra colbert and bm25
[00:16:01] are electra colbert and bm25 interestingly the single vector
[00:16:03] interestingly the single vector approaches which seemed quite promising
[00:16:05] approaches which seemed quite promising so far
[00:16:06] so far failed to generalize robustly according
[00:16:08] failed to generalize robustly according to these results
[00:16:10] to these results whereas colbert which is also a fast
[00:16:12] whereas colbert which is also a fast model
[00:16:14] model almost matches the quality of the
[00:16:15] almost matches the quality of the expensive electra ranker
[00:16:20] the results so far were on the metric
[00:16:22] the results so far were on the metric ndcg at 10 which is a precision
[00:16:24] ndcg at 10 which is a precision oriented metric that looks at the top results
[00:16:27] oriented metric looks at the top results um but here i have the author's results
[00:16:30] um but here i have the author's results um
[00:16:31] um after the task level aggregation um
[00:16:33] after the task level aggregation um considering recall at 100
[00:16:37] considering recall at 100 and here although
[00:16:41] we see that the
[00:16:44] we see that the results are rather similar when we
[00:16:46] the results are rather similar when we consider recall
[00:16:48] consider recall but one major difference is that
[00:16:50] but one major difference is that colbert's late interaction mechanism
[00:16:52] colbert's late interaction mechanism which allows it to conduct end-to-end
[00:16:53] which allows it to conduct end-to-end retrieval
[00:16:55] retrieval with high quality allows it to achieve
[00:16:57] with high quality allows it to achieve the strongest recall in this case
[00:17:00] the strongest recall in this case and so we can conclude basically that
[00:17:03] and so we can conclude basically that scalable fine-grained interaction is key
[00:17:05] scalable fine-grained interaction is key to robustly high recall
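As a reminder of what the recall metric discussed here measures, recall@k can be computed as below (a generic sketch, not the BEIR evaluation code; the function name and arguments are illustrative):

```python
def recall_at_k(ranked_doc_ids, relevant_ids, k=100):
    """Fraction of the relevant documents that appear in the top-k
    positions of a ranked list of document ids."""
    top_k = set(ranked_doc_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)
```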
[00:17:09] of course notice that the bm25 and
[00:17:11] of course notice that the bm25 and electra recall here
[00:17:13] electra recall here is the same since electra just
[00:17:15] is the same since electra just re-scores the top 100 in this case from
[00:17:18] scores the top 100 in this case from bm25
[00:17:20] bm25 so this concludes our neural ir section
[00:17:23] so this concludes our neural ir section of the nlu plus ir series
[00:17:26] of the nlu plus ir series in the next next screencast we will
[00:17:28] in the next next screencast we will discuss how scalability
[00:17:30] discuss how scalability with these retrieval models can actually
[00:17:32] with these retrieval models can actually drive large gains in quality not just
[00:17:35] drive large gains in quality not just speed
[00:17:36] speed uh which we haven't seen so far except
[00:17:38] uh which we haven't seen so far except in the recall case
[00:17:39] in the recall case and how tuning a neural ir model fits
[00:17:42] and how tuning a neural ir model fits into a larger downstream open domain
[00:17:44] into a larger downstream open domain nlu task
Lecture 043
Relation Extraction | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=4AjieiJ1CXo
---
Transcript
[00:00:05] our topic for today and wednesday is
[00:00:08] our topic for today and wednesday is relation extraction
[00:00:09] relation extraction and this is an exciting topic both
[00:00:12] and this is an exciting topic both because it's a great arena to explore a
[00:00:16] because it's a great arena to explore a variety of nlu and machine learning
[00:00:18] variety of nlu and machine learning techniques
[00:00:19] techniques and because it has so many real world
[00:00:22] and because it has so many real world applications as we'll see in a moment
[00:00:26] applications as we'll see in a moment so here's an overview of the next two
[00:00:28] so here's an overview of the next two lectures i'm going to start by
[00:00:29] lectures i'm going to start by describing the task of relation
[00:00:31] describing the task of relation extraction
[00:00:32] extraction what it is why it matters and how we
[00:00:35] what it is why it matters and how we might approach it
[00:00:36] might approach it then i'll describe the data resources
[00:00:38] then i'll describe the data resources we'll need to make headway on the
[00:00:40] we'll need to make headway on the problem
[00:00:42] problem next i'll provide a more precise
[00:00:44] next i'll provide a more precise formulation of the prediction problem
[00:00:46] formulation of the prediction problem that we're taking on and i'll propose a
[00:00:48] that we're taking on and i'll propose a strategy for quantitative evaluation
[00:00:52] strategy for quantitative evaluation then we'll establish some lower bounds
[00:00:54] then we'll establish some lower bounds on performance by evaluating some very
[00:00:56] on performance by evaluating some very simple approaches to this problem
[00:00:59] simple approaches to this problem and finally we'll point toward some
[00:01:01] and finally we'll point toward some directions for future exploration
[00:01:06] in the first section i'll start by
[00:01:07] in the first section i'll start by defining the task of relation extraction
[00:01:10] defining the task of relation extraction and then i'll try to provide some
[00:01:11] and then i'll try to provide some motivation for why it's an important and
[00:01:14] motivation for why it's an important and exciting problem
[00:01:15] exciting problem i'll describe both the vision that
[00:01:18] i'll describe both the vision that originally inspired research in this
[00:01:20] originally inspired research in this area and a range of current practical
[00:01:23] area and a range of current practical applications for relation extraction
[00:01:27] applications for relation extraction then i'll describe three paradigms that
[00:01:29] then i'll describe three paradigms that correspond to three stages in the
[00:01:33] correspond to three stages in the evolution of approaches to relation
[00:01:35] evolution of approaches to relation extraction
[00:01:36] extraction hand-built patterns which were dominant
[00:01:38] hand-built patterns which were dominant in the 80s and 90s
[00:01:40] in the 80s and 90s supervised learning which became
[00:01:42] supervised learning which became dominant in the 2000s
[00:01:44] dominant in the 2000s and distant supervision which has
[00:01:47] and distant supervision which has dominated since about 2010 and will be
[00:01:50] dominated since about 2010 and will be our main focus
[00:01:52] our main focus so let's dive in
[00:01:55] so let's dive in so the task of relation extraction is
[00:01:57] so the task of relation extraction is about extracting structured knowledge
[00:02:00] about extracting structured knowledge from natural language text
[00:02:03] from natural language text we want to be able to start from a
[00:02:04] we want to be able to start from a document like this
[00:02:06] document like this this could be a news story or a web page
[00:02:10] this could be a news story or a web page and extract relational triples like
[00:02:13] and extract relational triples like founders paypal elon musk and founders
[00:02:16] founders paypal elon musk and founders spacex elon musk
[00:02:20] next we find this document
[00:02:22] next we find this document and we want to be able to extract
[00:02:25] and we want to be able to extract has spouse elon musk
[00:02:27] has spouse elon musk to lula riley
[00:02:30] to lula riley we keep reading another document
[00:02:33] we keep reading another document and we want to extract
[00:02:35] and we want to extract worked at elon musk tesla motors
[00:02:39] if we can accumulate a large knowledge
[00:02:42] if we can accumulate a large knowledge base of relational triples we can use it
[00:02:45] base of relational triples we can use it to power question answering and other
[00:02:47] to power question answering and other applications
[00:02:49] applications building a knowledge base like this
[00:02:51] building a knowledge base like this manually is slow and expensive
[00:02:55] manually is slow and expensive but much of the knowledge that we'd like
[00:02:56] but much of the knowledge that we'd like to capture
[00:02:58] to capture is already expressed in abundant text on
[00:03:00] is already expressed in abundant text on the web
[00:03:02] the web so the aim of relation extraction is to
[00:03:04] so the aim of relation extraction is to accelerate knowledge base construction
[00:03:07] accelerate knowledge base construction by extracting relational triples from
[00:03:10] by extracting relational triples from natural language text
[00:03:14] here's a nice articulation of the vision
[00:03:16] here's a nice articulation of the vision for relation extraction this is from tom
[00:03:18] for relation extraction this is from tom mitchell who is the former chair of
[00:03:21] mitchell who is the former chair of the machine learning
[00:03:22] the machine learning department at cmu he's also the author
[00:03:25] department at cmu he's also the author of one of the first textbooks on machine
[00:03:27] of one of the first textbooks on machine learning
[00:03:29] learning by the way he was also the phd advisor
[00:03:31] by the way he was also the phd advisor of sebastian throne and oren ezioni
[00:03:35] of sebastian throne and oren ezioni he wrote this piece in 2005 describing a
[00:03:38] he wrote this piece in 2005 describing a vision for machine reading and he
[00:03:41] vision for machine reading and he offered to bet a lobster dinner that by
[00:03:43] offered to bet a lobster dinner that by 2015 we will have a computer program
[00:03:45] 2015 we will have a computer program capable of automatically reading at
[00:03:47] capable of automatically reading at least 80% of the factual content on the
[00:03:50] least 80% of the factual content on the web and placing those facts in a
[00:03:52] web and placing those facts in a structured knowledge base
[00:03:54] structured knowledge base i think we've come pretty close to
[00:03:56] i think we've come pretty close to achieving that goal and this is exactly
[00:03:58] achieving that goal and this is exactly the goal that relation extraction aims
[00:04:00] the goal that relation extraction aims at to extract structured knowledge from
[00:04:03] at to extract structured knowledge from unstructured text
[00:04:07] one of the things that makes relation
[00:04:09] one of the things that makes relation extraction an exciting topic is the
[00:04:11] extraction an exciting topic is the abundance of real world applications
[00:04:14] abundance of real world applications for example nowadays intelligent
[00:04:16] for example nowadays intelligent assistants like siri or google can
[00:04:19] assistants like siri or google can answer lots of factual questions like
[00:04:21] answer lots of factual questions like who sang love train
[00:04:24] who sang love train to do this they rely on knowledge bases
[00:04:26] to do this they rely on knowledge bases or kbs containing thousands of relations
[00:04:29] or kbs containing thousands of relations millions of entities and billions of
[00:04:32] millions of entities and billions of individual facts
[00:04:34] individual facts there are many different strategies for
[00:04:35] there are many different strategies for building and maintaining and extending
[00:04:37] building and maintaining and extending these kb's but considering how enormous
[00:04:40] these kb's but considering how enormous they are and how quickly the world is
[00:04:43] they are and how quickly the world is creating new facts it's a process that
[00:04:45] creating new facts it's a process that you want to automate as much as possible
[00:04:48] you want to automate as much as possible so more and more
[00:04:50] so more and more relation extraction from the web is
[00:04:52] relation extraction from the web is hugely strategic for apple and google
[00:04:55] hugely strategic for apple and google and other companies
[00:04:57] and other companies in fact in 2017 apple spent 200 million
[00:05:00] in fact in 2017 apple spent 200 million dollars to acquire a startup called
[00:05:03] dollars to acquire a startup called lattice which was co-founded by stanford
[00:05:05] lattice which was co-founded by stanford professor chris ré whom some of you
[00:05:07] professor chris ré whom some of you may know
[00:05:08] may know specifically for this purpose
[00:05:12] another example is building ontologies
[00:05:15] another example is building ontologies uh if you're running an app store you're
[00:05:17] uh if you're running an app store you're gonna need a taxonomy of categories of
[00:05:20] gonna need a taxonomy of categories of apps and which apps belong to which
[00:05:23] apps and which apps belong to which categories
[00:05:25] categories one one category of apps is video games
[00:05:28] one one category of apps is video games but if you're a gamer you know that
[00:05:30] but if you're a gamer you know that there are sub categories
[00:05:32] there are sub categories and sub sub categories and sub sub sub
[00:05:35] and sub sub categories and sub sub sub categories of video games and new ones
[00:05:38] categories of video games and new ones keep appearing and new games appear
[00:05:41] keep appearing and new games appear every day how are you going to keep your
[00:05:43] every day how are you going to keep your ontology up to date
[00:05:46] ontology up to date well there's a lot of people writing
[00:05:48] well there's a lot of people writing about video games on the web so maybe
[00:05:50] about video games on the web so maybe relation extraction can help
[00:05:53] relation extraction can help the relation between a category and a
[00:05:54] the relation between a category and a subcategory or between a category and an
[00:05:58] subcategory or between a category and an instance of the category
[00:06:00] instance of the category can be a target for relation extraction
[00:06:03] can be a target for relation extraction and similarly you can imagine using
[00:06:05] and similarly you can imagine using relation extraction to help build or
[00:06:07] relation extraction to help build or maintain ontologies of car parts or
[00:06:12] maintain ontologies of car parts or companies or viruses
[00:06:17] another example comes from
[00:06:18] another example comes from bioinformatics so every year there are
[00:06:20] bioinformatics so every year there are thousands of new research articles
[00:06:22] thousands of new research articles describing gene regulatory networks
[00:06:25] describing gene regulatory networks if we can apply relation extraction to
[00:06:27] if we can apply relation extraction to these articles to populate a database of
[00:06:30] these articles to populate a database of gene regulation relationships
[00:06:33] gene regulation relationships then we can begin to apply existing well
[00:06:36] then we can begin to apply existing well understood data mining techniques
[00:06:38] understood data mining techniques we can look for
[00:06:39] we can look for statistical correlations or apply
[00:06:43] statistical correlations or apply clever graph algorithms to activation
[00:06:45] clever graph algorithms to activation networks the sky's the limit we've
[00:06:48] networks the sky's the limit we've turned something that a machine can't
[00:06:50] turned something that a machine can't understand into something that a machine
[00:06:52] understand into something that a machine can understand
[00:06:55] so let's turn to the question of how
[00:06:57] so let's turn to the question of how you'd actually solve this problem
[00:06:59] you'd actually solve this problem the most obvious way to start is to
[00:07:01] the most obvious way to start is to write down a few patterns which express
[00:07:03] write down a few patterns which express each relation
[00:07:05] each relation so for example if we want to find new
[00:07:06] so for example if we want to find new instances of the founders relation so we
[00:07:09] instances of the founders relation so we can use patterns like x is the founder
[00:07:11] can use patterns like x is the founder of y or x comma who founded y
[00:07:14] of y or x comma who founded y or y was founded by x
[00:07:17] or y was founded by x and then if we search a large corpus
[00:07:20] and then if we search a large corpus we may find sentences like these that
[00:07:23] we may find sentences like these that match these patterns
[00:07:24] match these patterns and allow us to extract the fact that
[00:07:27] and allow us to extract the fact that elon musk founded spacex
[00:07:31] elon musk founded spacex so this seems promising and in fact this
[00:07:34] so this seems promising and in fact this was the dominant paradigm in relation
[00:07:36] was the dominant paradigm in relation extraction in the early days
[00:07:40] extraction in the early days but this approach is really limited
[00:07:43] but this approach is really limited the central challenge of relation
[00:07:45] the central challenge of relation extraction is the fantastic diversity of
[00:07:48] extraction is the fantastic diversity of language the multitude of possible ways
[00:07:51] language the multitude of possible ways to express a given relation
[00:07:54] to express a given relation for example each of these sentences
[00:07:56] for example each of these sentences also expresses the fact that elon musk
[00:07:59] also expresses the fact that elon musk founded spacex
[00:08:00] founded spacex but in these examples the patterns which
[00:08:02] but in these examples the patterns which connect elon musk with spacex
[00:08:05] connect elon musk with spacex are not ones that we could have easily
[00:08:07] are not ones that we could have easily anticipated
[00:08:09] anticipated they might be ones that will never
[00:08:11] they might be ones that will never recur again
[00:08:13] recur again so to do relation extraction effectively
[00:08:16] so to do relation extraction effectively we need to go beyond hand-built patterns
[00:08:21] so around the turn of the century the
[00:08:23] so around the turn of the century the machine learning revolution came to the
[00:08:25] machine learning revolution came to the field of nlp and people began to try a
[00:08:28] field of nlp and people began to try a new approach namely supervised learning
[00:08:31] new approach namely supervised learning so you start by labeling your examples
[00:08:34] so you start by labeling your examples so these three examples
[00:08:37] so these three examples are positive instances of the founders
[00:08:40] are positive instances of the founders relation
[00:08:41] relation uh so these are the positive examples
[00:08:45] uh so these are the positive examples um
[00:08:47] um and
[00:08:48] and these two are negative
[00:08:50] these two are negative examples
[00:08:52] examples now that we have labeled training data
[00:08:54] now that we have labeled training data we can train a model
[00:08:56] we can train a model and it could be a simple linear model
[00:08:59] and it could be a simple linear model that uses a bag of words representation
[00:09:02] that uses a bag of words representation and assigns higher weights to words like
[00:09:04] and assigns higher weights to words like founder and established
[00:09:06] founder and established that are likely to indicate the founders
[00:09:09] that are likely to indicate the founders relation
[00:09:10] relation or it could be something more
[00:09:11] or it could be something more complicated
[00:09:13] complicated in any case this was a hugely successful
[00:09:15] in any case this was a hugely successful idea
[00:09:16] idea even simple machine learned models are
[00:09:19] even simple machine learned models are far better at generalizing to new data
[00:09:21] far better at generalizing to new data than static patterns
[00:09:23] than static patterns but there's a big problem
[00:09:26] but there's a big problem manually labeling training examples is
[00:09:29] manually labeling training examples is laborious and time consuming and
[00:09:32] laborious and time consuming and expensive
[00:09:34] expensive and as a consequence the largest labeled
[00:09:37] and as a consequence the largest labeled data sets that were produced
[00:09:39] data sets that were produced had only tens of thousands of examples
[00:09:41] had only tens of thousands of examples which by modern standards seems puny
[00:09:45] which by modern standards seems puny if we want to apply modern machine
[00:09:47] if we want to apply modern machine learning techniques we need a lot more
[00:09:49] learning techniques we need a lot more data we need a way to leverage vastly
[00:09:52] data we need a way to leverage vastly greater quantities of training data
[00:09:56] the answer appeared around 2010 with an
[00:09:59] the answer appeared around 2010 with an idea called distant supervision and
[00:10:02] idea called distant supervision and this is a really big idea
[00:10:05] this is a really big idea instead of manually labeling individual
[00:10:07] instead of manually labeling individual examples we're going to automatically
[00:10:10] examples we're going to automatically derive the labels from an existing
[00:10:12] derive the labels from an existing knowledge base
[00:10:13] knowledge base so let's say we already have a kb that
[00:10:16] so let's say we already have a kb that contains many examples of the founders
[00:10:18] contains many examples of the founders relation so we've got spacex and elon
[00:10:20] relation so we've got spacex and elon musk apple and steve jobs and so on
[00:10:25] musk apple and steve jobs and so on and let's say we have a large corpus of
[00:10:26] and let's say we have a large corpus of text
[00:10:27] text it can be unlabeled text raw text
[00:10:31] it can be unlabeled text raw text which means that it can be truly
[00:10:33] which means that it can be truly enormous it can be the whole web
[00:10:37] enormous it can be the whole web what we're going to do is we're going to
[00:10:38] what we're going to do is we're going to simply assume
[00:10:41] simply assume that every sentence which contains a
[00:10:43] that every sentence which contains a pair of entities which are related in
[00:10:45] pair of entities which are related in the kb
[00:10:46] the kb like elon musk and spacex is a positive
[00:10:50] like elon musk and spacex is a positive example for that relation
[00:10:53] example for that relation and we're going to assume that every
[00:10:55] and we're going to assume that every sentence which contains a pair of
[00:10:56] sentence which contains a pair of entities that are unrelated in the kb
[00:10:59] entities that are unrelated in the kb like elon musk and apple
[00:11:01] like elon musk and apple is a negative example
[00:11:06] genius this gives us a way to
[00:11:08] genius this gives us a way to generate massive quantities of training
[00:11:10] generate massive quantities of training data practically free
[00:11:13] data practically free however
[00:11:14] however you might have some doubts about the
[00:11:16] you might have some doubts about the validity of those assumptions so hold
[00:11:18] validity of those assumptions so hold that thought
[00:11:22] distant supervision is a really
[00:11:24] distant supervision is a really powerful idea but it has two important
[00:11:27] powerful idea but it has two important limitations
[00:11:28] limitations the first is a consequence of making the
[00:11:31] the first is a consequence of making the unreliable assumption
[00:11:32] unreliable assumption that all sentences where related
[00:11:35] that all sentences where related entities co-occur
[00:11:37] entities co-occur actually express that relation
[00:11:40] actually express that relation inevitably some of them don't
[00:11:42] inevitably some of them don't like this example we labeled it as a
[00:11:45] like this example we labeled it as a positive example for the founders
[00:11:47] positive example for the founders relation but it doesn't express that
[00:11:49] relation but it doesn't express that relation at all this doesn't say that
[00:11:51] relation at all this doesn't say that elon musk is is a founder of spacex so
[00:11:54] elon musk is is a founder of spacex so this label is a lie a dirty dirty lie
[00:12:00] this label is a lie a dirty dirty lie making this assumption blindly has the
[00:12:02] making this assumption blindly has the effect of introducing noise into our
[00:12:05] effect of introducing noise into our training data
[00:12:07] training data distance supervision is effective in
[00:12:09] distance supervision is effective in spite of this problem because it makes
[00:12:12] spite of this problem because it makes it possible to leverage vastly greater
[00:12:14] it possible to leverage vastly greater quantities of training data and the
[00:12:17] quantities of training data and the benefit of more data outweighs the harm
[00:12:21] benefit of more data outweighs the harm of noisier data
[00:12:24] of noisier data by the way i feel like i've waited my
[00:12:26] by the way i feel like i've waited my whole life for the right opportunity to
[00:12:28] whole life for the right opportunity to use the pinocchio emoji
[00:12:31] use the pinocchio emoji the day finally came
[00:12:33] the day finally came and it feels good
[00:12:37] the second limitation is that we need an
[00:12:40] the second limitation is that we need an existing kb to start from
[00:12:42] existing kb to start from we can only train a model to extract new
[00:12:45] we can only train a model to extract new instances of the founders relation if we
[00:12:48] instances of the founders relation if we already have many instances of the
[00:12:50] already have many instances of the founders relation so while distant
[00:12:52] founders relation so while distant supervision is a great way to extend an
[00:12:55] supervision is a great way to extend an existing kb it's not useful for creating
[00:12:57] existing kb it's not useful for creating a kb containing new relations from
[00:12:59] a kb containing new relations from scratch
Lecture 044
Data Resources | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=g4KCti_rZA4
---
Transcript
[00:00:04] so that's the end of the introduction um
[00:00:06] so that's the end of the introduction um let's
[00:00:07] let's now begin to drill down on the data
[00:00:09] now begin to drill down on the data resources that we'll need to launch our
[00:00:12] resources that we'll need to launch our investigation
[00:00:13] investigation and there are two different kinds of
[00:00:14] and there are two different kinds of data we need to talk about the corpus
[00:00:16] data we need to talk about the corpus and the kb
[00:00:19] just like any other nlp problem we need
[00:00:21] just like any other nlp problem we need to start with a corpus a large
[00:00:24] to start with a corpus a large collection of natural language text and
[00:00:26] collection of natural language text and for relation extraction we need
[00:00:28] for relation extraction we need sentences containing two or more
[00:00:30] sentences containing two or more entities
[00:00:32] entities and because our goal is to do relation
[00:00:35] and because our goal is to do relation extraction with distant supervision
[00:00:37] extraction with distant supervision we need to be able to connect the
[00:00:39] we need to be able to connect the entities to a kb
[00:00:42] entities to a kb so we need a corpus in which the entity
[00:00:44] so we need a corpus in which the entity mentions are annotated with entity
[00:00:47] mentions are annotated with entity resolutions
[00:00:48] resolutions which map them to unique unambiguous
[00:00:52] which map them to unique unambiguous identifiers the same identifiers that
[00:00:55] identifiers the same identifiers that are used in the kb
[00:00:59] are used in the kb so in this representation i've got the
[00:01:01] so in this representation i've got the string elon musk which is
[00:01:04] string elon musk which is just an english language string that's
[00:01:07] just an english language string that's what we call an entity mention
[00:01:09] what we call an entity mention and then i've got elon underscore musk
[00:01:12] and then i've got elon underscore musk which is an entity id
[00:01:14] which is an entity id it's a unique unambiguous identifier for
[00:01:18] it's a unique unambiguous identifier for this entity
[00:01:19] this entity in some predefined um
[00:01:24] dictionary of entity ids
[00:01:27] dictionary of entity ids and it's very common for this purpose to
[00:01:29] and it's very common for this purpose to use something like wikipedia which has
[00:01:33] use something like wikipedia which has one wikipedia page for
[00:01:36] one wikipedia page for almost any entity that you can think of
[00:01:40] almost any entity that you can think of for our investigation we're going to use
[00:01:42] for our investigation we're going to use an adaptation of the wikilinks corpus
[00:01:44] an adaptation of the wikilinks corpus which was produced by google and umass
[00:01:47] which was produced by google and umass in 2013 the full corpus contains 40
[00:01:50] in 2013 the full corpus contains 40 million entity mentions from 10 million
[00:01:53] million entity mentions from 10 million web pages
[00:01:55] web pages and each entity mention is annotated
[00:01:57] and each entity mention is annotated with a wikipedia url
[00:02:00] with a wikipedia url but we're going to use just a subset of
[00:02:02] but we're going to use just a subset of the full corpus in order to make things
[00:02:04] the full corpus in order to make things manageable
[00:02:08] so let's start to look at some of the
[00:02:09] so let's start to look at some of the code we'll use in the python notebooks
[00:02:12] code we'll use in the python notebooks for this topic
[00:02:14] for this topic um the data assets that we'll use live
[00:02:17] um the data assets that we'll use live in a subdirectory of our data directory
[00:02:19] in a subdirectory of our data directory called rel_ext_data
[00:02:22] called rel_ext_data and we've defined a class called corpus
[00:02:26] and we've defined a class called corpus which holds the examples and which lets
[00:02:28] which holds the examples and which lets you quickly look up examples containing
[00:02:31] you quickly look up examples containing specific entities
[00:02:33] specific entities so if we load our corpus we find that it
[00:02:35] so if we load our corpus we find that it contains more than 330 000 examples
[00:02:38] contains more than 330 000 examples pretty good size
[00:02:40] pretty good size it's small enough that we can work with
[00:02:42] it's small enough that we can work with it easily on an ordinary laptop
[00:02:45] it easily on an ordinary laptop but it's big enough to support effective
[00:02:47] but it's big enough to support effective machine learning
[00:02:50] machine learning and we can print out a representative
[00:02:52] and we can print out a representative example from the corpus
[00:02:55] example from the corpus actually this is a bit hard to read so
[00:02:57] actually this is a bit hard to read so let me give you a different view of the
[00:03:00] let me give you a different view of the same example
[00:03:02] same example we represent examples using the example
[00:03:05] we represent examples using the example class which is a named tuple with 12
[00:03:08] class which is a named tuple with 12 fields listed here
[00:03:10] fields listed here uh the first two fields entity one and
[00:03:13] uh the first two fields entity one and entity two contain unique identifiers
[00:03:16] entity two contain unique identifiers for the two entities mentioned
[00:03:19] for the two entities mentioned we name entities using wiki ids which
[00:03:21] we name entities using wiki ids which you can think of as the last portion of
[00:03:24] you can think of as the last portion of a wikipedia url
[00:03:28] the next five fields represent the text
[00:03:30] the next five fields represent the text surrounding the two mentions
[00:03:32] surrounding the two mentions divided into five chunks so left
[00:03:35] divided into five chunks so left contains the text before the first
[00:03:37] contains the text before the first mention
[00:03:38] mention mention one is the first mention itself
[00:03:41] mention one is the first mention itself middle contains the text between the two
[00:03:43] middle contains the text between the two mentions
[00:03:44] mentions mention two is the second mention and
[00:03:46] mention two is the second mention and right contains the text after the second
[00:03:49] right contains the text after the second mention
[00:03:52] and the last five fields contain the
[00:03:54] and the last five fields contain the same five chunks of text but this time
[00:03:56] same five chunks of text but this time annotated with part of speech tags which
[00:03:59] annotated with part of speech tags which may turn out to be useful when we start
[00:04:01] may turn out to be useful when we start building models for relation extraction
[00:04:08] now whenever you start to work with a
[00:04:09] now whenever you start to work with a new data set it's good practice to do
[00:04:12] new data set it's good practice to do some data exploration to get familiar
[00:04:14] some data exploration to get familiar with the data a big part of this is
[00:04:16] with the data a big part of this is getting a sense of the high level
[00:04:18] getting a sense of the high level characteristics of the data summary
[00:04:20] characteristics of the data summary statistics distributions and so on
[00:04:23] statistics distributions and so on for example how many entities are there
[00:04:26] for example how many entities are there and what are the most common ones
[00:04:28] and what are the most common ones here's some code that computes that
[00:04:31] here's some code that computes that and here are the results
[00:04:33] and here are the results so there are more than 95 000 unique
[00:04:35] so there are more than 95 000 unique entities
[00:04:36] entities and it looks like the most common
[00:04:38] and it looks like the most common entities are dominated by geographic
[00:04:41] entities are dominated by geographic locations
[00:04:45] now the main benefit we get from the
[00:04:46] now the main benefit we get from the corpus class is the ability to retrieve
[00:04:49] corpus class is the ability to retrieve examples containing specific entities
[00:04:52] examples containing specific entities so let's find examples containing elon
[00:04:55] so let's find examples containing elon musk and tesla motors
[00:04:59] there are five such examples and here's
[00:05:02] there are five such examples and here's the first one
[00:05:05] the first one actually this might not be all of the
[00:05:07] actually this might not be all of the examples containing elon musk and tesla
[00:05:10] examples containing elon musk and tesla motors it's only the examples where elon
[00:05:13] motors it's only the examples where elon musk was mentioned first and tesla
[00:05:15] musk was mentioned first and tesla motors was mentioned second
[00:05:18] motors was mentioned second there may be additional examples that
[00:05:20] there may be additional examples that have them in the reverse order so let's
[00:05:23] have them in the reverse order so let's check look for tesla motors elon musk
[00:05:27] check look for tesla motors elon musk sure enough two more examples in reverse
[00:05:30] sure enough two more examples in reverse order
[00:05:31] order so going forward we'll have to remember
[00:05:33] so going forward we'll have to remember to check both directions when we're
[00:05:36] to check both directions when we're looking for examples containing a
[00:05:37] looking for examples containing a specific pair of entities
[00:05:39] specific pair of entities okay a few last observations on the
[00:05:41] okay a few last observations on the corpus
[00:05:42] corpus first this corpus is not without flaws
[00:05:45] first this corpus is not without flaws as you get more familiar with it you'll
[00:05:47] as you get more familiar with it you'll probably discover that it contains many
[00:05:49] probably discover that it contains many examples that are nearly
[00:05:52] examples that are nearly but not exactly duplicates
[00:05:55] but not exactly duplicates this seems to be an artifact of the
[00:05:58] this seems to be an artifact of the web document sampling methodology that
[00:06:01] web document sampling methodology that was used in the construction of the
[00:06:03] was used in the construction of the wikilinks dataset
[00:06:05] wikilinks dataset and it winds up creating a few
[00:06:07] and it winds up creating a few distortions and we may see some examples
[00:06:09] distortions and we may see some examples of this later
[00:06:10] of this later but even though the corpus has a few
[00:06:12] but even though the corpus has a few warts
[00:06:13] warts it will serve our purposes
[00:06:15] it will serve our purposes just fine
[00:06:17] just fine one thing that this corpus does not
[00:06:19] one thing that this corpus does not include is any annotation about
[00:06:21] include is any annotation about relations
[00:06:23] relations so it could not be used for the fully
[00:06:25] so it could not be used for the fully supervised approach to relation
[00:06:27] supervised approach to relation extraction because that requires a
[00:06:29] extraction because that requires a relation label
[00:06:31] relation label on each pair of entity mentions and we
[00:06:33] on each pair of entity mentions and we don't have any such annotation here the
[00:06:35] don't have any such annotation here the only annotations that we have in this
[00:06:37] only annotations that we have in this corpus are entity resolutions mapping an
[00:06:41] corpus are entity resolutions mapping an entity mention to an entity id
[00:06:44] entity mention to an entity id that means that in order to make headway
[00:06:46] that means that in order to make headway we'll need to connect the corpus with an
[00:06:49] we'll need to connect the corpus with an external source of knowledge about
[00:06:51] external source of knowledge about relations
[00:06:52] relations we need a kb
[00:06:57] happily our data distribution does
[00:06:59] happily our data distribution does include a kb which is derived from
[00:07:02] include a kb which is derived from freebase uh freebase has an interesting
[00:07:04] freebase uh freebase has an interesting history it was created in the late 2000s
[00:07:08] history it was created in the late 2000s by a company called metaweb
[00:07:10] by a company called metaweb led by john giannandrea who
[00:07:13] led by john giannandrea who later became my boss
[00:07:16] later became my boss google acquired metaweb in 2010
[00:07:19] google acquired metaweb in 2010 and freebase became the foundation of
[00:07:22] and freebase became the foundation of google's knowledge graph
[00:07:24] google's knowledge graph unfortunately google shut freebase down
[00:07:27] unfortunately google shut freebase down in 2016 which was tragic
[00:07:30] in 2016 which was tragic but the freebase data is still available
[00:07:32] but the freebase data is still available from various sources
[00:07:35] from various sources so our kb is a collection of relational
[00:07:37] so our kb is a collection of relational triples each consisting of a relation a
[00:07:41] triples each consisting of a relation a subject and an object
[00:07:43] subject and an object so for example place of birth barack
[00:07:45] so for example place of birth barack obama honolulu has spouse barack obama
[00:07:48] obama honolulu has spouse barack obama michelle obama author the audacity of
[00:07:51] michelle obama author the audacity of hope barack obama
[00:07:53] hope barack obama so as you might guess the relation is
[00:07:55] so as you might guess the relation is one of a handful of predefined constants
[00:07:58] one of a handful of predefined constants like place of birth or has spouse
[00:08:02] like place of birth or has spouse the subject and the object are entities
[00:08:04] the subject and the object are entities represented by wiki ids it's the same id
[00:08:08] represented by wiki ids it's the same id space used in the corpus wiki ids are
[00:08:11] space used in the corpus wiki ids are basically the last part of a wikipedia
[00:08:13] basically the last part of a wikipedia url
[00:08:17] now just like we did for the corpus
[00:08:19] now just like we did for the corpus we've created a kb class to store the kb
[00:08:23] we've created a kb class to store the kb triples and some associated indexes
[00:08:26] triples and some associated indexes this class makes it easy and efficient
[00:08:29] this class makes it easy and efficient to look up kb triples both by relation
[00:08:33] to look up kb triples both by relation and by
[00:08:34] and by entities
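A minimal sketch of such a KB class, with the two lookups just described. The method names follow the lecture's wording (get triples for relation / for entities), but the implementation details are illustrative:

```python
from collections import defaultdict, namedtuple

# A KB triple: relation, subject, object (subject and object are wiki IDs).
KBTriple = namedtuple("KBTriple", ["rel", "sbj", "obj"])

class KB:
    """Toy KB with indexes by relation and by entity pair."""
    def __init__(self, triples):
        self.kb_triples = [KBTriple(*t) for t in triples]
        self._by_rel = defaultdict(list)
        self._by_ents = defaultdict(list)
        for t in self.kb_triples:
            self._by_rel[t.rel].append(t)
            self._by_ents[(t.sbj, t.obj)].append(t)
        self.all_relations = sorted(self._by_rel)

    def get_triples_for_relation(self, rel):
        return list(self._by_rel[rel])

    def get_triples_for_entities(self, sbj, obj):
        return list(self._by_ents[(sbj, obj)])

kb = KB([
    ("place_of_birth", "Barack_Obama", "Honolulu"),
    ("has_spouse", "Barack_Obama", "Michelle_Obama"),
    ("author", "The_Audacity_of_Hope", "Barack_Obama"),
])
n_rels = len(kb.all_relations)  # -> 3
```

Precomputing both indexes at load time is what makes the pair and relation lookups cheap later on.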
[00:08:36] entities so here we're just loading the data
[00:08:38] so here we're just loading the data and printing a count of the kb triples
[00:08:42] and printing a count of the kb triples uh there are 45 000 kb triples so this
[00:08:44] uh there are 45 000 kb triples so this is quite a bit smaller than the corpus
[00:08:46] is quite a bit smaller than the corpus if you remember the corpus
[00:08:48] if you remember the corpus had 330 has 330 examples
[00:08:53] had 330 has 330 examples and we can print out the first kb triple
[00:08:57] and we can print out the first kb triple so this is a kb triple that says that
[00:08:59] so this is a kb triple that says that the contains relation holds between
[00:09:02] the contains relation holds between brickfields and kuala lumpur central
[00:09:05] brickfields and kuala lumpur central central railway station which i did not
[00:09:09] central railway station which i did not know
[00:09:13] just like we did with the corpus let's
[00:09:14] just like we did with the corpus let's do some data exploration to get a sense
[00:09:17] do some data exploration to get a sense of the high level characteristics of the
[00:09:19] of the high level characteristics of the kb so first how many relations are there
[00:09:22] kb so first how many relations are there the all relations attribute of the kb
[00:09:25] the all relations attribute of the kb contains a list of its relations and it
[00:09:29] contains a list of its relations and it seems that there are 16 of them
[00:09:33] well what are the relations and how big
[00:09:35] well what are the relations and how big are they
[00:09:36] are they this code prints out a list with
[00:09:38] this code prints out a list with sizes um note the
[00:09:41] sizes um note the get triples for relation method which
[00:09:44] get triples for relation method which returns a list of the kb triples for a
[00:09:47] returns a list of the kb triples for a given
[00:09:48] given relation you begin to get a sense of
[00:09:51] relation you begin to get a sense of what kind of stuff is in this kb
[00:09:55] what kind of stuff is in this kb it looks like the contains relation is
[00:09:57] it looks like the contains relation is really big with more than 18 000 triples
[00:10:01] really big with more than 18 000 triples and there are a few relations that are
[00:10:04] and there are a few relations that are pretty small with
[00:10:06] pretty small with fewer than a thousand triples
[00:10:11] here's some code that prints
[00:10:13] here's some code that prints one example from each relation relation
[00:10:16] one example from each relation relation so that we can form a better sense of
[00:10:18] so that we can form a better sense of what they mean
[00:10:20] what they mean some of these are familiar facts like a
[00:10:23] some of these are familiar facts like a joins france spain
[00:10:26] joins france spain others might refer to unfamiliar
[00:10:28] others might refer to unfamiliar entities so for example i've never heard
[00:10:31] entities so for example i've never heard of sheridan lethanu
[00:10:34] of sheridan lethanu but i think you can quickly form an
[00:10:36] but i think you can quickly form an intuitive sense of what each relation is
[00:10:39] intuitive sense of what each relation is about
[00:10:43] now one of the most important methods in
[00:10:45] now one of the most important methods in the kb class is get triples for entities
[00:10:49] the kb class is get triples for entities which lets us look up triples by the
[00:10:51] which lets us look up triples by the entities they contain
[00:10:53] entities they contain so let's use it to see what triples
[00:10:55] so let's use it to see what triples contain france and germany
[00:10:59] okay sure they belong to the adjoins
[00:11:01] okay sure they belong to the adjoins relation that makes sense
[00:11:03] relation that makes sense now relations like adjoins are
[00:11:05] now relations like adjoins are intuitively symmetric so we'd expect to
[00:11:08] intuitively symmetric so we'd expect to find the inverse triple in the kb as
[00:11:11] find the inverse triple in the kb as well
[00:11:12] well and yep it's there
[00:11:15] and yep it's there but note that there's no guarantee that
[00:11:18] but note that there's no guarantee that such inverse triples actually appear in
[00:11:20] such inverse triples actually appear in the kb there's no guarantee that the kb
[00:11:23] the kb there's no guarantee that the kb is complete
[00:11:25] is complete and you could easily write some code to
[00:11:27] and you could easily write some code to find missing inverses
[00:11:32] now that relation adjoins is symmetric
[00:11:35] now that relation adjoins is symmetric but most relations are intuitively
[00:11:38] but most relations are intuitively asymmetric
[00:11:39] asymmetric so let's see what triples we have for
[00:11:41] so let's see what triples we have for tesla motors and elon musk
[00:11:44] tesla motors and elon musk okay they belong to the founders
[00:11:45] okay they belong to the founders relation good that's expected
[00:11:48] relation good that's expected that's an asymmetric relation
[00:11:51] that's an asymmetric relation what about the inverse
[00:11:53] what about the inverse elon musk and tesla motors
[00:11:57] okay they belong to the worked at
[00:12:00] okay they belong to the worked at relation
[00:12:01] relation seems like a funny way to describe
[00:12:03] seems like a funny way to describe elon's role at tesla but okay
[00:12:07] elon's role at tesla but okay so this shows that you can have one
[00:12:09] so this shows that you can have one relation between x and y
[00:12:12] relation between x and y and a different relation that holds
[00:12:14] and a different relation that holds between y and x
[00:12:19] one more observation
[00:12:20] one more observation there may be more than one relation that
[00:12:23] there may be more than one relation that holds between a given pair of entities
[00:12:25] holds between a given pair of entities even in one direction
[00:12:27] even in one direction so uh for example let's see what triples
[00:12:29] so uh for example let's see what triples hold uh what triples contain cleopatra
[00:12:32] hold uh what triples contain cleopatra and ptolemy the 13th theos philopator
[00:12:36] and ptolemy the 13th theos philopator
[00:12:40] oh my goodness this pair belongs to both
[00:12:43] oh my goodness this pair belongs to both the has sibling relation
[00:12:45] the has sibling relation and the has spouse relation
[00:12:49] and the has spouse relation to which i can only say
[00:12:54] moving right along
[00:12:57] moving right along let's look at the distribution of
[00:12:59] let's look at the distribution of entities in a kb how many entities are
[00:13:02] entities in a kb how many entities are there and what are the most common ones
[00:13:05] there and what are the most common ones well here's some code that computes that
[00:13:09] there are 40 000 entities in the kb so
[00:13:13] there are 40 000 entities in the kb so that's fewer than half as many entities
[00:13:15] that's fewer than half as many entities as in the corpus if you remember the
[00:13:17] as in the corpus if you remember the corpus has
[00:13:18] corpus has 95 000 unique entities so there are lots
[00:13:22] 95 000 unique entities so there are lots of entities
[00:13:23] of entities in the corpus that don't appear in the
[00:13:25] in the corpus that don't appear in the kb at all
[00:13:27] kb at all but just like the corpus the most common
[00:13:29] but just like the corpus the most common entities are dominated by geographic
[00:13:31] entities are dominated by geographic locations england india italy and so on
[00:13:38] note that there's no promise or
[00:13:40] note that there's no promise or expectation that this kb is complete
[00:13:44] expectation that this kb is complete for one thing the kb doesn't even
[00:13:46] for one thing the kb doesn't even contain many of the entities from the
[00:13:48] contain many of the entities from the corpus and even for the entities it does
[00:13:50] corpus and even for the entities it does include
[00:13:51] include there may be possible triples which are
[00:13:54] there may be possible triples which are true in the world
[00:13:56] true in the world but are missing from the kb
[00:13:59] but are missing from the kb so as an example these triples are in
[00:14:01] so as an example these triples are in the kb founders tesla motors elon musk
[00:14:04] the kb founders tesla motors elon musk worked at elon musk tesla motors
[00:14:06] worked at elon musk tesla motors founders spacex elon musk you might
[00:14:09] founders spacex elon musk you might expect to find worked at elon musk
[00:14:12] expect to find worked at elon musk spacex but nope that triple is not in
[00:14:15] spacex but nope that triple is not in the kb
[00:14:16] the kb that's weird
[00:14:18] that's weird well in fact the whole point of relation
[00:14:20] well in fact the whole point of relation extraction is to identify
[00:14:23] extraction is to identify new relational triples from natural
[00:14:25] new relational triples from natural language text so that we can add them to
[00:14:27] language text so that we can add them to a kb
[00:14:28] a kb if our kbs were complete
[00:14:30] if our kbs were complete we wouldn't have anything to do
[00:14:33] we wouldn't have anything to do now actually in this case you might
[00:14:36] now actually in this case you might object that we don't need to do relation
[00:14:38] object that we don't need to do relation extraction
[00:14:39] extraction to make that completion we could write
[00:14:42] to make that completion we could write some logic that recognizes that um
[00:14:46] some logic that recognizes that um founders x y entails worked at y x
[00:14:51] founders x y entails worked at y x and apply that rule systematically
[00:14:53] and apply that rule systematically across the kb and use that to fill in
[00:14:56] across the kb and use that to fill in the missing triple in this case but the
[00:14:59] the missing triple in this case but the general point still stands that there
[00:15:01] general point still stands that there may be lots of triples that are true in
[00:15:03] may be lots of triples that are true in the world but missing from the kb
[00:15:06] the world but missing from the kb where that strategy is not going to
[00:15:08] where that strategy is not going to allow us to to add the missing
[00:15:11] allow us to to add the missing information
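The rule-based completion mentioned above (founders x y entails worked at y x) can be applied across the whole KB in a few lines. Toy triples here for illustration:

```python
# Apply the entailment founders(X, Y) => worked_at(Y, X) across a toy
# KB and collect the triples it generates that are not already present.
kb = {
    ("founders", "Tesla_Motors", "Elon_Musk"),
    ("worked_at", "Elon_Musk", "Tesla_Motors"),
    ("founders", "SpaceX", "Elon_Musk"),
}

inferred = {
    ("worked_at", obj, sbj)
    for rel, sbj, obj in kb
    if rel == "founders"
} - kb
# inferred -> {("worked_at", "Elon_Musk", "SpaceX")}
```

As the lecture notes, this only fills gaps that follow logically from triples already in the KB; triples true in the world but absent from the KB still require relation extraction.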
Lecture 045
Problem Formulation | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=JLHL5jAHODs
---
Transcript
[00:00:05] so i now want to turn to the question of
[00:00:07] so i now want to turn to the question of how to formulate our prediction problem
[00:00:10] how to formulate our prediction problem precisely
[00:00:12] precisely i want to be precise about how we're
[00:00:13] i want to be precise about how we're defining the inputs and outputs of our
[00:00:16] defining the inputs and outputs of our predictions
[00:00:17] predictions and that in turn is going to have
[00:00:19] and that in turn is going to have consequences for how we join the corpus
[00:00:22] consequences for how we join the corpus and the kb
[00:00:24] and the kb uh how we construct negative examples
[00:00:26] uh how we construct negative examples for our learning algorithms
[00:00:28] for our learning algorithms and how we handle multi-label
[00:00:31] and how we handle multi-label classification
[00:00:34] so first what is the input to our
[00:00:36] so first what is the input to our prediction problem
[00:00:38] prediction problem in the supervised learning paradigm
[00:00:40] in the supervised learning paradigm the input is a pair of entity mentions
[00:00:44] the input is a pair of entity mentions in the context of a specific sentence
[00:00:46] in the context of a specific sentence we're trying to label a sentence just
[00:00:49] we're trying to label a sentence just like we do in part of speech tagging or
[00:00:52] like we do in part of speech tagging or sentiment analysis
[00:00:54] sentiment analysis but in the distant supervision paradigm
[00:00:57] but in the distant supervision paradigm we'll do things differently
[00:00:59] we'll do things differently the input will be a pair of entities
[00:01:02] the input will be a pair of entities full stop
[00:01:03] full stop independent of any specific context
[00:01:07] independent of any specific context we're trying to determine the relation
[00:01:08] we're trying to determine the relation between this entity and that entity
[00:01:11] between this entity and that entity and that's it
[00:01:14] and that's it the other question i want to look at is
[00:01:16] the other question i want to look at is what's the output of the prediction
[00:01:17] what's the output of the prediction problem are we trying to assign
[00:01:20] problem are we trying to assign a pair of entities to a single relation
[00:01:24] a pair of entities to a single relation that's called multi-class classification
[00:01:27] that's called multi-class classification or are we trying to assign a pair of
[00:01:28] or are we trying to assign a pair of entities to multiple relations that's
[00:01:31] entities to multiple relations that's called multi-label classification and
[00:01:34] called multi-label classification and it's a different beast
[00:01:36] it's a different beast so over the next couple of slides i want
[00:01:38] so over the next couple of slides i want to explore the consequences of these
[00:01:40] to explore the consequences of these choices
[00:01:41] the difference between these two ways of thinking about the input becomes really important when we talk about how we're going to join the corpus and the kb. in order to leverage the distant supervision paradigm we need to connect those two: we need to connect information in the corpus with information in the kb.
[00:02:00] and there are two different possibilities, depending on how we formulate the prediction problem, depending on how we define the input to the problem.
[00:02:09] if our problem is to classify a pair of entity mentions in a specific example in the corpus, in a specific sentence, then we can use the kb to provide the label. this is what it looks like: we have a corpus example like this, and we're trying to label this specific example. to do it, we can check to see whether these two entities are related in the kb. yep, they are, and we can use that to generate a label for this example.
[00:02:45] labeling specific examples is how the fully supervised paradigm works, so it's an obvious way to think about leveraging distant supervision as well. it can be made to work, but it's not actually the preferred approach. if we do it this way, we'll be doing things exactly as they're done in the supervised paradigm. it does work, but it's not the best way to take advantage of the opportunity that distant supervision creates.
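As a concrete sketch of this mention-level KB-lookup labeling: the `Example` and `Triple` tuples and the `label_example` helper below are hypothetical simplifications for illustration, not the course's actual data structures.

```python
from collections import namedtuple

# Hypothetical stand-ins for the course's corpus/KB structures.
Example = namedtuple("Example", ["entity_1", "entity_2", "sentence"])
Triple = namedtuple("Triple", ["relation", "entity_1", "entity_2"])

def label_example(example, kb_triples):
    """Distant supervision at the mention level: label one corpus example
    by looking up its entity pair in the KB. Returns the set of relations
    the KB asserts for the pair (empty if the KB has none)."""
    return {t.relation for t in kb_triples
            if (t.entity_1, t.entity_2) == (example.entity_1, example.entity_2)}

kb = [Triple("founder", "Elon_Musk", "SpaceX"),
      Triple("worked_at", "Elon_Musk", "SpaceX")]
ex = Example("Elon_Musk", "SpaceX", "Elon Musk founded SpaceX in 2002.")
print(label_example(ex, kb))  # the KB licenses both relations for this pair
```

This is exactly the fully supervised recipe with the KB standing in for hand labels, which is why it inherits that paradigm's limitations.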
[00:03:18] there's another way of doing things, where instead we define our problem as classifying a pair of entities: not entity mentions in a specific sentence, but just entities, elon musk and tesla, period, apart from any sentence. and if that's how we define the input to our problem, then we can use the corpus to provide a feature representation that will be the input to the prediction.
[00:03:50] so if we have an entity pair like elon musk and spacex that we're considering adding to a relation in the kb, we can find all sentences in the corpus containing this pair of entities, and then we can use all of those sentences to generate a feature representation for this pair. in this example i'm imagining (it doesn't have to be this way) that we're using a simple bag-of-words feature representation. the bag of words comes from the middle, that is, the phrase between the two entity mentions, the blue phrases here. all i've done is count up the words in all of these blue phrases, across all of the examples in the corpus where these two entities co-occur. you can see in the token counts that they include tokens from the various examples. all of these examples together are used to generate a single feature representation for this pair, and it's this feature representation that my learned model will use to make the prediction about this pair.
[00:05:19] so this is a very interesting way of reversing things. instead of using the kb to generate a label to make a prediction about a specific pair of entity mentions in a specific sentence, i'm turning things around: i'm using the corpus to generate a feature representation that i will use to make a prediction about an entity pair in the abstract, an entity pair considered just as a pair of entities.
[00:05:50] just one more thought on this, still on the topic of joining the corpus and the kb. we've created a dataset class which does that, which combines a corpus and a kb. it just kind of staples them together and provides a variety of convenience methods, and one of those convenience methods is count_examples, which shows, for each relation, how many examples we have in the corpus, how many triples we have in the kb, and the ratio: the total number of examples and the average number of examples per triple.
[00:06:30] for most relations the total number of examples is fairly large, so we can be optimistic about learning which linguistic patterns express a given relation. even the smallest one has at least 1,500 examples. that's not industrial-grade data, but it's certainly enough for the kind of exploration that we're doing here.
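A toy version of such a tally could look like the following; the real `count_examples` method's inputs and output format surely differ, so treat this as a sketch of the bookkeeping only:

```python
from collections import Counter

def count_examples(corpus_pairs, kb_triples):
    """For each relation, tally corpus examples, KB triples, and the
    examples-per-triple ratio. Hypothetical simplified inputs:
    corpus_pairs is a list of (entity_1, entity_2) mention pairs;
    kb_triples is a list of (relation, entity_1, entity_2) tuples."""
    pair_freq = Counter(corpus_pairs)
    stats = {}
    for rel in {t[0] for t in kb_triples}:
        triples = [t for t in kb_triples if t[0] == rel]
        n_examples = sum(pair_freq[(e1, e2)] for _, e1, e2 in triples)
        stats[rel] = (n_examples, len(triples), n_examples / len(triples))
    return stats

stats = count_examples(
    [("a", "b"), ("a", "b"), ("c", "d")],
    [("r1", "a", "b"), ("r1", "c", "d")])
print(stats["r1"])  # (3, 2, 1.5): 3 examples, 2 triples, ratio 1.5
```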
[00:06:56] however, for individual entity pairs the number of examples is often quite low: some of these ratios are between one and two. of course more data would be better, much better, but more data could quickly become unwieldy to work with in a notebook like this, especially if you're running on an ordinary laptop, and this data is going to be enough to allow us to have a fruitful investigation.
[00:07:29] first i want to talk about negative examples. by joining the corpus to the kb we can get lots of positive examples for each relation, but we can't train a classifier on positive examples alone: we're also going to need some negative examples, negative instances, that is, entity pairs that don't belong to any relation. we can find such pairs by searching the corpus for examples which contain two entities which don't belong to any relation in the kb.
[00:08:03] so we wrote some code to do this: there's a method on the dataset class called find_unrelated, and when we run it, wow, it found almost 250,000 unrelated pairs. that's 250,000 negative instances for our prediction problem, way more than the number of positive instances. if you remember, the kb has 46,000 triples, and each of those is basically a positive instance, something that we know is definitely a positive example of the relation. here we have 250,000 negative examples. it's so many more that when we train models, we'll wind up downsampling the negative instances substantially, so that we have a somewhat more balanced distribution.
[00:09:00] a reminder, though: some of these supposedly negative instances may be false negatives. they may be entity pairs that don't appear to be related but in the real world actually are. our kb is not complete: a pair of entities might be related in real life even if they don't appear together in the kb. and as i said earlier, that's the whole point, the whole reason we're doing relation extraction: to find things that are true in real life, and true according to some text that somebody wrote, but aren't yet in our kb.
[00:09:44] okay, now i'm going to come to the question that was asked about pairs that belong to multiple relations, which is related to the question of the outputs of our prediction problem. we wrote some code to check the kb for entity pairs that belong to more than one relation: that's this method, count_relation_combinations. it turns out this is a really common phenomenon in the kb; there are lots of pairs that belong to multiple relations. for example (i won't even mention the most common one), there are 143 people in the kb whose place of birth is the same as their place of death. actually that's not that surprising, right? that makes perfect sense. it even turns out that there are no fewer than seven people who married a sibling.
[00:10:43] well, since lots of entity pairs belong to more than one relation, we probably don't want to be forced to predict a single relation. so this suggests formulating our problem as multi-label classification: we want our models to be able to predict multiple relations for any given entity pair.
[00:11:08] there are a number of ways to approach multi-label classification, but the most obvious is the binary relevance method, which just factors multi-label classification over n labels into n independent binary classification problems, one for each label. so if you have a pair like pericles and athens, and you want to be able to predict any combination of these labels, you just train a separate model, a separate binary classifier, for each of the labels independently. each of them generates a prediction independently, and in this example we've predicted that the place_of_birth relation applies and the place_of_death relation applies, but not the has_sibling relation.
[00:11:57] a disadvantage of this approach is that, because it treats the binary classification problems as independent, it fails to exploit correlations between labels. for example, there may well be a correlation between the place_of_birth label and the place_of_death label, and if you already have evidence that the place_of_birth label applies, that might tilt you at least a little bit toward saying yes for place_of_death. this approach of factoring them into independent binary classification problems is not able to take advantage of that information.
[00:12:39] but it has the great virtue of simplicity: it's incredibly straightforward, incredibly easy to think about and to implement, and it'll suffice for our purposes. it's going to make the investigation move forward very smoothly.
[00:12:56] so, i want to sum up a little bit. we set out to establish a precise formulation of our prediction problem, and when we put all the pieces together, here's the problem formulation we've arrived at: the input to the prediction will be an entity pair and a candidate relation, and the output will be a boolean indicating whether the entity pair belongs to the relation.
[00:13:26] since a kb triple is precisely a relation and a pair of entities, we could say equivalently that our prediction problem amounts to binary classification of kb triples: given a candidate kb triple like (worked_at, elon musk, spacex), do we predict that it's valid?
[00:13:45] this is really nice, because it's a very simple way of thinking about what problem we're taking on. we have a bunch of positive examples, which come from our kb, and we have a bunch of negative examples, which we synthesize from the corpus using pairs which co-occur in the corpus but don't occur in the kb. now we have lots of data consisting of candidate kb triples, including positive examples and negative examples, and we can use that data both for training and for evaluation.
[00:14:20] and once we've trained a model to do this binary classification, we can consider novel kb triples which don't appear anywhere in our data and ask whether the model will predict them to be true. by doing that, we may discover new relation instances that are not currently part of the kb and that could be candidates for adding.
Lecture 046
Evaluation | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=JIBcv-grQIc
---
Transcript
[00:00:05] last time i introduced the task of relation extraction, described the corpus and the kb that we're going to use, and proposed a precise formulation of our prediction problem. so now let's talk about how we're going to measure success on this problem: we need to define a quantitative evaluation that can drive a process of iterative development.
[00:00:31] in this section i'm going to first make a connection to the software engineering principle of test-driven development. then i'm going to explain how we'll split our data into training and evaluation data, i'll do a brief refresher on precision, recall, and f-measure, and i'll review the distinction between micro-averaging and macro-averaging. by the end we'll know exactly how we're going to measure success.
[00:01:00] when you start working on a new machine learning problem, it's very tempting to jump in and start building models right away, because you're bursting with ideas and you can't wait to get started. but whoa, nelly! that's like driving cross-country without a map: there are going to be lots of forks in the road, and you won't know which way to go.
[00:01:23] there's a better way. in software engineering we use test-driven development: first write the tests, then write code and iterate until it passes the tests. in model engineering we can use a similar paradigm: first implement a quantitative evaluation. specify your evaluation dataset, choose your evaluation metric, and build a test harness that takes a model and generates a score. then when you start building models, you can hill-climb on this score, and at those forks in the road where you could do it this way or that way, your quantitative evaluation will tell you which way to go.
[00:02:08] now, whenever we build a model from data, it's good practice to partition the data into multiple splits, minimally a training split and a test split. actually, here we'll go a bit further and define multiple splits. first we'll have a tiny split with just one percent of the data. having a tiny split is super useful, and i encourage you to adopt this practice whenever you take on a new prediction problem: during the early stages of development you can use the tiny split as training data or test data or both, and your experiments will run super fast. of course your quantitative evaluations will be pretty much meaningless, but it's a great way to quickly flush out any bugs in your setup.
[00:02:57] then we'll have the train split with 74 percent of the data; this is the data that we'll usually use for model training. then the dev split with 25 percent of the data; we'll use this as test data for intermediate evaluations during development. and for the bake-off we're also going to have a separate test split, but you won't have access to it, so we won't talk about it here.
[00:03:22] won't talk about it here there's one complication we need to
[00:03:23] there's one complication we need to split both the corpus and the kb
[00:03:27] split both the corpus and the KB. We want each relation to appear in both the training data and the test data, so that we can assess how well we've learned how each relation is expressed in natural language. But ideally, we'd like to have any given entity appear in only one split; otherwise we might be leaking information from the training data into the test data.

[00:03:50] In an ideal world, each split would have its own hermetically sealed universe of entities, and both the corpus and the KB for that split would refer only to those entities. So, for example, you might have a "new world" corpus whose examples mention only new-world entities like Elon Musk and Bill Gates and Steve Jobs, and a "new world" KB which contains only triples about those same new-world entities, and then an "old world" corpus that talks about Daniel Ek and Jack Ma and Pony Ma, and a corresponding "old world" KB. If we had this, then we could achieve a really clean separation between train and test data, with no overlap in entities.

[00:04:38] But in practice, the world is strongly entangled and this ideal is hard to achieve, so instead we're going to approximate the ideal.
[00:04:47] approximate the ideal i think i won't dwell on the details but
[00:04:49] i think i won't dwell on the details but we've written the code for you to
[00:04:51] we've written the code for you to achieve a good enough split
[00:04:54] achieve a good enough split in particular the dataset class provides
[00:04:56] in particular the dataset class provides a method called build splits which lets
[00:04:58] a method called build splits which lets you specify split names
[00:05:01] you specify split names and
[00:05:02] and proportions and a random seed
[00:05:04] proportions and a random seed and it just returns a map from split
[00:05:07] and it just returns a map from split names to data sets each containing a
[00:05:10] names to data sets each containing a corpus and a kb
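The idea behind build_splits can be sketched in a few lines. This is a minimal, self-contained approximation of the strategy the lecture describes (assign each entity to one split, then drop cross-split examples and triples); the actual CS224U `Dataset.build_splits` implementation differs in its details, and the data shapes here (tuples for examples and triples) are simplifications.

```python
import random
from collections import defaultdict

def build_splits(corpus, kb, split_names=("train", "dev"),
                 split_fracs=(0.8, 0.2), seed=1):
    """Approximate entity-disjoint splits (sketch of the lecture's idea).

    corpus: list of (subject, object, middle) examples
    kb: list of (relation, subject, object) triples
    Returns a dict mapping split name -> {"corpus": [...], "kb": [...]}.
    """
    rng = random.Random(seed)
    # Assign each entity to a split at random, with the given proportions.
    entities = sorted({e for (_, s, o) in kb for e in (s, o)})
    assignment = {}
    for e in entities:
        r, cum = rng.random(), 0.0
        for name, frac in zip(split_names, split_fracs):
            cum += frac
            if r <= cum:
                assignment[e] = name
                break
        else:
            assignment[e] = split_names[-1]
    splits = {name: {"corpus": [], "kb": []} for name in split_names}
    # Keep only items whose two entities landed in the same split;
    # cross-split items are dropped -- that's the "approximation".
    for (s, o, middle) in corpus:
        if s in assignment and assignment.get(s) == assignment.get(o):
            splits[assignment[s]]["corpus"].append((s, o, middle))
    for (rel, s, o) in kb:
        if assignment[s] == assignment[o]:
            splits[assignment[s]]["kb"].append((rel, s, o))
    return splits
```

Because the world is entangled, some examples and triples straddle splits and get discarded, which is the price of keeping entities (mostly) disjoint.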
[00:05:14] corpus and a kb so now that we have our splits we need
[00:05:15] so now that we have our splits we need to choose an evaluation metric
[00:05:18] to choose an evaluation metric we've formulated our problem as binary
[00:05:20] we've formulated our problem as binary classification and the standard metrics
[00:05:22] classification and the standard metrics for binary classification are precision
[00:05:25] for binary classification are precision and recall so here's an example where we
[00:05:27] and recall so here's an example where we have a hundred problem instances
[00:05:30] have a hundred problem instances the rows of this table represent the
[00:05:32] the rows of this table represent the actual labels 88 are labeled false and
[00:05:36] actual labels 88 are labeled false and only 12 are labeled true
[00:05:38] only 12 are labeled true so this is a skewed distribution
[00:05:41] so this is a skewed distribution the columns of this table represent the
[00:05:43] the columns of this table represent the labels predicted by our model so 95 are
[00:05:47] labels predicted by our model so 95 are predicted to be false and 5 true
[00:05:50] predicted to be false and 5 true now there are 89 instances where the
[00:05:54] now there are 89 instances where the predicted label agrees with the actual
[00:05:56] predicted label agrees with the actual label so the accuracy of this model is
[00:05:59] label so the accuracy of this model is 89
[00:06:01] 89 but accuracy is not a great evaluation
[00:06:04] but accuracy is not a great evaluation metric especially when you have a skewed
[00:06:06] metric especially when you have a skewed distribution like this
[00:06:08] distribution like this because even a model that ignores the
[00:06:10] because even a model that ignores the data
[00:06:11] data and always guesses false can get 88
[00:06:14] and always guesses false can get 88 accuracy just by always guessing false
[00:06:18] accuracy just by always guessing false so instead of accuracy we look at
[00:06:21] so instead of accuracy we look at precision which says of the instances
[00:06:24] precision which says of the instances that are predicted to be true what
[00:06:26] that are predicted to be true what proportion are actually true
[00:06:29] proportion are actually true and recall which says of the instances
[00:06:32] and recall which says of the instances which are actually true what proportion
[00:06:34] which are actually true what proportion are predicted to be true
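The individual cells of the confusion table aren't all read out above, but they follow from the marginals given (100 instances, 12 actually true, 5 predicted true, accuracy 89%): solving the arithmetic forces TP = 3, FP = 2, FN = 9, TN = 86. A quick check:

```python
# Recover the confusion-matrix cells from the marginals in the lecture's example.
total, actual_true, pred_true, correct = 100, 12, 5, 89

# correct = TP + TN, and TN = (total - actual_true) - (pred_true - TP),
# so correct = 2*TP + (total - actual_true) - pred_true.
tp = (correct - (total - actual_true) + pred_true) // 2
fp = pred_true - tp
fn = actual_true - tp
tn = total - tp - fp - fn

precision = tp / (tp + fp)   # of predicted-true, how many are actually true
recall = tp / (tp + fn)      # of actually-true, how many we predicted true
accuracy = (tp + tn) / total
```

This gives precision 0.60 and recall 0.25, the same pair of numbers used in the harmonic-mean example that follows.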
[00:06:37] are predicted to be true so that's great precision and recall are
[00:06:41] so that's great precision and recall are really useful
[00:06:43] really useful but
[00:06:43] but having two evaluation metrics is often
[00:06:46] having two evaluation metrics is often inconvenient if we're considering a
[00:06:49] inconvenient if we're considering a change to our model which improves
[00:06:51] change to our model which improves precision
[00:06:52] precision but degrades recall
[00:06:54] but degrades recall should we take it
[00:06:56] should we take it in order to drive an iterative
[00:06:58] in order to drive an iterative development process it's useful to have
[00:07:00] development process it's useful to have a single metric
[00:07:02] a single metric on which to hillclimb
[00:07:06] so for binary classification the
[00:07:07] so for binary classification the standard answer is the f1 score which is
[00:07:10] standard answer is the f1 score which is the harmonic mean
[00:07:12] the harmonic mean of precision and recall
[00:07:14] of precision and recall the harmonic mean is
[00:07:16] the harmonic mean is the reciprocal of the arithmetic mean
[00:07:19] the reciprocal of the arithmetic mean the average of the reciprocals and it's
[00:07:22] the average of the reciprocals and it's always less than the arithmetic mean
[00:07:26] always less than the arithmetic mean it's pessimistic in the sense that it's
[00:07:28] it's pessimistic in the sense that it's always closer to the lower number
[00:07:30] always closer to the lower number so the arithmetic mean of 60 and 25
[00:07:34] so the arithmetic mean of 60 and 25 is sorry the harmonic mean of 60 and 25
[00:07:37] is sorry the harmonic mean of 60 and 25 percent is 35.3
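That definition is easy to verify numerically with the 60% / 25% example:

```python
def harmonic_mean(p, r):
    # Reciprocal of the arithmetic mean of the reciprocals.
    return 2.0 / (1.0 / p + 1.0 / r)

p, r = 0.60, 0.25
arith = (p + r) / 2          # 0.425
harm = harmonic_mean(p, r)   # ~0.353: pessimistic, closer to the lower value
```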
[00:07:41] Now, the F1 score gives equal weight to precision and recall, but depending on the application, they might not be of equal importance. In relation extraction, we probably care more about precision than recall, and that's because adding an invalid triple to the KB is more harmful than failing to add a valid one.

[00:08:10] So instead we could use the F-measure, which is a generalization of F1: it's a weighted harmonic mean of precision and recall, and the parameter beta controls how much more importance you place on recall than on precision. So let's say that in a particular evaluation you have high precision, 80%, and low recall, 20%. The F1 score gives equal weight to precision and recall, so its value is 32%. If we set beta equal to 0.5, we're giving more weight to precision, so the value is 50%. If we set beta equal to 2, we're giving more weight to recall, so the value is 23.5%.
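All three numbers fall out of the standard F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R):

```python
def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta < 1 favors precision; beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = 0.80, 0.20
f1 = f_beta(p, r)         # 0.32  (equal weight)
f05 = f_beta(p, r, 0.5)   # 0.50  (precision-weighted)
f2 = f_beta(p, r, 2.0)    # ~0.235 (recall-weighted)
```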
[00:08:59] In relation extraction, precision is more important than recall, so let's go with F0.5 as our evaluation metric. Okay, another issue that comes up in evaluation scenarios like this is whether to use micro-averaging or macro-averaging.
[00:09:15] We're going to compute precision, recall, and F-score separately for each relation, but in order to drive iterative development, we'd like to have summary metrics which aggregate across all of the relations. And there are two possible ways to do this: micro-averaging gives equal weight to each problem instance, which means that it gives more weight to relations with more instances; macro-averaging gives equal weight to each relation.

[00:09:46] So let me show you an illustration of this. This is an artificial example where I have just three relations, and the contains relation has ten times as many instances as the other two relations. It also has the highest F-score. When I compute the micro-average and the macro-average: well, the micro-average gives equal weight to each problem instance, so it gives a lot more weight to the contains relation, and the result is that the micro-averaged F-score is very close to the F-score for contains, whereas the macro-average gives equal weight to each relation, and so it's right in the middle of this range.

[00:10:31] The micro-averaged F-score is probably not what we want, because the number of instances per relation is kind of an accident of our data collection methodology. It's not like we believe that the contains relation is more important than the other relations; it just happens to be more numerous in the data that we collected. So we're going to use macro-averaging, so that we don't overweight large relations.
[00:11:02] So to put it all together, the bottom line is that with every evaluation we're going to report lots of metrics, but there's one metric that we're going to focus on, and this will be our figure of merit. It's the one number that we're going to be hill-climbing on, and we're choosing as our figure of merit the macro-averaged F0.5 score.
Lecture 047
Simple Baselines | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=70FS4wUJjWQ
---
Transcript
[00:00:05] It's good methodological practice, whenever you're starting to build new models, to start by evaluating very simple models which establish baselines to which you can then compare the more sophisticated models that you're going to build later on. So to do that, we're going to start by looking at three simple models: a random guesser, a very simple phrase-matching strategy, and then our first machine-learning-based approach, which will be a simple bag-of-words classifier.
[00:00:36] Just about the simplest possible model is one that doesn't even look at the input, but just flips a coin. And I strongly encourage you, whenever you're embarking on a model-building adventure, in your final project or wherever: start by evaluating a random guesser. It's a snap to implement, it can help to work out the kinks in your test harness, and it's often very informative to put a floor under what good scores look like.
[00:01:05] Now, we've written an evaluation method for you. It's in the rel_ext module and it's just called evaluate. You invoke it with your splits, your classifier, and the name of the split that you want to evaluate on, which defaults to dev.
[00:01:23] When we evaluate our random guesser, we have some interesting results. We have results separated for each of the relations, and for each one we have precision, recall, and F-score (remember, that's F0.5, which gives more weight to precision than to recall). We have the support, which is the number of instances whose actual label is true, and we have size, which is just the total number of instances.

[00:01:52] We find that recall is generally right around 0.5, and this makes sense, because recall says: of the instances which are actually true, what proportion do we predict true? Well, we predict true about half the time, because we're just flipping a coin. Precision, on the other hand, is generally quite poor, because precision says: of the instances where we predict true (which are basically a random sample, because we're just flipping a coin), how many are actually true? Well, relatively few. And actually, you can tell that by looking at the ratio between support and size: that ratio is how many of the instances are actually true. So when we're tossing a coin, the precision should be right around the ratio between support and size.

[00:02:50] Our F-score is also generally poor. It stays close to precision for two reasons: number one, because the harmonic mean stays closer to the lower number, and number two, because we're using F0.5, which gives more weight to precision than to recall. And the bottom line: our macro-averaged F-score is 9.7%. So that's the number to beat. It's a pretty low bar, but this is a random guesser, after all.
[00:03:24] Okay, so let's look at another approach which is very simple but smarter than random guessing, and it's a simple pattern-matching strategy. The idea is, for each relation, let's go through the corpus and find the most common phrases that connect two entities that stand in that relation (the most common "middles", in our terminology).

[00:03:48] So here's some code that does that. I won't go through it in detail, but one thing to note is that it counts separately the middles that connect subject with object: it gets all the examples, tallies up the middles, and does that separately from the examples that connect object with subject, storing them in separate dictionaries under the keys forward and reverse. So we're going to have forward middles and reverse middles, stored and counted separately.
[00:04:30] If we run that code, here's what we get. I'm only going to show results for three of the relations here, not all 16; all 16 are in the Python notebook if you want to take a look. But even from this sample, there are a few things that jump out.

[00:04:46] First, some of the most frequent middles are really natural and intuitive. For example, ", starring" indicates a reverse film_performance relation, so that would be one where the film comes first and the actor comes second. I think that makes perfect sense: "Star Wars, starring Mark Hamill". Similarly, ", son of" indicates a forward parents relation, so this would be one where the child comes first and the parent comes second. So those are extremely intuitive, and it's reassuring to see them near the top of the list of most common middles.

[00:05:35] Another observation is that punctuation and stop words, like the comma and "and", are extremely common. Unlike in some other NLP applications, it's probably a bad idea to throw these away: they carry lots of useful information. On the other hand, punctuation and stop words tend to be highly ambiguous. For example, if you look across the full range of all 16 relations, you'll see that a bare comma is a likely middle for almost every relation in at least one direction. So that comma does very often indicate a relation, but it's a really ambiguous indicator.
[00:06:18] Okay, now that we've identified the most common middles for each relation, it's straightforward to build a classifier based on that information: a classifier that predicts true for a candidate KB triple just in case the two entities in the triple appear in the corpus connected by one of the phrases that we just discovered. I don't show the code for that here, but it's in the Python notebook for this unit.
[00:06:43] unit and when we evaluate this approach we
[00:06:45] and when we evaluate this approach we see some really interesting results
[00:06:48] see some really interesting results first recall is much worse across the
[00:06:51] first recall is much worse across the board and that makes sense because we're
[00:06:53] board and that makes sense because we're no longer just guessing randomly before
[00:06:57] no longer just guessing randomly before we were saying true half the time
[00:07:00] we were saying true half the time now we're going to be a lot more
[00:07:01] now we're going to be a lot more selective about what we say true to
[00:07:04] selective about what we say true to but precision
[00:07:06] but precision and f score have improved dramatically
[00:07:09] and f score have improved dramatically for several relations
[00:07:11] for several relations especially for a joins and author and
[00:07:14] especially for a joins and author and has sibling and has spouse
[00:07:18] has sibling and has spouse then again there are many other
[00:07:19] then again there are many other relations where precision and f score
[00:07:22] relations where precision and f score are still quite poor
[00:07:24] are still quite poor including this one genre where we get
[00:07:26] including this one genre where we get straight zeros across the board i'm not
[00:07:28] straight zeros across the board i'm not quite sure what happened there
[00:07:31] quite sure what happened there but it indicates that although
[00:07:33] but it indicates that although although things have improved a lot in
[00:07:35] although things have improved a lot in some places
[00:07:36] some places they're still rather poor in others
[00:07:39] they're still rather poor in others and our macro average f score has
[00:07:42] and our macro average f score has improved only modestly so it improved
[00:07:44] improved only modestly so it improved from 9.7 percent to 11.1 percent we're
[00:07:48] from 9.7 percent to 11.1 percent we're heading in the right direction but you'd
[00:07:50] heading in the right direction but you'd have to say that's still pretty
[00:07:51] have to say that's still pretty unimpressive
[00:07:53] unimpressive to make significant gains we're going to
[00:07:55] to make significant gains we're going to need to apply machine learning
[00:07:59] So let's get started on that. We're going to build a very simple classifier, using an approach that should be familiar from our look at sentiment analysis last week, and we're going to start by defining a very simple bag-of-words feature function.

[00:08:15] So here's the code for that, and let me briefly walk you through it. To get the features for a KB triple (that's the kbt here), we're going to find all of the corpus examples containing the two entities in the KB triple, the subject and the object. Note that we do that in both directions: subject and object, and then also object and subject. For each example, we look at the middle, we break it into words, and then we count up all the words.
[00:08:51] and then we count up all the words so couple things to note here one is
[00:08:54] so couple things to note here one is that the feature representation for one
[00:08:56] that the feature representation for one kb triple
[00:08:58] kb triple can be derived from many corpus examples
[00:09:02] can be derived from many corpus examples and this is the point that i was trying
[00:09:03] and this is the point that i was trying to make uh last time
[00:09:06] to make uh last time that we're using the corpus to generate
[00:09:09] that we're using the corpus to generate features for a candidate kb triple
[00:09:13] features for a candidate kb triple and the role of the corpus
[00:09:15] and the role of the corpus is to provide the feature representation
[00:09:17] is to provide the feature representation and the feature representation for a kb
[00:09:19] and the feature representation for a kb triple will be based on all of the
[00:09:22] triple will be based on all of the examples in the corpus that contain
[00:09:24] examples in the corpus that contain those two entities
[00:09:26] those two entities the other observation to make here is
[00:09:28] the other observation to make here is that we make no distinction between
[00:09:30] that we make no distinction between what you might call forward examples
[00:09:32] what you might call forward examples which have subject first and then object
[00:09:35] which have subject first and then object and reverse examples which have object
[00:09:37] and reverse examples which have object and then subject we're lumping them all
[00:09:39] and then subject we're lumping them all together
[00:09:40] together the words that come from the middles of
[00:09:43] the words that come from the middles of examples in either direction all get
[00:09:46] examples in either direction all get lumped together into one feature counter
[00:09:49] lumped together into one feature counter and you might have qualms about whether
[00:09:50] and you might have qualms about whether that's really the smartest thing to do
[00:09:55] So let's get a sense of what this looks like in action. First, let's print out the very first KB triple in our KB. We actually looked at this last time: it's a KB triple that says that the contains relation holds between Brickfields and Kuala Lumpur central railway station.

[00:10:16] And now let's look up the first example containing these two entities. I'm just going to look them up in the forward direction, subject and object. I get all the examples, I look at the first one, and let me just print out the middle. The middle says "it was just a quick 10-minute walk to", so I guess the full example probably said something like "from Brickfields it was just a quick 10-minute walk to Kuala Lumpur central railway station", and maybe there was more.
[00:10:47] Now let's run our featurizer on this KB triple and see what features we get. We get a counter that contains "it was just a quick 10 minute walk to the". So it looks like it's counted up the words in that middle, which is just what we expected.

[00:11:06] But if you look closely, there's something unexpected here, because the word "to" has a count of 2 even though it appears only once in that middle, and also the word "the" has a count of 1 even though it didn't appear in that middle at all. So where did those come from? Well, remember that the featurizer counts words from the middles of all examples containing those entities, in either direction. And it turns out that the corpus contains just one other example containing those two entities, and that other example has middle "to the". So that's where these counts come from. All is well; it did the right thing.
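The lookup just described can be reproduced with a small, self-contained sketch. Everything here (the `ToyCorpus` class, its `get_examples_for_entities` method, and the entity names) is a hypothetical simplification for illustration, not the course's actual rel_ext API:

```python
from collections import Counter, namedtuple

Example = namedtuple("Example", ["middle"])
KBTriple = namedtuple("KBTriple", ["rel", "sbj", "obj"])

class ToyCorpus:
    """Minimal stand-in for the course corpus (hypothetical API)."""
    def __init__(self, index):
        self.index = index  # maps (entity1, entity2) -> list of Examples
    def get_examples_for_entities(self, e1, e2):
        return self.index.get((e1, e2), [])

def simple_bag_of_words_featurizer(kbt, corpus, feature_counter):
    # Count middle words from examples in BOTH directions:
    # forward (sbj, obj) and reverse (obj, sbj) get lumped together.
    for ex in corpus.get_examples_for_entities(kbt.sbj, kbt.obj):
        for word in ex.middle.split():
            feature_counter[word] += 1
    for ex in corpus.get_examples_for_entities(kbt.obj, kbt.sbj):
        for word in ex.middle.split():
            feature_counter[word] += 1
    return feature_counter

# The two toy middles mirror the lecture's example: one forward
# example and one reverse example for the same entity pair.
corpus = ToyCorpus({
    ("Brickfields", "KL_Sentral"):
        [Example("it was just a quick 10-minute walk to")],
    ("KL_Sentral", "Brickfields"):
        [Example("to the")],
})
kbt = KBTriple("contains", "Brickfields", "KL_Sentral")
feats = simple_bag_of_words_featurizer(kbt, corpus, Counter())
# "to" is counted twice (once per direction); "the" comes only
# from the reverse example's middle.
```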
[00:11:55] Okay, we have our simple bag-of-words featurizer. Now we need a way to train models, to make predictions, and to evaluate the results. The rel_ext module contains functions for each of those, and I just want to give you a quick tour of what those functions are. But you'll definitely want to go read the code for this so that you're more familiar with how it can be used, and a lot of this code appears in a file called rel_ext.py.
[00:12:28] underscore x dot pi so we'll start with a function called
[00:12:30] so we'll start with a function called train models
[00:12:32] train models this takes as arguments the dictionary
[00:12:35] this takes as arguments the dictionary of data splits a list of featurizers and
[00:12:38] of data splits a list of featurizers and here we have a list consisting of just
[00:12:40] here we have a list consisting of just our simple bag of words featurizer
[00:12:42] our simple bag of words featurizer the name of the split on which to train
[00:12:45] the name of the split on which to train which defaults to train
[00:12:47] which defaults to train and a model factory which is a function
[00:12:51] and a model factory which is a function that returns a
[00:12:57] a classifier
[00:12:59] a classifier and it's
[00:13:02] and it's sorry a function which initializes an sk
[00:13:04] sorry a function which initializes an sk classifier and by default it's a
[00:13:06] classifier and by default it's a logistic regression classifier as shown
[00:13:09] logistic regression classifier as shown here but you could easily substitute
[00:13:11] here but you could easily substitute this with some other sksk learn
[00:13:14] this with some other sksk learn classifier
[00:13:16] classifier it returns
[00:13:17] it returns this thing called train result which is
[00:13:20] this thing called train result which is a dictionary holding the featurizers the
[00:13:23] a dictionary holding the featurizers the vectorizer that was used to generate the
[00:13:25] vectorizer that was used to generate the training matrix and most importantly a
[00:13:28] training matrix and most importantly a dictionary holding the trained models
[00:13:31] dictionary holding the trained models one per relation so it's a dictionary
[00:13:33] one per relation so it's a dictionary which maps from relation names to models
[00:13:38] which maps from relation names to models so that's train models
[00:13:40] Next comes predict. This is a function that takes as arguments a dictionary of data splits; the output of train_models, that train_result thing; and the name of the split on which to make predictions, which by default is 'dev'. It returns two parallel dictionaries: one holds the predictions grouped by relation, and the other holds the true labels grouped by relation.
[00:14:09] And our third building block is evaluate_predictions. This is a function that takes as arguments the two parallel dictionaries of predictions and true labels produced by predict, and it prints evaluation metrics for each relation, like we saw earlier.
[00:14:28] each relation like we saw earlier now before we dwell on these results i
[00:14:30] now before we dwell on these results i want to show one more function
[00:14:33] want to show one more function which is a function called experiment
[00:14:37] which is a function called experiment and experiment simply chains together
[00:14:40] and experiment simply chains together the three functions that i just showed
[00:14:41] the three functions that i just showed you it changed together training
[00:14:43] you it changed together training prediction and evaluation
[00:14:46] prediction and evaluation so that's very convenient for running
[00:14:48] so that's very convenient for running end-to-end experiments
[00:14:50] end-to-end experiments i haven't shown all the parameters here
[00:14:52] i haven't shown all the parameters here but if you go look at the source code
[00:14:54] but if you go look at the source code you'll see that it actually takes a lot
[00:14:55] you'll see that it actually takes a lot of optional parameters and those
[00:14:57] of optional parameters and those parameters let you specify
[00:15:00] parameters let you specify everything about how to run the
[00:15:02] everything about how to run the experiment let you specify your
[00:15:03] experiment let you specify your featurizers your model factory which
[00:15:06] featurizers your model factory which splits to train and test on and more so
[00:15:10] splits to train and test on and more so for example earlier i mentioned that the
[00:15:12] for example earlier i mentioned that the tiny split is really useful for running
[00:15:15] tiny split is really useful for running fast experiments to work out the kinks
[00:15:18] fast experiments to work out the kinks if you wanted to do that it's very easy
[00:15:20] if you wanted to do that it's very easy using the experiment function just to
[00:15:23] using the experiment function just to set the training split and the test
[00:15:25] set the training split and the test split to tiny to run a very quick
[00:15:27] split to tiny to run a very quick experiment
[00:15:31] Now here are the results of evaluating our simple bag-of-words logistic regression classifier, and let's take a closer look, because this is quite stunning. Even though this is just about the simplest possible classifier, we've achieved huge gains over the phrase-matching approach. The first thing that jumps out is that our macro-average F-score has jumped from 11.1 to 56.7, and we see big gains in precision for almost every single relation, and correspondingly big gains in F-score.

[00:16:11] On the other hand, there's still plenty of room for improvement. I mean, this is much more impressive than where we were before, but we're very far from perfection. There's abundant headroom and opportunity to continue to improve.
Lecture 048
Directions to Explore | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=OZ1inhh7AgA
---
Transcript
[00:00:04] Okay, we're underway. We have a simple model with reasonable performance. Where do we go from here? Well, to make further gains, we need to stop treating the model as a black box. We need to open it up and get visibility into what it's learned and, more importantly, where it still falls down. And then we can begin to look at some ideas for how to improve it.
[00:00:31] One important way to gain understanding of our trained models is to inspect the model weights: which features are strong positive indicators for each relation, and which features are strong negative indicators? The rel_ext module contains a function called examine_model_weights that makes it easy to inspect. So here I show results for just four of our 16 relations, and in general I think the features with large positive weight are pretty intuitive. So for the author relation, the biggest weights are 'author', 'books', and 'by'. For film_performance we have 'starring', 'alongside', and 'opposite'.
[00:01:14] alongside and opposite by the way i was a little bit puzzled
[00:01:15] by the way i was a little bit puzzled when i first saw alongside and opposite
[00:01:17] when i first saw alongside and opposite because i thought
[00:01:19] because i thought that
[00:01:20] that those are words that would naturally
[00:01:21] those are words that would naturally appear between the names of two actors
[00:01:24] appear between the names of two actors not between the name of a film and the
[00:01:26] not between the name of a film and the name of an actor
[00:01:28] name of an actor um
[00:01:29] um what i did was i wrote a little bit of
[00:01:30] what i did was i wrote a little bit of code to
[00:01:32] code to pull up the actual examples that
[00:01:34] pull up the actual examples that caused these weights to wind up being
[00:01:36] caused these weights to wind up being large
[00:01:37] large and what i realized was there's a very
[00:01:39] and what i realized was there's a very common pattern which is um
[00:01:43] common pattern which is um like x appeared in y alongside z
[00:01:48] like x appeared in y alongside z so x and z are actors y is a film x
[00:01:51] so x and z are actors y is a film x appeared in y
[00:01:53] appeared in y alongside z
[00:01:55] alongside z so you have y alongside z that indicates
[00:01:58] so you have y alongside z that indicates that z is an actor that appeared in film
[00:02:01] that z is an actor that appeared in film y
[00:02:03] y uh and i think something similar happens
[00:02:04] uh and i think something similar happens for opposite so it um it does make sense
[00:02:08] for opposite so it um it does make sense that these are uh strong indicators of
[00:02:11] that these are uh strong indicators of the film performance relation
[00:02:14] the film performance relation uh for has spouse we have wife married
[00:02:17] uh for has spouse we have wife married and husband i think this makes perfect
[00:02:19] and husband i think this makes perfect sense
[00:02:21] sense the one that's a bit surprising is
[00:02:22] the one that's a bit surprising is adjoins
[00:02:23] adjoins so for joints we have cordoba talux and
[00:02:26] so for joints we have cordoba talux and valet
[00:02:28] valet it's odd to see specific place names
[00:02:30] it's odd to see specific place names here
[00:02:31] here they certainly don't seem to express the
[00:02:33] they certainly don't seem to express the adjoins
[00:02:35] adjoins um relation
[00:02:35] um relation wonder if anyone has a guess what's
[00:02:37] wonder if anyone has a guess what's going on i was really puzzled by this
[00:02:40] going on i was really puzzled by this and
[00:02:41] and so again i wrote a bit of code to find
[00:02:45] so again i wrote a bit of code to find the specific examples that contributed
[00:02:48] the specific examples that contributed to this result i looked for examples
[00:02:51] to this result i looked for examples where the two edge dimensions stand in
[00:02:53] where the two edge dimensions stand in the a join's relation and these terms
[00:02:56] the a join's relation and these terms these specific terms appear in the
[00:02:58] these specific terms appear in the middle
[00:02:59] middle and when i looked at the examples i
[00:03:01] and when i looked at the examples i realized that what's going on is that
[00:03:04] realized that what's going on is that it's very common to have
[00:03:06] it's very common to have lists of geographic locations
[00:03:08] lists of geographic locations so a comma b comma c comma d
[00:03:12] so a comma b comma c comma d and in such lists it's not uncommon that
[00:03:16] and in such lists it's not uncommon that just by chance a and c or a and d
[00:03:19] just by chance a and c or a and d stand in the a joins relation maybe it's
[00:03:21] stand in the a joins relation maybe it's a list of provinces in a country
[00:03:25] a list of provinces in a country and of course some of those provinces
[00:03:27] and of course some of those provinces are adjacent to each other so if a
[00:03:30] are adjacent to each other so if a adjoins c or a adjoins d
[00:03:34] adjoins c or a adjoins d that will tend to make b b
[00:03:37] that will tend to make b b appear
[00:03:38] appear as a positive indicator for the a joins
[00:03:41] as a positive indicator for the a joins relation and especially if the corpus
[00:03:43] relation and especially if the corpus just happens to contain several such
[00:03:45] just happens to contain several such examples
[00:03:47] examples so i think that's what contributed
[00:03:49] so i think that's what contributed to
[00:03:50] to this puzzling result
[00:03:54] The features with large negative weights look a bit more haphazard, but I think that's not surprising; it's kind of what you expect for this kind of linear model.

[00:04:10] By the way, you can fiddle with the code that prints this out. Here it just prints the top three, but there's actually a parameter that tells it how many of the top of the list to print, and so you can print much longer lists. And for many of the relations, the top 20, even the top 50, features all look very plausible and intuitive, and it's quite satisfying to see those results come out.
[00:04:44] Another way to gain insight into our model is to use it to discover new relation instances that don't currently appear in the KB. In fact, as we discussed last time, this is the whole point of building a relation extraction system: to augment a KB with knowledge extracted from natural language text at scale. So the decisive question is: can our model do this effectively?

[00:05:09] We can't really evaluate this capability automatically, because we have no other source of ground truth than the KB itself. But we can evaluate it manually, by examining KB triples that aren't in the KB but which our model really, really thinks should be in the KB. So we wrote a function to do this, called find_new_relation_instances, and you can go look at the code.

[00:05:38] Here's how it works. It starts from corpus examples containing pairs of entities that don't belong to any relation in the KB; these are what we described last time as negative examples. We'll consider each such pair of entities as a candidate to join each relation, so we'll take the cross product of all of those entity pairs and relations. We'll apply our model to all of those candidate KB triples, and we'll just sort the results by the probability assigned by the model, in order to find the most likely new instances of each relation. So we'll find the candidate KB triples that aren't currently in the KB, but which the model believes have a really high probability of being valid.
[00:06:33] valid let's see what we get when we run it um
[00:06:36] let's see what we get when we run it um here are the results for the adjoins
[00:06:38] here are the results for the adjoins relation
[00:06:39] relation notice that the model assigned a
[00:06:41] notice that the model assigned a probability of 1.0 to each of these
[00:06:45] probability of 1.0 to each of these pairs it is totally convinced that these
[00:06:48] pairs it is totally convinced that these pairs belong to the adjoins relation
[00:06:51] pairs belong to the adjoins relation um but the results are
[00:06:55] um but the results are well let's be honest the results are
[00:06:56] well let's be honest the results are terrible
[00:06:58] terrible uh almost all of these pairs belong to
[00:07:01] uh almost all of these pairs belong to the contains relation which by the way
[00:07:04] the contains relation which by the way isn't actually one of our 16 relations
[00:07:06] isn't actually one of our 16 relations but um
[00:07:08] but um intuitively they should belong to a
[00:07:09] intuitively they should belong to a contains relation
[00:07:11] contains relation not the a joins relation you could make
[00:07:14] not the a joins relation you could make a case maybe for mexico and atlantic
[00:07:17] a case maybe for mexico and atlantic ocean
[00:07:18] ocean uh belonging to the joints relation but
[00:07:20] uh belonging to the joints relation but um
[00:07:21] um i mean to be honest even that one is a
[00:07:23] i mean to be honest even that one is a stretch
[00:07:26] One other thing worth noting: whenever the model predicts that X adjoins Y, it also predicts that Y adjoins X. You might for a moment think that this shows that the model has understood that adjoins is a symmetric relation. Unfortunately, no, that's not what's going on. It's just an artifact of how we wrote the simple bag-of-words featurizer: that featurizer makes no distinction between forward and reverse examples, so it has no idea which one comes first and which one comes second. And that will be true for asymmetric relations just like for symmetric relations.
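The artifact is easy to demonstrate in isolation: a featurizer that pools middles from both directions necessarily produces identical features for (x, y) and (y, x). The toy middles below are made up:

```python
from collections import Counter

# Toy middles for one entity pair, one string per direction
# (hypothetical data for illustration).
MIDDLES = {("x", "y"): "adjoins", ("y", "x"): "borders on"}

def bidirectional_bow(sbj, obj):
    # Lump middles from BOTH directions into one counter, as the
    # simple bag-of-words featurizer does.  By construction, the
    # result for (sbj, obj) equals the result for (obj, sbj), so a
    # model built on these features cannot tell forward from reverse.
    feats = Counter()
    feats.update(MIDDLES.get((sbj, obj), "").split())
    feats.update(MIDDLES.get((obj, sbj), "").split())
    return feats

forward = bidirectional_bow("x", "y")
reverse = bidirectional_bow("y", "x")
# forward == reverse: identical feature counters either way.
```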
[00:08:08] So this is not a very promising start. I mean, we saw a pretty good quantitative evaluation for this model, so this is a little bit surprising.
[00:08:21] so this is a little bit surprising let's see what we get for
[00:08:23] let's see what we get for some other relations so here are the
[00:08:25] some other relations so here are the results for the author relation
[00:08:28] results for the author relation and these look a lot better
[00:08:30] and these look a lot better once again all of the probabilities are
[00:08:33] once again all of the probabilities are one
[00:08:34] one but this time
[00:08:35] but this time every single one of these predictions is
[00:08:38] every single one of these predictions is correct
[00:08:40] correct well not quite actually because
[00:08:42] well not quite actually because the book is supposed to appear first
[00:08:45] the book is supposed to appear first like oliver twist and the author second
[00:08:47] like oliver twist and the author second charles dickens so this first one
[00:08:50] charles dickens so this first one actually is correct
[00:08:51] actually is correct the second one is backwards it has the
[00:08:53] the second one is backwards it has the author first and the book second
[00:08:57] author first and the book second our model is completely ignorant of
[00:08:59] our model is completely ignorant of order so it's just as likely to put
[00:09:01] order so it's just as likely to put things in reverse
[00:09:04] things in reverse but if you ignore that
[00:09:06] but if you ignore that if you're willing to imagine that we
[00:09:09] if you're willing to imagine that we could easily fix that then the results
[00:09:12] could easily fix that then the results look great
[00:09:14] look great we could put all of these triples right
[00:09:16] we could put all of these triples right into our kb and we'd have a bigger and
[00:09:18] into our kb and we'd have a bigger and better kb because of it
[00:09:21] better kb because of it this is relation extraction at its
[00:09:24] this is relation extraction at its finest
[00:09:25] finest this is what we wanted
[00:09:28] Here are the results for the capital relation, and it's a similar picture. All of the probabilities are 1.0, and the ordering is frequently reversed; it's very haphazard. But if you put that aside, the results look very good. You could quibble perhaps with Delhi here: I mean, the capital of India is really New Delhi. But New Delhi is part of Delhi, so, you know, it's close. Still, overall, I think this looks really good.
[00:10:03] Let me show you one more; this is the last one I'll show. These are results for the worked_at relation, and here the results are more mixed. So we have Stan Lee and Marvel Comics: sure, if you can say that Elon Musk worked at Tesla Motors, then you can say that Stan Lee worked at Marvel Comics. And while we're at it, Genghis Khan worked at the Mongol Empire, sure, why not. But the rest are nonsense.
[00:10:36] So why? What happened here? Well, when you encounter surprising and mysterious results in your model output, it's really good practice to go dig into the data and investigate. This is called error analysis, and I want to show you a couple of examples of that now.
[00:10:56] that now so first let's see if we can figure out
[00:10:58] so first let's see if we can figure out what happened with louis chevrolet and
[00:11:01] what happened with louis chevrolet and william c durant
[00:11:03] william c durant um first let's look up the corpus
[00:11:06] um first let's look up the corpus examples containing these two entities
[00:11:10] examples containing these two entities i'm only going to look up the examples
[00:11:12] i'm only going to look up the examples that have them in this order
[00:11:14] that have them in this order um i should look them up in the other
[00:11:16] um i should look them up in the other order as well and as a matter of fact i
[00:11:18] order as well and as a matter of fact i did i'm just not going to put that on
[00:11:20] did i'm just not going to put that on the slide i'm just going to focus on
[00:11:22] the slide i'm just going to focus on what happens in this order
[00:11:25] what happens in this order so i'm going to look up these examples
[00:11:27] so i'm going to look up these examples and print out what they look like and
[00:11:29] and print out what they look like and here's what we get
[00:11:32] here's what we get there are 12 examples and they all look
[00:11:33] there are 12 examples and they all look identical
[00:11:35] identical actually i didn't print the full context
[00:11:37] actually i didn't print the full context here if you look at the code closely
[00:11:38] here if you look at the code closely you'll see that i'm printing the suffix
[00:11:40] you'll see that i'm printing the suffix of the left and the prefix of the right
[00:11:43] of the left and the prefix of the right so there's more context further out on
[00:11:45] so there's more context further out on left and right and if you did see the
[00:11:47] left and right and if you did see the full context you would realize that the
[00:11:50] full context you would realize that the examples do differ slightly
[00:11:52] examples do differ slightly but they're very very similar they're
[00:11:54] but they're very very similar they're near duplicates i mentioned this last
[00:11:56] near duplicates i mentioned this last time that this is one of the warts of
[00:11:57] time that this is one of the warts of this data set it contains a lot of
[00:11:59] this data set it contains a lot of near-duplicate examples and i think this
[00:12:02] near-duplicate examples and i think this is an unfortunate consequence of the way
[00:12:05] is an unfortunate consequence of the way the
[00:12:06] the sample was constructed the way the web
[00:12:08] sample was constructed the way the web documents that this corpus was based on
[00:12:10] documents that this corpus was based on were sampled from the web
[00:12:13] were sampled from the web and it seems like that's bitten us here
[00:12:17] and it seems like that's bitten us here but it still leaves a question why
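The near-duplicate check described here can be sketched in plain Python. The `(left, middle, right)` tuples below are a toy stand-in for the corpus examples in the lecture, not the actual data structure used in the course code:

```python
from collections import Counter

def near_duplicate_report(examples):
    """Group examples by their normalized middle text to expose
    near-duplicates. `examples` is a list of (left, middle, right)
    context tuples -- a simplified stand-in for real corpus examples."""
    def normalize(text):
        # lowercase and collapse whitespace so trivial variants collide
        return " ".join(text.lower().split())
    counts = Counter(normalize(mid) for _, mid, _ in examples)
    # middles that occur more than once are candidate near-duplicates
    return {mid: n for mid, n in counts.items() if n > 1}

# toy illustration: three of four examples share essentially one middle
examples = [
    ("... racer", "is a co-founder, along with", "of the ..."),
    ("... car maker", "is a  co-founder, along with", "of ..."),
    ("... driver", "is a co-founder along with", "of ..."),  # differs slightly
    ("...", "drove for", "in 1910 ..."),
]
print(near_duplicate_report(examples))
```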
[00:12:20] but it still leaves a question
[00:12:22] why did that repetition lead the model
[00:12:23] to predict
[00:12:26] that this pair belongs to the worked_at
[00:12:28] relation because
[00:12:30] it doesn't look obvious that that's the
[00:12:31] right relation here
[00:12:34] i suspect that it's because of the word
[00:12:35] founder
[00:12:39] because x being a founder of y
[00:12:43] strongly implies that x worked at y
[00:12:45] and actually we can check this it's not
[00:12:49] that hard to write some code to inspect
[00:12:52] the weight that was assigned to the word
[00:12:53] founder
[00:12:57] in the model for the worked_at relation
[00:12:58] so here's a little bit of code that does
[00:13:00] that
[00:13:02] and sure enough in the model for
[00:13:05] worked_at the word founder gets a weight
[00:13:07] of 2.05
[00:13:09] which is pretty large if you look at the
[00:13:12] distribution of weights it's a
[00:13:14] relatively large one i forget exactly
[00:13:16] but i think it's in the top 10
[00:13:19] it's a quite significant feature for this model
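A weight lookup like the one described can be sketched as follows. The vocabulary and coefficient values are toy stand-ins for a fitted linear model, not the actual worked_at classifier from the lecture:

```python
def feature_weight_and_rank(vocab, weights, feature):
    """Return the weight assigned to `feature` and its rank among all
    weights (1 = largest) in a fitted linear model.
    `vocab` maps feature names to coefficient indices; `weights` is the
    coefficient vector for one relation, e.g. worked_at."""
    w = weights[vocab[feature]]
    rank = 1 + sum(1 for v in weights if v > w)
    return w, rank

# toy stand-ins for a trained worked_at classifier
vocab = {"founder": 0, "employee": 1, "'s": 2, "the": 3}
weights = [2.05, 1.10, 0.58, -0.20]

w, rank = feature_weight_and_rank(vocab, weights, "founder")
print(w, rank)  # 2.05, rank 1 in this toy model
```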
[00:13:24] so that's what happened we've got 12
[00:13:27] examples each of them is contributing a
[00:13:29] sizable weight and the result is that
[00:13:32] the model is completely convinced
[00:13:36] that the worked_at relation holds here
[00:13:40] by the way i didn't check but i'm
[00:13:42] confident
[00:13:46] that the model for the
[00:13:49] founders relation will also predict
[00:13:51] that the founder relation holds here an
[00:13:53] understanding of
[00:13:54] what went wrong here
[00:13:57] could help to stimulate some ideas for
[00:13:58] how to fix it
[00:14:00] i have some ideas but i think i won't
[00:14:02] give them away
[00:14:05] i hope this underscores the value of
[00:14:08] error analysis when you see weird results
[00:14:12] you really want to understand what's
[00:14:13] going on in your data that led to these
[00:14:16] weird results
[00:14:19] let me show you one more example that
[00:14:21] has a bit of a different flavor let's
[00:14:22] look at what's going on with homer and
[00:14:24] the iliad
[00:14:26] i wrote a little bit of code to
[00:14:27] investigate this one too
[00:14:29] and i'm not gonna show the whole
[00:14:30] investigation
[00:14:32] but i'm just gonna cherry-pick the most
[00:14:34] informative results
[00:14:36] so one thing that i notice is that there
[00:14:38] are a lot of examples for homer and
[00:14:41] iliad in fact there are 118 of them
[00:14:46] just in that direction there's more in
[00:14:48] the reverse direction
[00:14:50] that's impressive
[00:14:52] but again by itself it doesn't explain
[00:14:55] why worked_at
[00:14:57] looked like a good prediction by the way
[00:15:00] i did check to see if it was the same
[00:15:02] explanation as last time a lot of near
[00:15:04] duplicates it's not that's not what's
[00:15:06] going on here
[00:15:09] the next thing i did was to write some
[00:15:12] code to count up the most common middles
[00:15:15] that join homer and iliad across these
[00:15:18] 118 examples
[00:15:20] and so that code looks like this
[00:15:24] and here are the results and there
[00:15:26] was one middle that strongly dominated
[00:15:29] and it's apostrophe s
[00:15:31] so as in homer's
[00:15:33] iliad
[00:15:35] iliad um
[00:15:37] um so that makes sense because clearly the
[00:15:39] so that makes sense because clearly the possessive can indicate
[00:15:42] possessive can indicate the author relation you expect to see
[00:15:44] the author relation you expect to see homer's iliad and jane austen's pride
[00:15:47] homer's iliad and jane austen's pride and prejudice and many other similar
[00:15:50] and prejudice and many other similar uh similar formulations
[00:15:53] uh similar formulations but the apostrophe s can equally well
[00:15:56] but the apostrophe s can equally well indicate the work dat relation as in um
[00:16:00] indicate the work dat relation as in um tesla's elon musk or microsoft's bill
[00:16:02] tesla's elon musk or microsoft's bill gates
[00:16:04] gates um so this apostrophe s is really highly
[00:16:07] um so this apostrophe s is really highly ambiguous
[00:16:10] just to confirm that this
[00:16:13] just to confirm that this is actually
[00:16:15] is actually significant to the result that we saw
[00:16:18] significant to the result that we saw let's check
[00:16:19] let's check what weight was assigned to apostrophe s
[00:16:21] what weight was assigned to apostrophe s in the model for work dat
[00:16:23] in the model for work dat so this code is similar to the code on
[00:16:25] so this code is similar to the code on the previous slide but this time we're
[00:16:27] the previous slide but this time we're looking for the weight for apostrophe s
[00:16:30] looking for the weight for apostrophe s and it turns out that the weight was
[00:16:31] and it turns out that the weight was 0.58
[00:16:33] 0.58 okay it's not a huge weight but it's not
[00:16:36] okay it's not a huge weight but it's not small either and this feature occurred
[00:16:40] small either and this feature occurred 51 times across the corpus
[00:16:43] 51 times across the corpus so i think that's what happened we had
[00:16:46] so i think that's what happened we had a
[00:16:47] a you know
[00:16:48] you know non-trivial amount of weight
[00:16:51] non-trivial amount of weight um that got added up 51 times and we
[00:16:55] um that got added up 51 times and we wound up with a really big contribution
[00:16:57] wound up with a really big contribution and the model feeling really confident
[00:16:59] and the model feeling really confident about this relation
[00:17:02] about this relation so again thinking about this problem
[00:17:03] so again thinking about this problem might suggest some strategies for how to
[00:17:06] might suggest some strategies for how to reduce
[00:17:07] reduce that ambiguity the the fundamental
[00:17:09] that ambiguity the the fundamental problem here is that apostrophe s is
[00:17:11] problem here is that apostrophe s is highly ambiguous
[00:17:13] highly ambiguous um
[00:17:14] um a good question to ask yourself is is
[00:17:17] a good question to ask yourself is is there other information in the sentence
[00:17:20] there other information in the sentence that could help to distinguish
[00:17:22] that could help to distinguish the author relation
[00:17:24] the author relation from the worked at relation
[00:17:26] from the worked at relation and i think there is again i don't want
[00:17:28] and i think there is again i don't want to give too much away but i think
[00:17:30] to give too much away but i think there's other evidence in these
[00:17:33] there's other evidence in these sentences that could help to tease apart
[00:17:35] sentences that could help to tease apart these two relations and this kind of
[00:17:37] these two relations and this kind of error analysis is really indispensable
[00:17:40] error analysis is really indispensable to the model development
[00:17:44] process now for the homework and the
[00:17:47] process now for the homework and the bake off we're going to turn you loose
[00:17:49] bake off we're going to turn you loose to find ways to improve this baseline
[00:17:52] to find ways to improve this baseline model
[00:17:53] model and there are a lot of possibilities
[00:17:55] and there are a lot of possibilities one area for innovation is in the
[00:17:58] one area for innovation is in the feature representation we pass to the
[00:18:00] feature representation we pass to the learning algorithm
[00:18:02] learning algorithm so far we've just used a simple bag of
[00:18:04] so far we've just used a simple bag of words representation but you can imagine
[00:18:06] words representation but you can imagine lots of ways to enhance this so you
[00:18:08] lots of ways to enhance this so you could use
[00:18:10] could use word embeddings like the glove
[00:18:11] word embeddings like the glove embeddings
[00:18:12] embeddings you could use a bag of words
[00:18:14] you could use a bag of words representation that distinguishes
[00:18:15] representation that distinguishes between forward and reverse context
[00:18:19] between forward and reverse context you could use bi-grams or longer engrams
[00:18:23] you could use bi-grams or longer engrams you could leverage the part of speech
[00:18:25] you could leverage the part of speech tags that we have in the corpus
[00:18:27] tags that we have in the corpus or information from wordnet
[00:18:31] or information from wordnet much of the early work on relation
[00:18:32] much of the early work on relation extraction relied heavily on syntactic
[00:18:34] extraction relied heavily on syntactic features so maybe try that
[00:18:38] features so maybe try that and so far we've used features based
[00:18:40] and so far we've used features based only on the middle phrase the phrase
[00:18:42] only on the middle phrase the phrase between the two entity mentions you
[00:18:44] between the two entity mentions you could also try using information about
[00:18:46] could also try using information about the entity mentions themselves for
[00:18:48] the entity mentions themselves for example the entity types
[00:18:51] example the entity types or you could try deriving features from
[00:18:53] or you could try deriving features from the left and right context there are a
[00:18:55] the left and right context there are a lot of possibilities for
[00:18:58] lot of possibilities for richer feature representations
[00:19:02] there's also a lot of room for
[00:19:04] innovation with the model type
[00:19:06] our baseline model is a simple linear
[00:19:08] model optimized with logistic regression
[00:19:10] that's a good place to start but there
[00:19:12] are many other possibilities
[00:19:14] if you want to stick with linear models
[00:19:16] you could use an svm
[00:19:17] and sklearn makes that easy
[00:19:20] or you could experiment with neural
[00:19:22] networks you could use a simple
[00:19:24] feed-forward neural network as a drop-in
[00:19:26] replacement for our linear model
[00:19:29] or since examples can be of variable
[00:19:32] length you might consider a recurrent
[00:19:34] neural network like an lstm
[00:19:37] if you go this way you'll have to think
[00:19:38] carefully about how to encode the input
[00:19:41] if the input is just the middle phrase
[00:19:44] things are probably relatively
[00:19:46] straightforward
[00:19:48] but if you want to include the entity
[00:19:50] mentions or the left and right context
[00:19:53] you might need to think carefully about
[00:19:55] how to demarcate the segments
[00:19:59] or you could use a transformer-based
[00:20:00] architecture like bert although the
[00:20:03] quantity of training data that we have
[00:20:04] available here might be a bit small
[00:20:08] available here might be a bit small i think all of these are potentially
[00:20:10] i think all of these are potentially interesting and fruitful directions for
[00:20:13] interesting and fruitful directions for exploration and i think you can have a
[00:20:15] exploration and i think you can have a lot of fun with this
Lecture 049
Overview of Analysis Methods in NLP | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=rSO_vOynrEw
---
Transcript
[00:00:05] welcome everyone this is the first
[00:00:06] screencast in our series on analysis
[00:00:08] methods in nlp this is one of my
[00:00:09] favorite units in the course because
[00:00:11] it's directly oriented toward helping
[00:00:12] you do even better final projects
[00:00:15] now there's a lot we could discuss under
[00:00:17] the rubric of analysis methods in nlp
[00:00:19] i've chosen four things
[00:00:21] the first two fall under the heading of
[00:00:22] behavioral evaluations we'll talk about
[00:00:24] adversarial testing which is a very
[00:00:26] flexible way for you to expose that your
[00:00:28] system might have some weaknesses or
[00:00:31] fail to capture some linguistic
[00:00:33] phenomenon in a very systematic way
[00:00:35] and then at this point we also have the
[00:00:37] opportunity for a number of tasks to do
[00:00:39] adversarial training and testing these
[00:00:42] would be large data sets that are full
[00:00:44] of examples that we know are difficult
[00:00:46] for present-day architectures so for
[00:00:48] whatever architecture you're exploring
[00:00:50] this would be a chance to really stress
[00:00:52] test that architecture
[00:00:54] and then we're going to move beyond
[00:00:55] behavioral evaluations to talk about
[00:00:57] what i've called structural evaluation
[00:00:59] methods and these include probing and
[00:01:00] feature attribution
[00:01:02] these are techniques that you could use
[00:01:04] to peer inside your system and gain an
[00:01:06] understanding of what its hidden
[00:01:07] representations are like and how those
[00:01:09] representations are impacting the
[00:01:11] model's predictions
[00:01:14] the motivations for this are many here
[00:01:16] are just a few high-level ones that are
[00:01:18] kind of oriented toward projects the
[00:01:20] first is just that we might want to find
[00:01:22] the limits of the system that you're
[00:01:23] developing all our systems have
[00:01:25] limitations and finding them is always
[00:01:27] scientifically useful
[00:01:29] we might also want to understand
[00:01:31] your system's behavior better what are
[00:01:33] its internal representations like and
[00:01:35] how are they feeding into its final
[00:01:36] predictions and its overall behaviors
[00:01:38] that's also just incredibly rewarding
[00:01:40] and both of these things might feed into
[00:01:42] achieving more robust systems to
[00:01:44] the extent that we can find weaknesses
[00:01:46] and understand
[00:01:47] behaviors we can possibly take steps
[00:01:50] toward building even more robust systems
[00:01:53] and as i said all of this is oriented
[00:01:55] toward your final projects the
[00:01:56] techniques that we're discussing are
[00:01:58] powerful and easy ways to improve the
[00:02:00] analysis section of a paper analysis
[00:02:02] sections are important but it can be
[00:02:04] difficult to write them it feels very
[00:02:06] open-ended and often very unstructured
[00:02:08] people talk in general ways about doing
[00:02:10] error analysis and so forth but it can
[00:02:12] be hard to pinpoint exactly what would
[00:02:14] be productive i think the methods that
[00:02:16] we're talking about here are very
[00:02:18] generally applicable
[00:02:20] and can lead to really productive and
[00:02:21] rich analysis sections
[00:02:24] let's begin with adversarial testing
[00:02:25] this is a mode that we've talked about
[00:02:27] before the examples on this slide are
[00:02:28] from the now-classic paper glockner et al 2018
[00:02:30] called breaking nli and what
[00:02:33] they did is only really mildly
[00:02:35] adversarial it's just kind of a
[00:02:36] challenge and it exposes some lack of
[00:02:38] systematicity in certain nli models so
[00:02:41] here's what they did they began from
[00:02:43] snli examples like a little girl is
[00:02:45] kneeling in the dirt crying entails a
[00:02:48] little girl is very sad and they simply
[00:02:50] used lexical resources to change the
[00:02:52] hypothesis by one word so that it now
[00:02:54] reads a little girl is very unhappy
[00:02:57] we would expect a system that truly
[00:02:59] understood the reasoning involved in
[00:03:01] these examples to continue to predict
[00:03:03] entails in the second case because these
[00:03:05] examples are roughly synonymous but what
[00:03:07] they found is that systems would often
[00:03:09] start to predict contradiction possibly
[00:03:11] because of the negation that occurs here
[00:03:14] the second example is similar we begin
[00:03:16] from the snli example an elderly couple
[00:03:18] are sitting outside a restaurant
[00:03:19] enjoying wine entails a couple drinking
[00:03:22] wine and here they just changed wine to
[00:03:24] champagne what we would expect is that a
[00:03:27] system that knew about these lexical
[00:03:29] items and their relations would flip to
[00:03:31] predicting neutral in this case
[00:03:33] but as you might imagine systems
[00:03:35] continue to predict entails because they
[00:03:37] have only a very fuzzy understanding of
[00:03:39] how wine and champagne are related to
[00:03:42] each other
[00:03:44] each other here's the results table and recall this
[00:03:46] here's the results table and recall this is a 2018 paper and what they're mainly
[00:03:48] is a 2018 paper and what they're mainly testing here are models that we might
[00:03:50] testing here are models that we might regard as precursors to the transformers
[00:03:52] regard as precursors to the transformers that we've been so focused on and the
[00:03:54] that we've been so focused on and the picture is very clear these models do
[00:03:56] picture is very clear these models do well on the snli test set mid to high
[00:03:59] well on the snli test set mid to high 80s but their performance plummets on
[00:04:01] 80s but their performance plummets on this new adversarial test set
[00:04:03] this new adversarial test set there are two exceptions down here this
[00:04:05] there are two exceptions down here this wordnet baseline and the kim
[00:04:07] wordnet baseline and the kim architecture but it's important to note
[00:04:09] architecture but it's important to note that these models effectively had access
[00:04:11] that these models effectively had access directly in the case of wordnet and
[00:04:13] directly in the case of wordnet and indirectly in the case of kim to the
[00:04:15] indirectly in the case of kim to the lexical resource that was used to create
[00:04:17] lexical resource that was used to create the adversarial test
[00:04:19] the adversarial test and so they don't see such a large
[00:04:20] and so they don't see such a large performance drop here but even still all
[00:04:23] performance drop here but even still all of these numbers are kind of modest at
[00:04:25] of these numbers are kind of modest at this point
[00:04:26] this point and i told you that this was an
[00:04:28] and i told you that this was an interesting story here's the interesting
[00:04:29] interesting story here's the interesting twist at this point in 2021 you can
[00:04:32] twist at this point in 2021 you can simply download roberta mnli that's the
[00:04:35] simply download roberta mnli that's the roberta parameters fine-tuned on the
[00:04:37] roberta parameters fine-tuned on the multi-nli data set
[00:04:39] multi-nli data set and run this adversarial test and what
[00:04:41] and run this adversarial test and what you find is that that model does
[00:04:43] you find is that that model does astoundingly well on the breaking nli
[00:04:45] astoundingly well on the breaking nli data set i would focus on these two f1
[00:04:47] data set i would focus on these two f1 scores here for the two classes where we
[00:04:49] scores here for the two classes where we have a lot of support contradiction and
[00:04:51] have a lot of support contradiction and entailment the numbers are above 90 as
[00:04:54] entailment the numbers are above 90 as is the accuracy here which is directly
[00:04:55] is the accuracy here which is directly comparable to the numbers that glockner
[00:04:58] comparable to the numbers that glockner had all reported an amazing
[00:05:00] had all reported an amazing accomplishment recall that the original
[00:05:02] accomplishment recall that the original examples from the um adversarial test
[00:05:04] examples from the um adversarial test are from snli this is multi-nli it was
[00:05:07] are from snli this is multi-nli it was not developed specifically to solve this
[00:05:09] not developed specifically to solve this adversarial test and nonetheless it
[00:05:11] adversarial test and nonetheless it looks like roberta has systematic
[00:05:13] looks like roberta has systematic knowledge of the lexical relations
[00:05:15] knowledge of the lexical relations involved and required to solve this
[00:05:18] involved and required to solve this adversarial test so possibly a markov
[00:05:20] adversarial test so possibly a markov real progress
[00:05:23] real progress as i said you can also for selected
[00:05:25] as i said you can also for selected tasks move into the mode of doing
[00:05:26] tasks move into the mode of doing adversarial training and testing um here
[00:05:29] adversarial training and testing um here are the cases i know where the data set
[00:05:30] are the cases i know where the data set is large enough to support training and
[00:05:33] is large enough to support training and testing on examples that were created
[00:05:35] testing on examples that were created via some adversarial dynamic common
[00:05:38] via some adversarial dynamic common sense reasoning natural language
[00:05:39] sense reasoning natural language inference question answering sentiment
[00:05:41] inference question answering sentiment and hate speech and as i said this is a
[00:05:44] and hate speech and as i said this is a really exciting opportunity to see just
[00:05:46] really exciting opportunity to see just how robust your system is when exposed
[00:05:48] how robust your system is when exposed to examples that we know are difficult
[00:05:50] to examples that we know are difficult for modern architectures because that's
[00:05:52] for modern architectures because that's how these data sets were designed
[00:05:56] how these data sets were designed now let's move into the more behavioral
[00:05:57] now let's move into the more behavioral mode we'll start with probing of
[00:05:59] mode we'll start with probing of internal representations probes are
[00:06:01] internal representations probes are little supervised models typically that
[00:06:02] little supervised models typically that you fit on the internal representations
[00:06:05] you fit on the internal representations of your model of interest to sort of
[00:06:08] of your model of interest to sort of expose what those hidden representations
[00:06:10] expose what those hidden representations latently encode this is from a classic
[00:06:12] latently encode this is from a classic paper by ian tenney at all 2019
[00:06:15] paper by ian tenney at all 2019 and what we have along the x-axis is the
[00:06:17] and what we have along the x-axis is the burt layer starting from the embedding
[00:06:19] burt layer starting from the embedding layer and going to 24 this is burnt
[00:06:21] layer and going to 24 this is burnt large so there are 24 layers and the
[00:06:23] large so there are 24 layers and the picture is quite striking as you start
[00:06:25] picture is quite striking as you start from the top here and move down you can
[00:06:27] from the top here and move down you can see that as we move from more syntactic
[00:06:29] see that as we move from more syntactic things up into more discourse-y semantic
[00:06:31] things up into more discourse-y semantic content like co-ref and
[00:06:34] content like co-ref and relation extraction you find that the
[00:06:37] relation extraction you find that the higher layers of the burp model are
[00:06:39] higher layers of the burp model are encoding that information latently
[00:06:40] encoding that information latently that's what these probing results reveal
[00:06:43] that's what these probing results reveal in this picture quite striking look at
[00:06:46] in this picture quite striking look at what the pre-training process in this
[00:06:48] what the pre-training process in this case of bert is learning latently about
[00:06:51] case of bert is learning latently about the structures of language
[00:06:55] the structures of language and then we'll finally talk about
[00:06:56] and then we'll finally talk about feature attribution which is one step
[00:06:58] feature attribution which is one step further in this more introspective mode
[00:07:00] further in this more introspective mode because here as you'll see i think we
[00:07:02] because here as you'll see i think we can get a really deep picture at how
[00:07:05] can get a really deep picture at how individual features and representations
[00:07:07] individual features and representations are directly related to the model's
[00:07:10] are directly related to the model's predictions what i've done here is use
[00:07:12] predictions what i've done here is use the integrated gradients model which is
[00:07:13] the integrated gradients model which is the model that we'll focus on i ran it
[00:07:16] the model that we'll focus on i ran it on a sentiment model and you can see
[00:07:17] on a sentiment model and you can see here we have the true label the
[00:07:19] here we have the true label the predicted label with the probability and
[00:07:21] predicted label with the probability and then we have word level importances as
[00:07:23] then we have word level importances as measured by integrated gradients where
[00:07:25] measured by integrated gradients where blue means it's a bias toward positive
[00:07:28] blue means it's a bias toward positive predictions and red means it's a bias
[00:07:30] predictions and red means it's a bias toward negative predictions
[00:07:32] toward negative predictions uh and i've picked an example that i
[00:07:33] uh and i've picked an example that i think kind of stress tests the model
[00:07:35] think kind of stress tests the model it's a little bit adversarial because
[00:07:37] it's a little bit adversarial because it's all these examples involve mean
[00:07:40] it's all these examples involve mean in the sense of good as in a mean apple
[00:07:42] in the sense of good as in a mean apple pie meaning a delicious or good one
[00:07:45] pie meaning a delicious or good one and you can see that by and large this
[00:07:46] and you can see that by and large this model's predictions are pretty
[00:07:48] model's predictions are pretty systematic it's mostly predicting
[00:07:49] systematic it's mostly predicting positive for variants like they sell
[00:07:52] positive for variants like they sell they make he makes although this last
[00:07:54] they make he makes although this last one he sells might worry us a little bit
[00:07:56] one he sells might worry us a little bit because it has flipped to negative
[00:07:58] because it has flipped to negative despite the changes to the example being
[00:08:00] despite the changes to the example being truly incidental
[00:08:02] truly incidental and this might point to a way in which
[00:08:04] and this might point to a way in which the model does or doesn't have knowledge
[00:08:06] the model does or doesn't have knowledge of how the individual components of
[00:08:08] of how the individual components of these examples should be predict should
[00:08:10] these examples should be predict should be feeding into the final predictions
[00:08:12] be feeding into the final predictions that the model makes i think that's a
[00:08:14] that the model makes i think that's a wonderful opportunity to get a sense for
[00:08:16] wonderful opportunity to get a sense for how robust the model is actually going
[00:08:17] how robust the model is actually going to be to variations like the one that
[00:08:19] to be to variations like the one that you see here
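Integrated gradients can be sketched directly from its definition: attributions are the input-minus-baseline difference scaled by the average gradient along the straight path from baseline to input. The toy logistic "sentiment" model and weights below are hypothetical stand-ins for a real classifier (in practice you would run a library implementation, e.g. Captum, on a neural model).

```python
# Integrated-gradients sketch on a toy differentiable model with
# analytic gradients, so the math stays visible.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.5, 0.5])   # hypothetical word-level weights

def model(x):
    return sigmoid(w @ x)

def grad(x):
    p = model(x)
    return p * (1 - p) * w       # derivative of sigmoid(w.x) w.r.t. x

def integrated_gradients(x, baseline, steps=100):
    # Average the gradient along the straight path baseline -> x,
    # then scale by the input difference (midpoint Riemann sum).
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([grad(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

x = np.array([1.0, 1.0, 1.0])    # e.g., word-presence features
baseline = np.zeros(3)
attr = integrated_gradients(x, baseline)
# Completeness axiom: attributions sum to f(x) - f(baseline).
print(attr, attr.sum(), model(x) - model(baseline))
```

The completeness check at the end is what makes integrated gradients attractive: the per-feature scores account exactly for the change in the model's output relative to the baseline.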
Lecture 050
Adversarial Testing | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=BilI8LkiAsU
---
Transcript
[00:00:04] Welcome everyone. This is part two in our series on analysis methods in NLP. We're going to be talking about adversarial testing. This is an exciting mode, because as you'll see, with a few dozen carefully created examples you could learn something really interesting about the systems that you're developing.
[00:00:20] To start, let's remind ourselves of how evaluations standardly work in our field. At step one, you create a dataset from some single homogeneous process. It could be that you scrape data from the web, or crowdsource a dataset, or label examples yourself, but the important thing is that we typically do this as one single process. Then, in step two, we divide that dataset into disjoint train and test sets, and we set the test set aside. You do all your development on the train set, and only after all development is complete do you do an evaluation of your system, usually based on accuracy or some similar metric, on that held-out test set. And then finally, and this is the important part, you report the results of that test-set evaluation as providing an estimate of the system's capacity to generalize to new experiences.
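The three-step standard protocol just described, sketched end to end; the dataset and model here are synthetic placeholders:

```python
# The standard evaluation protocol: one dataset, one disjoint split,
# development on train only, one final held-out evaluation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: one dataset from a single homogeneous process.
X, y = make_classification(n_samples=400, random_state=0)
# Step 2: disjoint train/test sets; the test set is set aside.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
# Development uses only the train set.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# Step 3: a single held-out evaluation, reported as the estimate of the
# system's capacity to generalize.
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```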
[00:01:09] I hope you can hear in that that we're being awfully generous to our systems. At step one, we create a single dataset from a single process. We hold out the test set, and we use that test set as our device for measuring how well the system is going to do if deployed out in the real world. But of course we know that the real world is not created from some single homogeneous process. We know in our heart of hearts that if we deploy the system, it will encounter examples that are entirely unlike those that it saw in training and assessment in this standard mode, and that might worry us: that we're overstating the capacity of our systems to actually deal with the complexity of the world.
[00:01:48] And adversarial evaluations are one way that we can begin to close this gap. So in adversarial evaluations, we create a dataset by whatever means you like at step one, and you develop and assess the system using that dataset according to whatever protocols you choose; it could be the standard evaluation mode if you like. Here's the new part: at step three, you develop a new test set of examples that you suspect or know will be challenging given your system and the original dataset that it was trained on. And then, as usual, after all your system development is complete, you evaluate the system, usually based on accuracy, again on that new test dataset. And that's the result that you report as your system's capacity to generalize to new experiences, at least of the sort that you carved out in your adversarial test set. The idea here is that in having a distinction between the data that we develop our system on and the challenge problems that we pose, we'll get a better estimate of how our systems perform on examples that are presumably important, examples that it's likely to encounter if deployed in the real world.
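The adversarial variant differs only at the end: after standard development, you additionally score a challenge set built by a different process. A minimal sketch, with a synthetic distribution shift standing in for hand-constructed hard examples:

```python
# Adversarial-evaluation protocol sketch: develop as usual, then report
# accuracy on a challenge set drawn from a shifted process.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Steps 1-2: original dataset and standard development.
X = rng.randn(400, 5)
y = (X[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Step 3: a challenge set -- the same labeling rule, but inputs shifted
# outside the training distribution.
X_ch = rng.randn(100, 5) + np.array([0.0, 3.0, 3.0, 3.0, 3.0])
y_ch = (X_ch[:, 0] > 0).astype(int)

# Step 4: report both numbers; the gap between them is the story.
print("original test accuracy:", clf.score(X_te, y_te))
print("challenge accuracy:   ", clf.score(X_ch, y_ch))
```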
[00:02:59] A brief historical note: I think it's fair to say that the idea of adversarial testing traces all the way back to the Turing test, which was introduced by Turing in his classic paper "Computing Machinery and Intelligence". The Turing test has an inherently adversarial flavor to it, because of course a computer is trying to trick a human into thinking that the computer is human. We also hear echoes of adversarial testing in the classic book from Terry Winograd called "Understanding Natural Language". There are discussions in that book of the idea of constructing examples that we know will stress test our systems, by probing to see whether they have knowledge of what the world is like and the true complexity of language. And then I think Hector Levesque really elevated those Winograd ideas into a full-fledged testing mode in his paper "On Our Best Behavior".
[00:03:48] Let's look briefly at those Winograd sentences, or Winograd schemas, because they're kind of interesting. The idea is that they will key into whether a system has deep knowledge of the world and of language. So we start with an example like "The trophy doesn't fit into the brown suitcase because it's too small. What's too small?", and the human answer is the suitcase. There's a minimally contrasting example: "The trophy doesn't fit into the brown suitcase because it's too large. What's too large?" Here we answer the trophy, and presumably we do this because we can do a kind of mental simulation involving suitcases and trophies and figure out how to answer these questions on that basis. This next pair of examples is similar, but it keys more into kind of normative social roles: "The council refused the demonstrators a permit because they feared violence. Who feared violence?" Our standard answer, drawing on standard social roles, is the council. Again we have a minimally contrasting example: "The council refused the demonstrators a permit because they advocated violence. Who advocated violence?" Here again, drawing on default social roles, we're inclined to answer the demonstrators. The intuition, as I said, is that to resolve these questions, given how minimally different these examples are, you need to have a deep understanding of the questions and the context, and also a deep understanding of what our world is like.
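One way to see why these pairs are hard: represented as simple evaluation items (the data structure below is illustrative, not any real benchmark's format), any fixed positional "cheap trick" is guaranteed to get exactly half of each minimally contrasting pair right.

```python
# A Winograd-style schema pair as evaluation items, scored against a
# trivial "always pick the first-mentioned candidate" baseline.
schemas = [
    {"text": "The trophy doesn't fit into the brown suitcase because it's too small.",
     "question": "What's too small?",
     "candidates": ["the trophy", "the suitcase"],
     "answer": "the suitcase"},
    {"text": "The trophy doesn't fit into the brown suitcase because it's too large.",
     "question": "What's too large?",
     "candidates": ["the trophy", "the suitcase"],
     "answer": "the trophy"},
]

def first_candidate_baseline(item):
    # A cheap trick: always answer with the first-mentioned candidate.
    return item["candidates"][0]

correct = sum(first_candidate_baseline(s) == s["answer"] for s in schemas)
print(f"baseline: {correct}/{len(schemas)}")  # exactly half of the pair
```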
[00:05:07] And then, as I said, Levesque kind of continues this and begins to systematize it. He says we should pose questions to our systems like "Could a crocodile run a steeplechase?" The intuition is clear: the question can be answered by thinking it through. A crocodile has short legs; the hedges in a steeplechase would be too tall for the crocodile to jump over; so no, a crocodile cannot run a steeplechase. Again, a kind of mental simulation leads us to an answer to this surprising question. And the idea is that questions like this will not be susceptible to cheap tricks. As Levesque says: "Can we find questions where cheap tricks like this will not be sufficient to produce the desired behavior? This unfortunately has no easy answer. The best we can do, perhaps, is to come up with a suite of multiple-choice questions carefully, and then study the sorts of computer programs that might be able to answer them." I think you can hear in that what we now call the adversarial testing mode.
[00:06:02] Now, I'm encouraging you to pose adversarial tests to your systems, and as I said, you can do that by constructing just a few novel examples, but we should be aware of what we're doing. The primary question: what can adversarial testing tell us, and what can it tell us about the systems that we're developing? So here are just a few considerations that should guide your work in this area. First, you don't need to be too adversarial. It could just be that you're posing a challenge problem to assess whether your system has an understanding of a particular set of phenomena: has my system learned anything about numerical terms? Does my system understand how negation works? Does my system work with a new style or genre? These are challenge problems that you could pose open-mindedly, and you might find that your system is surprisingly good at them.
[00:06:49] Second consideration: we should be thoughtful about the metrics we use. As I signaled to you before, the limitations of accuracy-based metrics, like F1 and so forth, are generally left unaddressed by the adversarial paradigm, and that's because we want a minimal contrast with our standard evaluation modes. But I think you can hear in the mission statements, especially from Levesque, that we might at some point want to break free of that very restrictive mode and pose more open-ended, complex evaluations for our systems. That would involve requiring them to offer their evidence and interact with us to resolve uncertainty about what they're supposed to be doing. All of that is kind of left aside in the standard adversarial mode, and we should be aware that it's a limitation.
[00:07:33] This next question is really fundamentally important in adversarial testing: if you see a model failure, is it actually a failure of the model, or of the dataset that the model was trained on? Liu et al. pose this nicely: what should we conclude when a system fails on a challenge dataset? In some cases, a challenge might exploit blind spots in the design of the original dataset; that would be a dataset weakness. In others, the challenge might expose an inherent inability of a particular model family to handle certain natural language phenomena; that would be a model weakness. Now, these are interestingly different from the point of view of development in our field. We're apt to hope, I think, that we find model weaknesses, because those are really fundamental discoveries, but we should be aware that it might be that the system could have done well on our adversarial test if it had just been trained on the right kind of examples, and that would be a dataset weakness. Dataset weaknesses are presumably relatively easy for us to address: we can just supplement the training data with examples of the relevant kind. Whereas model weaknesses are forcing us to confront something that might be an inherent limitation of the set of approaches that we're taking: a much more fundamental insight.
[00:08:47] Atticus Geiger et al., in this paper, offer a similar insight in the context of being fair to our models: "For any evaluation method, we should ask whether it's fair. Has the model been shown data sufficient to support the kind of generalization we're asking of it? Unless we can say yes with complete certainty, we can't be sure whether a failed evaluation traces to a model limitation or a data limitation that no model could overcome." And I'm emphasizing this because it's surprisingly easy to fall into the trap of thinking you have imposed an unambiguous learning target when in fact you have not.
[00:09:22] Just think about this simple example, human to human: suppose I begin the sequence three, three, five, four, and I say to you, what comes next in the sequence? Now, I might have in mind the number seven, but the evidence that I have offered you wildly underdetermines how to continue the sequence, and so I think it's fair to say that no learning agent could, without a lot of ambiguity, figure out what my intended continuation is. And sometimes our adversarial tests have this quality: the available data and experiences of these systems just don't fully disambiguate what our intended learning targets are. We should be aware that those are dataset failings rather than model failings.
[00:10:02] Now, I can offer you a constructive set of techniques for figuring out whether you're dealing with a dataset weakness or a model weakness, and that falls under the heading of inoculation by fine-tuning, from this wonderful paper by Liu et al. that I quoted from before. So Liu et al. just remind us that in the standard challenge evaluation mode, we train our system on some original dataset, and then we test it on both the original test set and our challenge test set, and our expectation is that we'll see outcomes like this, where the system does really well on the original test set and really poorly on the challenge test set. But when we see this outcome, we should ask why this is happening; in particular, is it a model weakness or a dataset weakness? Their proposed method, this inoculation method, works as follows: we're going to fine-tune our system on a few of our challenge examples, and then re-test on both the original test set and the held-out parts of the challenge test set.
[00:11:00] When we do this, there are roughly three classes of outcome that you might see. The first would point to a dataset weakness: if, via this little bit of fine-tuning on the challenge dataset, we can get good performance on both the original and the challenge dataset, that shows us that in the original evaluation mode the system just didn't see enough of the relevant kinds of examples from your adversarial tests to have any hope of succeeding, but a modest amount of training on those examples leads it to do fine. That's a dataset weakness.
[00:11:30] A model weakness is what we might have in the back of our minds for our adversarial testing. This would be the case where, even though we have fine-tuned our system on some of these challenge examples, its performance remains really low, even though the system can maintain good performance on the original dataset. It's as though there is something special about these new examples and the model simply cannot get traction.
[00:11:53] There's a third outcome that might be really worrisome, and they would trace this to something like annotation artifacts or label shift. That's where, in doing this fine-tuning on some challenge examples, we see degraded performance on both the original dataset and the challenge dataset. That would show that there is something fundamentally confusing about these adversarial testing examples that causes a lot of problems for the system we've developed, because even a modest amount of fine-tuning causes consequences to ripple through the system that impact even performance on the original dataset.
[00:12:31] All right, to close out this screencast, let me offer you two examples of interesting adversarial tests in our field, beginning with the SQuAD question answering dataset. I showed you this leaderboard from SQuAD 2.0 at the start of the quarter, and the funny thing, of course, is that you have to go all the way to place 13 on the leaderboard to find a system that is worse than our estimate of human performance. So we have superhuman performance on SQuAD, but what does that really mean? SQuAD was also the site of one of the first really systematic adversarial testing efforts in our field. This is from Jia and Liang 2017.
[00:13:04] What they did is quite simple. We begin with SQuAD examples, where we have passages and questions as inputs, and the system's task is to answer the question; we can count on the answer being a literal substring of the passage that the system was given. The adversarial thing that Jia and Liang did was simply to append misleading sentences to the ends of those passages.
[00:13:26] What they found is that systems were systematically misled by those final sentences: whereas humans could easily ignore them, systems were now inclined to answer drawing on information from those misleading new sentences.
[00:13:39] This is kind of an interesting dynamic, because you might think, well, we'll just train our system now on passages that have these augmented misleading sentences, and then surely our systems will be more robust. That might be true in some sense, but of course we could then just append sentences to the start of the passages, and Jia and Liang found that systems were now confused by the prepended initial sentences and started to give wrong answers in that mode as well. You could go back and forth in this adversarial mode, showing that systems were worrisomely easy to trick with these simple appended misleading sentences.
[00:14:16] Now, that's a very interesting point about the adversarial evaluation mode. What I think is more interesting about the outcomes is that they begin to show us just how different adversarial testing can be.
[00:14:28] Here's the SQuAD leaderboard, the original at the time of the paper, as well as the results of this adversarial test. You can see, first of all, that system performance has really plummeted, so this turns out to be highly adversarial to these systems, for whatever reason. I think it's more interesting to note that the system ranking has really been mixed up; it's not as though the systems uniformly dropped in performance. As we move from the original rank to the adversarial rank, the first-place system is now in place five and the second has dropped all the way to place ten, but the seventh-place system is now in first place. It looks kind of chaotic.
[00:15:06] Here's a scatter plot, with the original system performance along the x-axis and the adversarial system performance along the y-axis, and you can see that it's kind of chaotic: there's no way that one predicts the other. So something very interesting has happened, and that's noteworthy because it looks meaningfully different from what happens when we do standard evaluations. I don't have direct evidence of this from SQuAD, but here's a case where people took two classic image datasets and simply created new test sets according to the same protocols that were used for the original test sets. What you find is a very strong correlation: even though the examples are new, because it's the same protocol, original test accuracy is highly predictive of accuracy on these new test sets. That's very different from this adversarial mode, where something much more chaotic happened. So adversarial testing, it seems, is meaningfully different from standard evaluations.
[00:16:08] Let's move to NLI now. This will teach us two different lessons about how adversarial testing can be informative. We saw at the start of the course that we now have superhuman performance on the SNLI dataset; that's certainly noteworthy. And we're reaching superhuman performance on the MultiNLI test set; we're sure to be there, if we're not already, at the time of this screencast.
[00:16:28] But we've also seen that systems that perform really well on these datasets are often susceptible to adversaries. In the first screencast I showed you these examples from Glockner et al.'s Breaking NLI paper, where they make simple modifications to these hypotheses and find that systems do not behave systematically with respect to human intuitions about the modified examples, and that's the worrisome part. They quantify that, of course, and you can see that the best systems at the time were doing pretty well on SNLI while their performance plummeted on these new test sets, showing that this was, for whatever reason, truly adversarial for those systems.
[00:17:07] But I also presented this as a story of progress, right, because I showed you that by simply downloading RoBERTa MultiNLI, that is, RoBERTa fine-tuned on the MultiNLI dataset, you have a device with which, with no work, you can essentially solve this adversarial test. I think that's really striking, and it points to the fact that RoBERTa, unlike those earlier models, might truly have a systematic understanding of the relevant kinds of lexical relationships that you need to solve this adversarial test set. So that's an exciting outcome.
[00:17:39] Here's a second outcome that you might see. This is from the Naik et al. paper, which runs a wide battery of different adversarial stress tests on MultiNLI data. They did a bunch of things, like antonyms: 'I love the Cinderella story' contradicts 'I hate the Cinderella story', just drawing on lexical knowledge. They asked about numerical reasoning across these two premises. Word overlap is a little bit different, in that you're doing something like inserting material that you might think is going to be distracting, in the mode of the SQuAD adversary, and seeing what effects that has. The same goes for negation: adding on to the end information that's going to be misleading for a system and that also includes a lot of negation elements. There are a few other modes that I didn't have space for; it's a very rich paper, with a very fine-grained breakdown of how systems do on these different adversarial problems.
[00:18:31] Here's a picture of the dataset, and here's a breakdown. I think the overall takeaway is that the numbers across the board are very low on these adversaries. That's interesting: it looks like even top-performing systems for MultiNLI are stumbling on problems that we surely want our systems to be able to solve if we're going to call them true common-sense reasoners.
[00:18:53] However, this was actually the basis for a number of the experiments in that inoculation-by-fine-tuning paper that I quoted from before. Here's a rough picture of the performance results that they report on different subsets of that adversarial test on MultiNLI, and you can see that it shows all the different outcomes that we discussed under the heading of inoculation by fine-tuning. To simplify things, just focus in on the green lines: the ones with dots are the original system performance and the ones with crosses are performance on the challenge dataset.
[00:19:26] This first column here is identifying dataset weaknesses. What you're seeing is that as we fine-tune on more examples from the challenge dataset, going along the x-axis here, we very quickly get a system that's actually good at this challenge problem. After just about 50 examples, the system is basically learning to solve the word overlap and negation problems, although negation takes a little bit longer, at 400 examples. So that's a case where the original systems were failing not because of any intrinsic property of the models being used, but rather because the data just didn't have enough information to resolve these learning targets.
[00:20:04] We also see outcome 2, which is a model weakness. Again, follow the green lines, and you can see that for spelling errors and length mismatch, no amount of fine-tuning on challenge examples helps these systems get traction on these problems; those are these flat lines here. That's showing that there's something fundamentally wrong, possibly with these models: they are just unable to solve these two challenge problems.
[00:20:26] And then we also see the third outcome, for numerical reasoning. You'll notice that system performance is really chaotic here, and that's suggesting that there is something importantly and problematically different about these challenge examples, because fine-tuning on them causes the system to become really chaotic in its predictions: we get degraded performance not only on the challenge set but also on the original dataset, showing that we've done something quite disruptive.
[00:20:53] So this is really interesting: from this one battery of adversarial tests, we see all these different outcomes, pointing us to all sorts of different lessons about what action we should take to make these systems more robust.
Lecture 051
Adversarial Training (and Testing) | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=mnKQHwfp384
---
Transcript
[00:00:05] Welcome, everyone, to part three in our series on analysis methods in NLP. We're going to be talking about adversarial training, as well as testing, of systems. This is the second of the behavioral evaluation methods that we're considering; we previously talked about adversarial testing.
[00:00:19] Adversarial training and testing, of course, implies that we have a much larger dataset, so this is more difficult to do. But for selected tasks where we have such datasets, this can be very exciting and push you to address all sorts of interesting cutting-edge questions.
[00:00:33] I'll start with SWAG. This is an early entry into the space of adversarial train sets; SWAG stands for Situations With Adversarial Generations. There are actually two datasets, SWAG and the colorfully named HellaSwag, and you'll see why there are two in a second. This is, fundamentally, another interesting story of very rapid progress in our field.
[00:00:56] Here's how SWAG examples work. We're given, as a system input, a context like 'He is throwing darts at a target', and another system input which is the start of a sentence, here 'Another man'. The task of the system is to figure out what the continuation should be, so the actual continuation that we predict might be 'throws a dart at the target board'. This is fundamentally a classification task: the system is given some distractors, like 'comes running in and shoots an arrow at a target', or 'is shown on the side of men', or 'throws darts at a disc', and the system is tasked with figuring out which of the options is the actual continuation for the sentence, given the context.
[00:01:34] The data sources for this are ActivityNet and the Large Scale Movie Description Challenge. I think the idea here is that we're going to key into all sorts of interesting notions of common-sense reasoning.
[00:01:45] Now here's where the adversarial part of this comes in: we're going to do adversarial filtering for SWAG. For each of the examples in our corpus, and there are over a hundred thousand examples in SWAG, we're given the system input, like 'The mixture creams the butter, sugar', and then we'll have a generator model, in the case of SWAG this was an LSTM, produce some distractors for the target. So let's suppose that the actual target continuation is 'is added'; we'd have the model produce 'is sweet' and 'is in many foods'.
[00:02:14] And then we have the filtering model. If it guesses correctly, here 'is added', then we're going to drop out this set of distractors and create some new ones, like 'is sprinkled on top' or 'is in many foods'. If the model instead guesses incorrectly, suppose it chooses (b) in this case, then we'll keep this example, because relative to the current models, the thing we're using to generate these distractors and the thing we're using to filter, this is a challenging example. The idea is that we can repeat this for a bunch of iterations, continually retraining the filtering model so that it gets better and better, and therefore ending up with a dataset that is really, really difficult in terms of the current models that we had available to us.
[00:02:57] Here's a picture of test accuracy. This is kind of interesting: they actually did an ensemble of filtering models, to try and key into different notions that might indicate which is the correct continuation. They start with just a multi-layer perceptron, for efficiency, and then they bring in all of these ensembles, and you can see that test accuracy, as we do this iterative filtering, very quickly goes down, so that by iteration 140 we're at 10% accuracy. So that's the sense in which this is a very difficult dataset: given the generator model and the filtering model that we have available to us, we have a dataset that is very difficult in terms of a classification task.
[00:03:36] So that looks really exciting and challenging, and I think the authors expected this dataset to last for a very long time. However, the original BERT paper did evaluations on SWAG and essentially solved the problem: BERT-large got 86.6 and 86.3 on the dev and test sets for SWAG, respectively. A very unexpected result, given that I just showed you that the SWAG authors got about 10% with their current models, and even models closely related to BERT, like this ESIM model here, were really pretty low in their performance. So BERT looked like a real breakthrough, and you can see that it's in some sense kind of superhuman relative to the SWAG estimates.
[00:04:17] So, wow. Of course, we know what the response should be, given that we're talking essentially about model-in-the-loop adversarial dataset creation, and that leads us to HellaSwag. They made some changes to the datasets that they use for HellaSwag, but I would say the fundamental thing is that we do the same kind of adversarial filtering with a generator, except now we have much more powerful filtering and generator models, thanks to developments related to transformers. For HellaSwag, we again have human performance that's really good. This is very reassuring, because we are using much more powerful models at step four. As you can expect, BERT is no longer easily able to solve this problem. Here's a further summary of the results: BERT-large, which before essentially solved SWAG, is now down around 50%, which shows that it still gets traction, but is nothing like the superhuman performance that we saw for SWAG.
[00:05:13] Okay, now let's move into a slightly different mode, and this is going to be a kind of human-in-the-loop adversarial dataset creation method. The first entry in this space was the Adversarial NLI dataset. I think this is a really visionary and exciting paper. Adversarial NLI is a direct response to the previous things that we've seen with the SNLI and MultiNLI datasets, where models seem to do well on those benchmarks but are easily susceptible to simple adversaries. With Adversarial NLI, we're going to hopefully push systems to be much more robust to those adversaries, and explore a much wider range of the space of things you might see under the heading of natural language inference. So here's how it worked.
[00:05:54] There's a human in the loop, an annotator. The annotator is presented with a premise sentence and a condition that they need to be in, which is just an NLI label: entailment, contradiction, or neutral. The annotator writes a hypothesis to go along with the premise and the condition, and then a state-of-the-art model comes in and makes a prediction about the premise-hypothesis pair. If the model's prediction matches the condition, that is, if the model was correct, then the annotator needs to return to step two and try again with a new hypothesis, and we could continue in that loop. If the model was fooled, the premise-hypothesis pair is independently validated by other annotators, of course. So what we get out of this is, we hope, a dataset that is intuitive for humans, because of the check at step five, but also, assuming we continue to loop around through steps two, three, and four, examples that are really difficult for whatever model is in the loop. And the expectation is that as we put better and better models in the loop, we're going to get even more challenging datasets as an outcome.
[00:06:55] ANLI examples tend to be impressively complex. You can see that this example has a very long premise; the hypothesis is relatively shorter. An intriguing aspect of Adversarial NLI is that annotators also constructed a reason, or a rationale, for their label holding between the premise-hypothesis pair. To date, as far as I know, relatively little use has been made of these texts, but I think they could bring in other aspects of natural language inference reasoning, and that could be an exciting new direction.
[00:07:24] Adversarial NLI is a difficult dataset indeed. We have a similar sort of leaderboard that we've seen throughout this adversarial regime, where, across the different rounds of ANLI (there are three, plus the cumulative combination for the full dataset), even really excellent models that do really well on SNLI and MultiNLI are posting really low numbers for all of these variants of the dataset. That shows you that this is truly a difficult problem, and as far as I know, not much progress has been made on boosting these numbers since this dataset was released. So it stands as an interesting challenge.
[00:08:00] Stepping back here, I'd just like to say that I think we find in this paper a real vision for future development, and you see this also in the SWAG and HellaSwag papers. As those authors say, this adversarial dataset creation is a path for NLP progress going forward, toward benchmarks that adversarially co-evolve with evolving state-of-the-art models. Right? With SWAG and HellaSwag we saw this: SWAG got solved, but the response was clear: bring the best model in and use it to create the successor dataset, which stands as a real challenge. You have a similar picture from the Adversarial NLI paper: this process of having iterative rounds with humans in the loop yields a "moving post" dynamic target for natural language understanding systems, rather than the static benchmarks that eventually saturate. And we've seen repeatedly that our benchmarks saturate very quickly these days, so we need this kind of moving post to make sure we continue to make meaningful progress.
[00:08:56] The Nie et al. project gave rise, I believe, to the Dynabench platform, an open-source platform for model-and-human-in-the-loop dataset creation. As of this writing, there are four datasets available that have been created on Dynabench: an NLI dataset, which is a kind of successor to ANLI; a question answering dataset; a sentiment dataset; and a hate speech dataset. So if you're working on problems of this form, or you have a model that would fit into this mold for one of these tasks, I would encourage you to explore some training of systems on these datasets, to see whether you're making progress, or whether they stand as true adversaries for whatever innovative thing you're doing.
[00:09:37] Finally, I want to close with a really important question for this area that kind of remains open: can adversarial training improve systems? There is, of course, a concern that as we construct ever harder datasets, we're pushing systems into stranger parts of the linguistic and conceptual space, which could actually degrade their real-world performance. We have to keep an eye on that. The evidence so far, I think, is pointing to yes as an answer to this question, but the evidence is a bit mixed. I've mentioned that in the SQuAD adversarial paper from Jia and Liang, training on adversarial examples makes models more robust to those examples, but not to simple variants, so it's hardly very much progress. In this paper, they found that adversarial training provided no additional robustness benefit in the experiments using the test set, despite the fact that the model achieved near 100% accuracy classifying adversarial examples included in the train set. So that's a more worrisome picture. But this is more hopeful: fine-tuning with a few adversarial examples improved systems in some cases, especially where you bring in inoculation. And this is hopeful yet again: adversarially generated paraphrases improve model robustness to syntactic variation. That's really the dream there: that as a result of doing this new kind of training, we get systems that are truly more robust. But I think we need more evidence on this picture, which means more datasets of this form, and more interesting uses of the available resources. I would just love to see what the emerging picture is over the next year or two.
Lecture 052
Probing | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=ElDtkhqv5ZE
---
Transcript
[00:00:05] Welcome, everyone. This is part four in our series on analysis methods in NLP. We're going to be talking about probing. This is the first of the two structural evaluation methods that we're going to consider. It's time to get really introspective about what our models are doing.
[00:00:18] Here's an overview of the probing idea. The core thing is that we're going to use supervised models (those are the probe models) to determine what's latently encoded in the hidden representations of our target models. This is often applied in the context of BERTology, which would be: I have BERT as a pre-trained artifact, and I would like to understand the nature of its hidden representations. What do they latently encode? And for that you might use probe models.
[00:00:49] Probing, as you will see, can be a source of really valuable and interesting insights, but we do need to proceed with caution on two major issues. First, a very powerful probe model, since it is a supervised model, might lead you to see things that aren't really in your target model, but rather just things that your probe model has learned. You might therefore over-diagnose latent information in your target model when in fact it's all being stored in the probe, and I'm going to offer you a technique for navigating around that issue. The second is that probes cannot tell us whether the information that we identify has any causal relationship with the target model's behavior. It will be very tempting for you to say, "Oh, I have discovered that this representation layer includes part-of-speech information," and you might therefore conclude that part-of-speech information is important for whatever task you have set. But we can't actually make that inference: it could be that the part-of-speech information is simply latently encoded but not actually especially relevant to your model's input-output behavior.
[00:01:57] In the final section of this slideshow, I'm going to talk briefly about unsupervised probes, which seem to address the first problem here, that the probe model might actually be the thing encoding all of this information that we claim to have discovered. And then, when we talk about feature attribution methods, we'll get closer to being able to address some of these causal questions.
[00:02:18] Let's begin with the core method for probing. Just because this is a typical framing of these ideas, I've depicted here what you might think of as a kind of generic transformer-based model, where we have three layers with all these blocks; these are maybe the output representations from each of the transformer blocks. You can see that I've got an input sequence coming in here, and the idea would be that we could pick some hidden representation in this model, like this middle one, h, here, and decide that we're going to fit a small linear model, presumably, on that hidden representation, and see whether we can figure out whether that representation encodes some information about some task that we care about. So, for example, if you wanted to figure out whether sentiment or lexical entailment was encoded at that point, you'd need a labeled dataset for sentiment or entailment, and then you would fit the probe model on this representation and use that to determine the extent to which that information is encoded there.
[00:03:14] This depiction is a little bit poetical, so it's worth just walking through mechanically what you'd actually be doing. You would use this BERT model to process different examples, like the sequence here, and get an output representation, which would be paired with some task label. And you would repeatedly do that for different inputs. You're essentially using this BERT model as an engine for creating representations that will become your feature representation matrix X, paired with your labels y, and it is this dataset that will be the basis for your linear probe, this small linear model as I've identified it here. So you're kind of using BERT as an engine to create a dataset that is then the input to a supervised learning problem. Another perspective would be that you're kind of using frozen parameters, in this case, and fitting a model on top of them; it's just that instead of picking an output point, you pick, possibly, one of the internal representations. And this is very general. In fact, most often when you read about probes in the literature, they're actually sequence problems, like part-of-speech tagging or named entity recognition, and therefore you might use an entire layer, or even a set of layers, as the basis for your probe model.
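Mechanically, once the frozen target model has been run over the dataset and the hidden states collected, fitting the probe is an ordinary supervised learning problem. In this sketch the BERT feature-extraction step is replaced by random, linearly separable vectors standing in for hidden states, and the probe is a minimal least-squares linear classifier rather than any particular probing library.

```python
import numpy as np

def fit_linear_probe(X, y):
    """Fit a small linear probe on frozen hidden representations X
    (n_examples x dim) with binary labels y, via least squares."""
    Xb = np.hstack([X, np.ones((len(X), 1))])              # add a bias column
    w, *_ = np.linalg.lstsq(Xb, y.astype(float), rcond=None)
    # The probe is just this learned linear map plus a 0.5 threshold.
    return lambda H: (np.hstack([H, np.ones((len(H), 1))]) @ w > 0.5).astype(int)

# Stand-in for hidden states gathered from the target model: two clusters,
# as if the chosen layer latently encoded a binary property.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 8)), rng.normal(2.0, 1.0, (50, 8))])
y = np.array([0] * 50 + [1] * 50)

probe = fit_linear_probe(X, y)
accuracy = (probe(X) == y).mean()  # high accuracy = linearly decodable
```

High probe accuracy is then read as evidence that the property is linearly decodable from that layer, subject to the caveats about probe capacity discussed next.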
[00:04:27] now you can hear in my description there
[00:04:29] Now, you can hear in my description there that there is an interesting judgment call you're making about whether you are probing or simply learning a new model. Probes, in the sense that I just presented them, are supervised models whose inputs are frozen parameters of the models that we're probing, our target models. This is hard to distinguish from simply fitting a supervised model as usual, with some particular choice of featurization. As a result, it is essentially a foregone conclusion that at least some of the information that we identify with our probe is actually stored in the probe model's parameters, and it's just that we've provided useful input features that allow the probe to be successful. That's the sense in which the inputs are latently encoding this information, but with the probe we have not determined that the information is truly latently there, only that the inputs are acting as a stepping stone toward a model that could be successful at the task, conceived of as a supervised learning task. Those are important distinctions to keep in mind.
[00:05:27] As a result, more powerful probes like deep neural networks might find more information than simple linear models, but that's not because they're able to tease out more information from the representations themselves; rather, it's because the probe model now has so much more capacity for storing information about the task that you're probing for.
[00:05:49] So there are a bunch of different judgment calls here, and that's difficult. A very productive entry into this space is this really lovely paper from Hewitt and Liang 2019, where they introduced the notion of a control task and the corresponding metric of probe selectivity. Here's the idea: a control task will be some random task with the same input-output structure as the target task that we want to use for our probing.
[00:06:14] For example, for word sense classification you might have words assigned random fixed word senses, independent of their context. Or for part-of-speech tagging, instead of using the actual part-of-speech tags, you might randomly assign words to fixed tags from the same tag space. For parsing, things get a little bit more nuanced, but you might have some edge-assignment strategies that you apply semi-randomly to link different pairs of words into a kind of pseudo-parse, and that would serve as a control task for trying to surface latent actual parsing information.
[00:06:48] So those are control tasks, and then selectivity is simply the difference between your probe's performance on the task and the performance of an identical probe model structure on the control task.
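To make the selectivity calculation concrete, here is a small self-contained sketch. Everything in it is invented for illustration (the synthetic "frozen representations", the label construction, and the simple logistic-regression probe trained by gradient descent); it is not the Hewitt and Liang experimental setup. The real task is linearly recoverable from the representations, the control task uses random fixed labels, and selectivity is the gap between the two probe accuracies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "frozen representations" from a target model: dimension 0
# carries the real label signal; the other dimensions are pure noise.
n, d = 200, 10
y_real = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 3.0 * y_real            # real task is linearly recoverable

# Control task: random fixed labels, independent of the representations.
y_control = rng.integers(0, 2, size=n)

def linear_probe_accuracy(X, y, steps=500, lr=0.1):
    """Fit a logistic-regression probe by gradient descent; return train accuracy."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        g = p - y                                # logistic-loss gradient signal
        w -= lr * X.T @ g / n
        b -= lr * g.mean()
    return float(((X @ w + b > 0) == y).mean())

acc_real = linear_probe_accuracy(X, y_real)
acc_control = linear_probe_accuracy(X, y_control)
selectivity = acc_real - acc_control             # high selectivity = trustworthy probe
```

A low-capacity probe like this one should score well on the real task and near chance on the control task, so selectivity stays high; swapping in a big MLP would drive the control accuracy up and the selectivity down, which is exactly the point of the Hewitt and Liang metric.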
[00:07:00] Hewitt and Liang use this to tease out what I think is a pretty clear intuition, which is that as you get more powerful probes, they simply become less selective. So along the x-axis here we have MLP hidden units, giving us model complexity from left to right, with very complicated, powerful models at the right-hand side, and on the y-axis we have accuracy. We're measuring our control task in red and our actual probe task in this light blue here, and selectivity is the difference between those two. You can see, for example, that the very weak models, the ones with two hidden units, have very high selectivity, whereas by the time we have this very powerful MLP with lots of hidden units, selectivity has gone almost to zero, and it's very hard to say that you've uncovered any latent information, because even the control task is fully solvable with a model that has this much capacity.
[00:07:56] So I think what this is pushing us toward is always having control tasks as part of the picture and always reporting selectivity, so that we can control for the complexity of the probe model itself. That's an important and easy practical step that will give you a clearer picture of what you've actually surfaced with your probe.
[00:08:17] That's the first issue. The second issue is just something that you should keep in mind as a theoretical fact about probing, which is that it is fundamentally limited in the sense that it cannot tell you that the information you discover has any causal impact on the model's input-output behavior. To illustrate that, I'm just going to show you a simple example that essentially proves this.
[00:08:37] So imagine over here on the left I have a simple model that's going to take in three integers and sum them, so the output here will be the sum of three integers; if I put in 1, 2, 3, it will output 6. It does that by representing each one of those integers as the one-dimensional vector that just is that integer, and then we have a whole bunch of transformer-like model parameters, dense connections here, that will lead us finally to the output layer.
[00:09:02] So you can easily imagine that you probe this position L1 here and you find that it computes x plus y, which might be starting to reveal for you that there's some kind of tree structure to this model: even though it was densely connected, it has learned a structured solution to the problem. And you might probe L2 and find that it computes z, and that would really lead you to think that you've got a kind of interesting tree structure, with constituents, for this addition problem. That's certainly suggestive. However, here is an example of a model that shows that neither L1 nor L2 has anything to do with the model's output predictions; it is entirely that middle state that tells the complete story about the output.
[00:09:45] I'll leave you to work through the details if you choose to, but a shortcut way to see this is that the final parameters that take us from these output representations to the predictions have zeroed out the first and third positions, leaving only the second one as having any kind of causal efficacy, even though, if you probe this model, you do indeed find that it looks like those representations perfectly encode these two pieces of information. That's a dramatic, clear, and simple illustration of how a probe can become divorced from the actual causal behavior of the model, and again something that's worth keeping in mind.
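The spirit of that illustration can be reproduced numerically. The network below is hypothetical (the hidden layer and readout weights are hand-built for this sketch, not taken from the slide): the hidden layer perfectly encodes x + y and z, so a probe would recover them, yet the readout assigns those positions zero weight, so they have no causal effect on the output:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 3)).astype(float)   # triples (x, y, z)

# Hypothetical hidden layer: position 0 encodes x + y, position 2 encodes z,
# and the middle position already holds the full sum x + y + z.
H = np.stack([X[:, 0] + X[:, 1], X.sum(axis=1), X[:, 2]], axis=1)

# Readout zeroes out positions 0 and 2: the output depends only on the middle unit.
w_out = np.array([0.0, 1.0, 0.0])
pred = H @ w_out
assert np.allclose(pred, X.sum(axis=1))        # the model still sums correctly

# A probe of position 0 finds x + y exactly (it is encoded perfectly) ...
assert np.allclose(H[:, 0], X[:, 0] + X[:, 1])

# ... yet ablating that unit changes nothing about the model's predictions.
H_ablated = H.copy()
H_ablated[:, 0] = 0.0
assert np.allclose(H_ablated @ w_out, pred)
```

Probing success here tells us only that the information is present, not that the model uses it; establishing use requires an intervention, like the ablation in the last step.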
[00:10:22] And finally, to close this out: for that first problem, about distinguishing between probe capacity and actually latently encoded information, one response that's developing in the literature now is to develop unsupervised probes. These would be models that seek to find, using only actual facts about the model and no additional supervision, the latent information that we hope to find. This would come from simply doing linear transformations of the parameters and measuring distances between parameters, as a way of getting a sense for what's actually there, without the complications that come from having this additional supervised probe model.
[00:11:03] And finally, for much more information about probes, what we think we've learned from them, and what they can tell us, I encourage you to check out this paper by Rogers et al., "A Primer in BERTology." It has a large and interesting subsection entirely devoted to what probes have told us. It's certainly worth a look, and a great overview of the space.
Lecture 053
Feature Attribution | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=RFE6xdfJvag
---
Transcript
[00:00:05] Welcome, everyone. This is part five in our series on analysis methods in NLP. We're going to be talking about feature attribution methods. This is fundamentally a powerful toolkit for helping you understand how the features in your model contribute to its output predictions.
[00:00:19] Our fundamental question here is: why does your model make the predictions that it makes? There are many motivations for asking this question; here are just a few. To start, you might just want to understand whether your model is systematic with regard to some specific linguistic phenomenon: has it actually captured that phenomenon? You might also want to know whether it's robust to minor perturbations in its input. You might use these techniques to diagnose unwanted biases in your model, and, relatedly, you might use them to find weaknesses in your model that an adversary could exploit to lead your model to do really problematic things.
[00:00:54] Fundamentally, I think that this is a toolkit that will help you write really excellent analysis sections for your papers. To that end, I'm going to try to show you a bunch of code that will help you get hands-on with these techniques. I'll do it at a kind of high level in the screencast, and I've just contributed this new notebook, "feature attribution", to the course code repository. It should be flexible and adaptable, and help you take these techniques and apply them to whatever models and ideas you're exploring for your projects.
[00:01:24] The star of our show, really the only reason that I can do this, is this amazing Captum.ai library. It implements a wide range of feature attribution techniques. We're going to talk extensively about the integrated gradients method and use a gradient-based method as a kind of simple baseline for that method, but as you can see here, Captum implements a wide range of different algorithms, some very particular to specific model designs and others completely agnostic about what kind of model you're exploring. So it's a very exciting toolkit.
[00:01:57] The Sundararajan et al. 2017 paper introduced the integrated gradients method. It's also a lovely contribution because it gives us a kind of framework for thinking about feature attribution methods in general, and as part of that, they offer two axioms that I'm going to use to guide this discussion. The first, and the more important one, is sensitivity.
[00:02:18] important one is sensitivity if two inputs x and x prime differ only
[00:02:20] if two inputs x and x prime differ only at dimension i and lead to different
[00:02:23] at dimension i and lead to different predictions then the feature associated
[00:02:26] predictions then the feature associated with that dimension must have non-zero
[00:02:28] with that dimension must have non-zero attribution
[00:02:29] attribution and with my simple example here you can
[00:02:31] and with my simple example here you can get a sense for why sensitivity is such
[00:02:33] get a sense for why sensitivity is such a fundamental axiom if for some model m
[00:02:36] a fundamental axiom if for some model m and three dimensional input one zero one
[00:02:39] and three dimensional input one zero one we get a prediction of positive and if
[00:02:41] we get a prediction of positive and if for that same model the input one one
[00:02:44] for that same model the input one one one leads to the prediction negative
[00:02:46] one leads to the prediction negative then we really ought to expect that the
[00:02:48] then we really ought to expect that the feature associated with the second
[00:02:50] feature associated with the second position must have non-zero attribution
[00:02:52] position must have non-zero attribution because it must be decisive in leading
[00:02:55] because it must be decisive in leading the model to make these two different
[00:02:57] the model to make these two different predictions
[00:02:59] The second axiom is going to be less important to our discussion, but it's nonetheless worth having in mind. It is implementation invariance: if two models M and M′ have identical input-output behavior, then the attributions for M and M′ are identical. This is really just saying that the attributions we give should be separate from any incidental differences in model implementation that don't affect the input-output behavior of that model.
[00:03:26] To start our discussion, let's begin with this simple baseline, which is simply multiplying the gradients by the inputs. This is implemented in Captum as InputXGradient. As I'm showing here, with respect to some particular feature i, given model M and input x, we simply get the gradient for that feature and then multiply it by the actual value of that feature. It's as simple as that. Here are two implementations. The first one, in cell 2, does this using raw PyTorch, just to show you how we can use PyTorch's autograd functionality to implement this method. The second implementation is from Captum, and it's probably more flexible; it uses this InputXGradient class.
[00:04:10] To give you a full illustration, here I've just set up a simple synthetic classification problem using scikit-learn tools. My model will be a TorchShallowNeuralClassifier, which I fit on that data, and then in cells 9 and 10 I use those two implementations of this method; you can see in cells 11 and 12 that they give identical outputs. Another thing worth noting here is that I have used the method by taking gradients with respect to the actual labels in our dataset. You can often get a different picture if you take gradients with respect to the predictions of the model, and that might give you a better sense for why the model is making the predictions that it makes. In this case, since the model is very good, the attributions are only slightly different.
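As a rough, dependency-light sketch of the gradient-times-input recipe (not the notebook's PyTorch or Captum code), here is the computation for a tiny fixed ReLU network whose weights are invented for illustration, with central finite differences standing in for autograd:

```python
import numpy as np

# Hypothetical bias-free ReLU network: 3 inputs -> 2 hidden units -> 1 output.
W1 = np.array([[1.0, -2.0],
               [0.5,  1.5],
               [-1.0, 1.0]])
w2 = np.array([2.0, -1.0])

def model(x):
    h = np.maximum(0.0, x @ W1)     # ReLU hidden layer
    return h @ w2                   # scalar output

def grad(x, eps=1e-5):
    """Central finite-difference approximation to the gradient of model at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (model(x + e) - model(x - e)) / (2 * eps)
    return g

x = np.array([1.0, 2.0, 3.0])
attr = grad(x) * x                  # gradient-times-input attribution
```

For this particular input the attributions work out to [2, -3, -3], and they sum to the model output (-4); that completeness-style property holds here because a bias-free ReLU network is first-order homogeneous.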
[00:04:53] That's our kind of baseline. I want to show you now that the input-by-gradient method fails the sensitivity test, and this is an example from the Sundararajan et al. paper. They give this simple model M here, which is effectively just ReLU applied to one minus the input, after which you take one minus that ReLU calculation; that's the model, and it's got one-dimensional inputs and outputs. If you calculate for input 0, you get an output of 0, and if you give the model input 2, you get an output of 1. Since we have differing output predictions, sensitivity tells us that we have to have differing attributions for these two cases, these two one-dimensional inputs. But unfortunately, when you calculate through with this method, you get zero attribution in both cases. That's a failure of sensitivity, and it points to a weakness of this method.
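That counterexample is easy to work through numerically. Here is a sketch of the model M(x) = 1 − ReLU(1 − x), with the gradient approximated by a finite difference:

```python
# Sundararajan et al.'s sensitivity counterexample: M(x) = 1 - ReLU(1 - x).
def relu(z):
    return max(0.0, z)

def M(x):
    return 1.0 - relu(1.0 - x)

def grad(x, eps=1e-4):
    # central finite-difference approximation to dM/dx
    return (M(x + eps) - M(x - eps)) / (2 * eps)

# Different outputs at x = 0 and x = 2 ...
assert M(0.0) == 0.0 and M(2.0) == 1.0

# ... yet gradient-times-input assigns zero attribution to both:
attr_at_0 = grad(0.0) * 0.0   # the input is 0, so the product is 0
attr_at_2 = grad(2.0) * 2.0   # the ReLU is saturated, so the gradient is 0
```

The two zeros arise for different reasons (a zero input versus a zero gradient), but either way sensitivity is violated: the outputs differ while the attributions do not.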
[00:05:51] Let's move now to integrated gradients, and let me start by giving you the intuition for how this method is going to work. Imagine we have a simple two-dimensional feature space, features x1 and x2, with the actual input point represented here. The idea behind integrated gradients is that we're going to compare that point with respect to some baseline; the typical baseline for us will be the all-zeros vector. Then, to do the comparison, what we'll actually do is interpolate a bunch of points between that baseline and our actual input, take gradients with respect to each one of them, and average all of those gradient results, and that will give us a measure of feature importance.
[00:06:30] Here's the calculation of the method in full detail. I've taken this presentation from a really excellent TensorFlow tutorial on integrated gradients, which adds annotations that I find quite helpful. Here is fundamentally how this works. The core thing is in purple: we're going to interpolate a bunch of different inputs between that baseline of all zeros and our actual input; that's what's happening here. We'll take the gradients with respect to each one of those interpolated inputs, with respect to each one of the features, and we're going to sum those up and average them, and that gives us the core calculation here. Then, in step five, we just scale that resulting average with respect to the original input, to put it back on the same scale. And as I show here, integrated gradients obeys the sensitivity axiom.
[00:07:17] gradients obeys the sensitivity axiom let's go back to that original example
[00:07:19] let's go back to that original example of a simple value based model
[00:07:21] of a simple value based model presented here i showed you that the
[00:07:24] presented here i showed you that the input by gradients method failed
[00:07:25] input by gradients method failed sensitivity for this model
[00:07:28] sensitivity for this model integrated gradients of course is
[00:07:29] integrated gradients of course is sensitive in the relevant sense and you
[00:07:31] sensitive in the relevant sense and you can kind of see why that's happening
[00:07:33] can kind of see why that's happening because our core calculation now is not
[00:07:35] because our core calculation now is not with respect to a single input in the
[00:07:37] with respect to a single input in the case of the input 2 but rather with
[00:07:39] case of the input 2 but rather with respect to all of those interpolated
[00:07:41] respect to all of those interpolated feature representations although some of
[00:07:44] feature representations although some of those interpolated feature
[00:07:45] those interpolated feature representations give a gradient of zero
[00:07:47] representations give a gradient of zero not all of them do and the result in
[00:07:50] not all of them do and the result in effect is that you'll get a feature
[00:07:52] effect is that you'll get a feature attribution of approximately one for
[00:07:54] attribution of approximately one for this case of an input two
[00:07:56] this case of an input two the desired result showing sensitivity
[00:07:59] the desired result showing sensitivity because of course the input of zero in
[00:08:00] because of course the input of zero in this case will give an attribution of
[00:08:02] this case will give an attribution of zero
[00:08:05] now let me walk you through a few
[00:08:06] now let me walk you through a few examples that show you how you can use
[00:08:08] examples that show you how you can use captum to get hands-on with the
[00:08:11] captum to get hands-on with the integrated gradients method i'm going to
[00:08:12] integrated gradients method i'm going to do that for two classes of model the
[00:08:15] do that for two classes of model the first one is just a simple feed forward
[00:08:17] first one is just a simple feed forward network and what i'm doing is
[00:08:18] network and what i'm doing is reconnecting with the stanford sentiment
[00:08:20] reconnecting with the stanford sentiment treebank that we use during our
[00:08:22] treebank that we use during our sentiment unit so on this slide i've
[00:08:24] sentiment unit so on this slide i've just set up an sst experiment using
[00:08:27] just set up an sst experiment using sst.experiment
[00:08:29] sst.experiment from that sst module
[00:08:32] from that sst module my feature representations are going to
[00:08:34] my feature representations are going to be essentially a bag of words and i've
[00:08:35] be essentially a bag of words and i've filtered off stop words to make this a
[00:08:37] filtered off stop words to make this a little more interpretable
[00:08:39] little more interpretable and our classifier is a torch shallow
[00:08:41] and our classifier is a torch shallow neural classifier
[00:08:43] neural classifier i run the experiment and a lot of
[00:08:44] i run the experiment and a lot of information about that experiment you'll
[00:08:46] information about that experiment you'll recall is stored in this variable
[00:08:48] recall is stored in this variable experiment
[00:08:51] here i extract the model from that
[00:08:54] here i extract the model from that that experiment report
[00:08:55] experiment report and here we get a bunch of other
[00:08:57] and here we get a bunch of other metadata that we're going to use to run
[00:08:58] metadata that we're going to use to run the integrated gradients method the
[00:09:00] the integrated gradients method the feature representations of our test
[00:09:02] feature representations of our test examples the actual labels
[00:09:05] examples the actual labels and the predicted labels along with the
[00:09:07] and the predicted labels along with the feature names and the one thing to note
[00:09:09] feature names and the one thing to note here is that for the sake of captain we
[00:09:11] here is that for the sake of captain we need to turn the string class names into
[00:09:14] need to turn the string class names into their corresponding indices and that's
[00:09:15] their corresponding indices and that's what's happening in cell line here
[00:09:18] what's happening in cell line here then we set up the integrated gradients
[00:09:20] then we set up the integrated gradients using the forward method for our model
[00:09:22] using the forward method for our model and we set up the baseline which is that
[00:09:24] and we set up the baseline which is that all zeros vector and then finally use
[00:09:27] all zeros vector and then finally use the attribute method and here i'm taking
[00:09:29] the attribute method and here i'm taking attributions with respect to the
[00:09:31] attributions with respect to the predictions of the model
[00:09:34] predictions of the model i think this can be a powerful device
[00:09:36] i think this can be a powerful device for doing some simple error analysis and
[00:09:38] for doing some simple error analysis and that's what i've set up on the slide
[00:09:39] that's what i've set up on the slide here i've offered two functions error
[00:09:41] here i've offered two functions error analysis and create attribution lookup
[00:09:44] analysis and create attribution lookup that will help you understand how
[00:09:45] that will help you understand how features in this model are relating to
[00:09:47] features in this model are relating to its output predictions you can see in
[00:09:50] its output predictions you can see in cell 14 here i'm looking for cases where
[00:09:52] cell 14 here i'm looking for cases where the actual label is neutral and the
[00:09:53] the actual label is neutral and the model predicted positive we can find
[00:09:56] model predicted positive we can find those attributions and this is actually
[00:09:57] those attributions and this is actually an informative picture here it looks
[00:10:00] an informative picture here it looks like the model has overfit features like
[00:10:03] like the model has overfit features like the period and the comma those ought to
[00:10:05] the period and the comma those ought to be indicative of the neutral category
[00:10:07] be indicative of the neutral category but here it's using them in ways that
[00:10:09] but here it's using them in ways that lead to a positive prediction so that's
[00:10:11] lead to a positive prediction so that's something that we might want to address
[00:10:13] something that we might want to address and we can go one level further if we
[00:10:15] and we can go one level further if we choose and look at individual examples
[00:10:17] choose and look at individual examples so here i've pulled out an individual
[00:10:19] so here i've pulled out an individual example no one goes unindicted here
[00:10:21] example no one goes unindicted here which is probably for the best this is a
[00:10:23] which is probably for the best this is a case where the correct label is neutral
[00:10:26] case where the correct label is neutral and our model predicted positive and i
[00:10:28] and our model predicted positive and i think the attributions again help us
[00:10:30] think the attributions again help us understand why because by far the
[00:10:32] understand why because by far the feature with the highest attribution is
[00:10:34] feature with the highest attribution is this best one
[00:10:36] this best one and this is revealing that the model
[00:10:37] and this is revealing that the model just does not understand the context in
[00:10:40] just does not understand the context in which the word best is used in this
[00:10:42] which the word best is used in this example that might point to a
[00:10:44] example that might point to a fundamental weakness of the bag of words
[00:10:46] fundamental weakness of the bag of words approach
[00:10:49] for my second example let's connect with
[00:10:51] for my second example let's connect with transformer models since i assume that a
[00:10:53] transformer models since i assume that a lot of you will be working with these
[00:10:54] lot of you will be working with these models and these present exciting new
[00:10:56] models and these present exciting new opportunities for feature attribution
[00:10:58] opportunities for feature attribution because in these models we have so many
[00:11:01] because in these models we have so many representations that we could think
[00:11:03] representations that we could think about doing attributions for here's a
[00:11:05] about doing attributions for here's a kind of general picture of a vert-like
[00:11:07] kind of general picture of a vert-like model where i have the outputs up here
[00:11:10] model where i have the outputs up here you have many layers of transformer
[00:11:12] you have many layers of transformer block outputs those are given in purple
[00:11:14] block outputs those are given in purple then probably an embedding layer in
[00:11:16] then probably an embedding layer in green and that embedding layer might be
[00:11:19] green and that embedding layer might be itself composed of like a word embedding
[00:11:21] itself composed of like a word embedding layer and a positional embedding layer
[00:11:24] layer and a positional embedding layer and maybe others
[00:11:25] and maybe others all of these layers are potential
[00:11:27] all of these layers are potential targets for integrated gradients and cap
[00:11:30] targets for integrated gradients and cap them again makes that relatively easy
[00:11:33] them again makes that relatively easy so to start this off i just downloaded
[00:11:35] so to start this off i just downloaded from hugging face a
[00:11:36] from hugging face a roberta based twitter sentiment model
[00:11:39] roberta based twitter sentiment model that seemed really interesting and i
[00:11:41] that seemed really interesting and i wrote a predict one prabha method that
[00:11:42] wrote a predict one prabha method that will help us with the error analysis
[00:11:45] will help us with the error analysis that we want to do
[00:11:47] that we want to do this next step here does the encodings
[00:11:49] this next step here does the encodings of both the actual example using the
[00:11:52] of both the actual example using the models tokenizer as well as the baseline
[00:11:55] models tokenizer as well as the baseline of all zeros that we'll use for
[00:11:56] of all zeros that we'll use for comparisons
[00:11:58] comparisons in cell seven i've just designed a small
[00:12:00] in cell seven i've just designed a small custom forward method to help captain
[00:12:02] custom forward method to help captain out because this model has slightly
[00:12:04] out because this model has slightly different output structure than is
[00:12:06] different output structure than is expected
[00:12:09] expected here in cell 8 we set up the layer that
[00:12:11] here in cell 8 we set up the layer that we want to target as you can see i'm
[00:12:12] we want to target as you can see i'm targeting the embedding layer but many
[00:12:15] targeting the embedding layer but many other layers could be targeted captain
[00:12:17] other layers could be targeted captain makes that easy
[00:12:18] makes that easy for our example we use this is
[00:12:20] for our example we use this is illuminating which i'll take to have
[00:12:22] illuminating which i'll take to have true class positive
[00:12:24] true class positive we do our encodings in cell 11 of both
[00:12:26] we do our encodings in cell 11 of both the actual example and the baseline and
[00:12:28] the actual example and the baseline and then that's the basis for our
[00:12:30] then that's the basis for our attribution of this single example
[00:12:32] attribution of this single example now for burnt because we have high
[00:12:36] now for burnt because we have high dimensional representations for each one
[00:12:38] dimensional representations for each one of the tokens that we're looking at we
[00:12:40] of the tokens that we're looking at we need to perform another layer of
[00:12:41] need to perform another layer of compression that we didn't have to for
[00:12:43] compression that we didn't have to for the feed forward example as you can see
[00:12:45] the feed forward example as you can see here the attributions have for one
[00:12:47] here the attributions have for one example
[00:12:48] example dimensionality 6 by 768 this is one
[00:12:52] dimensionality 6 by 768 this is one vector per word token
[00:12:54] vector per word token to summarize those at the level of
[00:12:55] to summarize those at the level of individual word tokens we'll just sum
[00:12:57] individual word tokens we'll just sum them up and then z-score normalize them
[00:13:00] them up and then z-score normalize them to kind of put them on a consistent
[00:13:01] to kind of put them on a consistent scale so that will reduce the
[00:13:03] scale so that will reduce the attributions down to one per subword
[00:13:06] attributions down to one per subword token
[00:13:08] token and that feeds into our final kind of
[00:13:10] and that feeds into our final kind of cumulative analysis so we'll do the
[00:13:12] cumulative analysis so we'll do the probabilistic predictions
[00:13:14] probabilistic predictions we'll get the actual class
[00:13:16] we'll get the actual class convert the input to something that
[00:13:18] convert the input to something that captain can digest and then use this
[00:13:20] captain can digest and then use this visualization data record method to
[00:13:22] visualization data record method to bring this all together into a nice
[00:13:24] bring this all together into a nice tabular visualization and that's what's
[00:13:27] tabular visualization and that's what's happening here you can see for our
[00:13:28] happening here you can see for our example we have the true label the
[00:13:30] example we have the true label the predicted label with the associated
[00:13:32] predicted label with the associated probability
[00:13:33] probability and then the really interesting part
[00:13:35] and then the really interesting part per word token we have a summary of its
[00:13:38] per word token we have a summary of its attributions and you can see that green
[00:13:40] attributions and you can see that green is associated with positive white with
[00:13:42] is associated with positive white with neutral and red with negative and this
[00:13:44] neutral and red with negative and this is giving us a reassuring picture about
[00:13:46] is giving us a reassuring picture about the systematicity of these predictions
[00:13:49] the systematicity of these predictions it's a positive prediction and most of
[00:13:51] it's a positive prediction and most of that is the result of the word
[00:13:52] that is the result of the word illuminating and the exclamation mark
[00:13:55] illuminating and the exclamation mark and that kind of feeds into a nice kind
[00:13:57] and that kind of feeds into a nice kind of error analysis slash challenge
[00:14:00] of error analysis slash challenge analysis that you can do of models like
[00:14:02] analysis that you can do of models like this using captum for this slide here
[00:14:04] this using captum for this slide here i've posed a little challenge or
[00:14:06] i've posed a little challenge or adversarial test to see how deeply my
[00:14:09] adversarial test to see how deeply my model understands sentences like they
[00:14:11] model understands sentences like they said it would be great and they were
[00:14:13] said it would be great and they were right
[00:14:14] right you can see it makes the correct
[00:14:15] you can see it makes the correct prediction in that case and when i
[00:14:16] prediction in that case and when i change it to they said it would be great
[00:14:18] change it to they said it would be great and they were wrong it predicts negative
[00:14:21] and they were wrong it predicts negative that's reassuring and so are the feature
[00:14:23] that's reassuring and so are the feature attributions it seems to be keying into
[00:14:25] attributions it seems to be keying into exactly the pieces of information that i
[00:14:27] exactly the pieces of information that i would hope and even doing it in a
[00:14:29] would hope and even doing it in a context-sensitive way
[00:14:31] context-sensitive way for the next two examples i just change
[00:14:33] for the next two examples i just change up the syntax to see whether it's kind
[00:14:35] up the syntax to see whether it's kind of over fit to the position of these
[00:14:37] of over fit to the position of these words in the string and it again looks
[00:14:39] words in the string and it again looks robust they were right to say that it
[00:14:41] robust they were right to say that it would be great
[00:14:42] would be great prediction of positive they were wrong
[00:14:43] prediction of positive they were wrong to say that it would be great prediction
[00:14:45] to say that it would be great prediction of negative very reassuring as is the
[00:14:48] of negative very reassuring as is the second to last example they said it
[00:14:50] second to last example they said it would be stellar and they were correct
[00:14:52] would be stellar and they were correct the only disappointing thing in this
[00:14:54] the only disappointing thing in this challenge problem is for this final
[00:14:56] challenge problem is for this final example it predicts neutral for they
[00:14:58] example it predicts neutral for they said it would be stellar and they were
[00:15:00] said it would be stellar and they were incorrect and the attributions are also
[00:15:02] incorrect and the attributions are also a little bit worrisome about the extent
[00:15:04] a little bit worrisome about the extent to which the model has truly understood
[00:15:06] to which the model has truly understood this example
[00:15:08] this example uh maybe we can think about how to
[00:15:10] uh maybe we can think about how to address that problem but the fundamental
[00:15:12] address that problem but the fundamental takeaway for now is simply that you can
[00:15:14] takeaway for now is simply that you can see how you can use feature attribution
[00:15:16] see how you can use feature attribution together with challenge examples to kind
[00:15:19] together with challenge examples to kind of hone in on exactly how systematic a
[00:15:22] of hone in on exactly how systematic a model model's predictions are for an
[00:15:24] model model's predictions are for an interesting class of cases
Lecture 054
Overview of Methods and Metrics | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=r9ohMetEMfQ
---
Transcript
[00:00:05] welcome everyone this is the first
[00:00:06] welcome everyone this is the first screencast in our series on methods and
[00:00:08] screencast in our series on methods and metrics fundamentally what we're trying
[00:00:10] metrics fundamentally what we're trying to do with this unit is give you help
[00:00:12] to do with this unit is give you help with your projects and specifically
[00:00:14] with your projects and specifically give you help with the experimental
[00:00:16] give you help with the experimental aspects of your project and so the kind
[00:00:18] aspects of your project and so the kind of highlight topics for us will be
[00:00:20] of highlight topics for us will be around things like managing your data
[00:00:21] around things like managing your data set for the purposes of conducting
[00:00:23] set for the purposes of conducting experiments
[00:00:24] experiments establishing baseline systems and in
[00:00:27] establishing baseline systems and in turn doing model comparisons between
[00:00:29] turn doing model comparisons between baselines and an original system or
[00:00:31] baselines and an original system or between an original system and published
[00:00:33] between an original system and published results in the literature and so forth
[00:00:36] results in the literature and so forth and relatedly we're going to give you
[00:00:37] and relatedly we're going to give you some advice on how to optimize your
[00:00:39] some advice on how to optimize your models effectively those are kind of the
[00:00:41] models effectively those are kind of the highlight topics there and i would say
[00:00:42] highlight topics there and i would say that all of this is kind of oriented
[00:00:44] that all of this is kind of oriented toward the more abstract topic
[00:00:47] toward the more abstract topic of helping you navigate tricky
[00:00:48] of helping you navigate tricky situations that arise as you conduct
[00:00:51] situations that arise as you conduct experiments in nlp and as you'll see
[00:00:53] experiments in nlp and as you'll see very often these tricky situations arise
[00:00:55] very often these tricky situations arise because we encounter limitations in the
[00:00:57] because we encounter limitations in the data that's available to us or we're
[00:00:59] data that's available to us or we're just fundamentally constrained in terms
[00:01:01] just fundamentally constrained in terms of computing resources
[00:01:03] of computing resources and that leads us to have to make some
[00:01:05] and that leads us to have to make some compromises in the ideal experimental
[00:01:07] compromises in the ideal experimental protocol that we would use
[00:01:09] protocol that we would use these things are inevitable and the idea
[00:01:10] these things are inevitable and the idea here is that we're going to equip you
[00:01:12] here is that we're going to equip you with some tools and techniques for
[00:01:13] with some tools and techniques for thinking about the trade-offs and making
[00:01:15] thinking about the trade-offs and making your way through all of these tricky
[00:01:17] your way through all of these tricky situations
[00:01:19] situations there are a bunch of associated
[00:01:20] there are a bunch of associated materials for these screencasts we have
[00:01:22] materials for these screencasts we have a whole notebook that's on metrics i'm
[00:01:24] a whole notebook that's on metrics i'm going to offer some screencasts that
[00:01:26] going to offer some screencasts that just highlight a few of the metrics that
[00:01:28] just highlight a few of the metrics that are discussed in that notebook
[00:01:30] are discussed in that notebook but it's meant as a resource the
[00:01:31] but it's meant as a resource the notebook itself so that you could pursue
[00:01:34] notebook itself so that you could pursue other avenues and overall what i'm
[00:01:35] other avenues and overall what i'm trying to do is give you a framework for
[00:01:37] trying to do is give you a framework for thinking about what metrics encode in
[00:01:39] thinking about what metrics encode in terms of their values what bounds they
[00:01:41] terms of their values what bounds they have and where they can be applied and
[00:01:43] have and where they can be applied and misapplied
[00:01:46] misapplied scikit learn implements essentially all
[00:01:48] scikit learn implements essentially all of the metrics that we'll be discussing
[00:01:50] of the metrics that we'll be discussing and to their credit they've done a
[00:01:52] and to their credit they've done a wonderful job of offering rich
[00:01:53] wonderful job of offering rich documentation that will again help you
[00:01:55] documentation that will again help you not only understand what the metrics do
[00:01:57] not only understand what the metrics do but also where and how they can be
[00:01:59] but also where and how they can be effectively applied
[00:02:01] effectively applied and then there is an entire notebook
[00:02:02] and then there is an entire notebook that's on methods especially
[00:02:04] that's on methods especially experimental methods and that covers a
[00:02:06] experimental methods and that covers a lot of the tricky situations that i just
[00:02:08] lot of the tricky situations that i just described in terms of setting up
[00:02:10] described in terms of setting up experiments and thinking about
[00:02:11] experiments and thinking about trade-offs and then following through on
[00:02:13] trade-offs and then following through on model evaluation and so forth and that
[00:02:15] model evaluation and so forth and that notebook is nice as a supplement to
[00:02:17] notebook is nice as a supplement to these screencasts because it embeds a
[00:02:19] these screencasts because it embeds a bunch of code that could help you run
[00:02:20] bunch of code that could help you run hands-on experiments to get a feel for
[00:02:22] hands-on experiments to get a feel for the core concepts
[00:02:24] the core concepts and we have two readings resnick inland
[00:02:26] and we have two readings resnick inland 2010 is a wonderful overview of
[00:02:29] 2010 is a wonderful overview of experimental evaluations in the context
[00:02:31] experimental evaluations in the context of nlp and smith 2011 appendix b is a
[00:02:34] of nlp and smith 2011 appendix b is a compendium of different metrics so
[00:02:35] compendium of different metrics so another good resource for you if you're
[00:02:37] another good resource for you if you're unsure about how a metric works or what
[00:02:39] unsure about how a metric works or what it what its bounds are or how it's
[00:02:39] what its bounds are or how it's
[00:02:43] calculated and things like that the final thing i want to say for this
[00:02:45] the final thing i want to say for this overview relates specifically to the
[00:02:47] overview relates specifically to the projects that you'll be pursuing
[00:02:49] projects that you'll be pursuing the bottom line for us is that we will
[00:02:51] the bottom line for us is that we will never evaluate a project based on how
[00:02:53] never evaluate a project based on how good the results are
[00:02:56] good the results are now we acknowledge that in the field and
[00:02:58] now we acknowledge that in the field and throughout science publication venues do
[00:03:01] throughout science publication venues do this because they have additional
[00:03:02] this because they have additional constraints on space nominally and that
[00:03:05] constraints on space nominally and that leads them as a cultural fact about the
[00:03:07] leads them as a cultural fact about the way science works to favor positive
[00:03:09] way science works to favor positive evidence for new developments over
[00:03:11] evidence for new developments over negative results i frankly think this is
[00:03:13] negative results i frankly think this is unfortunate and exerts a kind of
[00:03:15] unfortunate and exerts a kind of distorting influence on the
[00:03:17] distorting influence on the set of publications that we all get to
[00:03:19] set of publications that we all get to study
[00:03:20] study but nonetheless that's the way the world
[00:03:21] but nonetheless that's the way the world works at present
[00:03:23] works at present in the context of this course we are not
[00:03:25] in the context of this course we are not subject to that constraint so we can do
[00:03:27] subject to that constraint so we can do the right and good thing scientifically
[00:03:29] the right and good thing scientifically of valuing positive results negative
[00:03:31] of valuing positive results negative results and everything in between
[00:03:35] results and everything in between so i repeat our core value here we will
[00:03:37] so i repeat our core value here we will never evaluate a project based on how
[00:03:39] never evaluate a project based on how good the results are instead we're going
[00:03:41] good the results are instead we're going to evaluate your project on the
[00:03:43] to evaluate your project on the appropriateness of the metrics that you
[00:03:44] appropriateness of the metrics that you choose
[00:03:45] choose the strength of your methods
[00:03:47] the strength of your methods and really fundamentally here the extent
[00:03:49] and really fundamentally here the extent to which your paper is open and
[00:03:51] to which your paper is open and clear-sighted about the limits of its
[00:03:53] clear-sighted about the limits of its findings so you'll notice that given
[00:03:55] findings so you'll notice that given this framework here you could report
[00:03:56] this framework here you could report state-of-the-art results world
[00:03:58] state-of-the-art results world record-breaking results on a task but
[00:04:01] record-breaking results on a task but nonetheless not succeed with the project
[00:04:03] nonetheless not succeed with the project if it fails on all of these things that
[00:04:04] if it fails on all of these things that we've listed under our true values and
[00:04:06] we've listed under our true values and conversely
[00:04:08] conversely you might have a hypothesis that turns
[00:04:10] you might have a hypothesis that turns out to be a miserable failure in terms
[00:04:12] out to be a miserable failure in terms of the performance metrics that you're
[00:04:14] of the performance metrics that you're able to report but that could lead to an
[00:04:16] able to report but that could lead to an outstanding grade in the context of this
[00:04:18] outstanding grade in the context of this course provided that you do all of these
[00:04:20] course provided that you do all of these things and that would be under the
[00:04:22] things and that would be under the heading of a negative result that
[00:04:24] heading of a negative result that nonetheless teaches us something really
[00:04:26] nonetheless teaches us something really fundamental and important
[00:04:28] fundamental and important about nlp and therefore pushes the field
[00:04:31] about nlp and therefore pushes the field forward
Lecture 055
Classifier Metrics | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=0RW-aV93Rns
---
Transcript
[00:00:05] welcome back everyone this is part two
[00:00:06] welcome back everyone this is part two in our series on methods and metrics
[00:00:08] in our series on methods and metrics we're going to be talking about
[00:00:09] we're going to be talking about classifier metrics i'm sort of assuming
[00:00:12] classifier metrics i'm sort of assuming that the metrics i'll be discussing are
[00:00:13] that the metrics i'll be discussing are broadly familiar to us
[00:00:15] broadly familiar to us i think that's a chance for us to step
[00:00:17] i think that's a chance for us to step back and be reflective about what values
[00:00:20] back and be reflective about what values these familiar metrics actually encode
[00:00:22] these familiar metrics actually encode because that really is the name of the
[00:00:24] because that really is the name of the game here no matter what kind of task
[00:00:26] game here no matter what kind of task you're working on or what the structure
[00:00:27] you're working on or what the structure of your model is like it's just
[00:00:29] of your model is like it's just fundamentally true that different
[00:00:31] fundamentally true that different evaluation metrics will encode different
[00:00:33] evaluation metrics will encode different values different goals you have for your
[00:00:35] values different goals you have for your system and different kinds of hypothesis
[00:00:35] system and different kinds of hypotheses
[00:00:40] that you might be pursuing you could hear in that that really fundamentally
[00:00:42] hear in that that really fundamentally choosing a metric is a crucial aspect to
[00:00:44] choosing a metric is a crucial aspect to any kind of experimental work it's a
[00:00:46] any kind of experimental work it's a fundamental step in how we
[00:00:48] fundamental step in how we operationalize hypotheses in terms of
[00:00:51] operationalize hypotheses in terms of data and models and model comparisons
[00:00:54] data and models and model comparisons as a result you should feel free for
[00:00:56] as a result you should feel free for whatever task you're working on to
[00:00:58] whatever task you're working on to motivate new metrics or specific uses of
[00:01:01] motivate new metrics or specific uses of existing metrics depending on what your
[00:01:03] existing metrics depending on what your actual goals for your experiments
[00:01:05] actual goals for your experiments actually are
[00:01:07] actually are relatedly for established tasks you'll
[00:01:09] relatedly for established tasks you'll probably feel some pressure to use
[00:01:11] probably feel some pressure to use specific well-established metrics but
[00:01:14] specific well-established metrics but you should always as a scientist feel
[00:01:16] you should always as a scientist feel empowered to push back if you feel that
[00:01:18] empowered to push back if you feel that the accepted metrics
[00:01:20] the accepted metrics are not reflective of your hypothesis or
[00:01:23] are not reflective of your hypothesis or are distorting our notions of progress
[00:01:25] are distorting our notions of progress somehow because remember
[00:01:27] somehow because remember areas of research can stagnate due to
[00:01:29] areas of research can stagnate due to poor metrics and so we have to be
[00:01:31] poor metrics and so we have to be vigilant we have to be on the lookout
[00:01:32] vigilant we have to be on the lookout for cases in which the metrics we've
[00:01:34] for cases in which the metrics we've accepted might be at odds with the
[00:01:37] accepted might be at odds with the actual goals we have for the research
[00:01:39] actual goals we have for the research we're doing
[00:01:42] Let's begin our discussion of classifier metrics by talking about confusion matrices, a pretty fundamental data structure for a lot of the calculations that we'll perform. By convention, my confusion matrices will have the actual labels going across the rows, and across the columns the predictions from some classifier model. So you can see in this confusion matrix that there were 15 cases in which the model predicted positive and the actual label was positive, whereas there are 10 cases where the actual label was positive and the model predicted negative, and so forth for the other values in this table.

[00:02:19] I think that seems familiar; it's something we can take for granted. But we should remember that behind the scenes here a threshold was imposed in order to create these categorical predictions. By and large, the classifier models that we use today predict probability distributions over the labels, and so in order to create an actual categorical prediction we decided, for example, that the label with the maximum probability would be the true one, and the result of that decision was used to aggregate this table. Of course, different choices of that threshold might give very different results, and there might be contexts in which we want to explore the full range of probabilistic predictions. That's something I'll return to at the end of the screencast.

[00:03:02] A final note about this: it can be helpful, in the context of confusion matrices, to add a column of what's called support, which is simply the number of actual true instances that fall into each class. So there are 125 positive instances in this corpus, 35 negative, and over a thousand that fall in the neutral category, and that's already illuminating about how specific metrics might deal with that extremely imbalanced vector of support values.
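To make the row/column convention concrete, here is a minimal sketch in Python of building a confusion matrix, with actual labels on the rows and predictions on the columns, plus the support values. The tiny label lists are hypothetical, not the lecture's data.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    # counts[(t, p)] = number of examples with actual label t and predicted
    # label p; rows are actual labels, columns are predictions, as in the
    # lecture's convention
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def support(matrix):
    # support for each class = number of actual instances = the row sum
    return [sum(row) for row in matrix]

# a tiny hypothetical example (not the table from the slides)
y_true = ["pos", "pos", "neg", "neutral", "neutral", "neutral"]
y_pred = ["pos", "neg", "neg", "neutral", "neutral", "pos"]
labels = ["pos", "neg", "neutral"]
cm = confusion_matrix(y_true, y_pred, labels)
```

With actual labels on the rows, support falls out as the row sums, which is why it is natural to display it as an extra column alongside the matrix.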
[00:03:31] Let's start with accuracy, by far the most famous and familiar of all the classifier metrics. Accuracy is simply the number of correct predictions divided by the total number of examples. In terms of our confusion matrices, that is just the sum of all the values along the diagonal divided by the sum of all the values in the table. The bounds are zero and one, of course, with zero the worst and one the best. In terms of the value encoded by accuracy, I would say it's an attempt to answer the question: how often is the system correct?

[00:04:04] And that kind of feeds into the weaknesses here. First, there's no per-class notion of accuracy, not directly; we just get a single holistic number. And relatedly, there is just a complete failure to control for class size. You can see, for example, in this confusion matrix that performance on the neutral class will completely dominate the accuracy values, to the point where, no matter how much progress we make on the positive and negative classes, because they are so much smaller in terms of their support than neutral, that kind of progress is unlikely to be reflected in our accuracy values. That's why, if you return to the value encoded, you can see that at a raw, fundamental level it is simply answering: how often is the system correct?
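The diagonal-over-total calculation can be written directly against a confusion matrix. The imbalanced matrix below is hypothetical, chosen only to echo the lecture's supports of roughly 125, 35, and 1000; it is not the actual table from the slides.

```python
def accuracy(matrix):
    # correct predictions sit on the diagonal; divide by the total count
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# hypothetical matrix: rows = actual, columns = predicted,
# classes in order positive, negative, neutral (supports 125, 35, 1005)
cm = [[15, 10, 100],
      [10, 15, 10],
      [3, 2, 1000]]
```

Because the neutral diagonal alone contributes 1000 of the 1165 examples, accuracy stays high here no matter how the positive and negative rows turn out, which is the class-size weakness the lecture describes.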
[00:04:52] Another thing to keep in mind is that for many classifier models, the loss is what's called the cross-entropy loss; it's also called the log loss, as in scikit-learn. That value is inversely related to accuracy. The takeaway is that even as we might choose other metrics to compare and evaluate models, we should keep in mind that our classifiers themselves are kind of engines for trying to maximize accuracy, and so they are likely to inherit whatever properties and values and strengths and weaknesses are inherent in the accuracy calculation, which, as we'll see, could be at odds with our actual values for the system that we're developing.
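As a reminder of what that loss computes, here is a minimal sketch of the cross-entropy (log) loss for categorical predictions: the mean negative log probability assigned to each true label. The function name and the dictionary-of-probabilities representation are illustrative choices, not any particular library's API.

```python
import math

def cross_entropy_loss(y_true, probs):
    # mean negative log probability that the model assigned to the true label;
    # lower is better, and confidently wrong predictions are punished hard
    return -sum(math.log(p[y]) for y, p in zip(y_true, probs)) / len(y_true)

# a model that assigns probability 0.5 to each true label
loss = cross_entropy_loss(
    ["pos", "neg"],
    [{"pos": 0.5, "neg": 0.5}, {"pos": 0.5, "neg": 0.5}],
)
```

Unlike accuracy, this rewards calibrated confidence: predicting the right label at probability 0.9 scores better than at 0.51, even though both count the same for accuracy.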
[00:05:38] And that kind of feeds nicely into precision, recall, and F-scores, which are attempts to make up for some of the weaknesses that you see in accuracy. We'll start with precision. This is a per-class notion: for a class k, it's the correct predictions for k divided by the sum of all the guesses for k that were made by your model. In terms of this confusion matrix, if we focus on the positive class, the numerator is the number of correct predictions for that class, divided by the sum of all the values in this column. For the negative class we would repeat that: the numerator would be 15, and we would sum over the column. Finally, for neutral, the numerator would be a thousand, and we would again sum over this column. That leads to the vector of precision values that you see along the bottom here.

[00:06:26] The bounds of precision are zero and one, approximately, with zero the worst and one the best. There is an important caveat here, though: precision is technically undefined in situations where a model makes no predictions for a given class, because in that situation you're dividing by zero. It is common practice to map those cases to zero, but we should keep in mind that we are making that extra decision.

[00:06:50] The value encoded is a kind of conservative one: we're going to penalize incorrect guesses for a certain class. So you can imagine that a failure mode there is to just rarely guess a certain class. That is the core weakness: you can achieve high precision for a class k simply by rarely guessing k. So we'll obviously need to offset that with some other pressure, and by and large the offsetting pressure is recall.
[00:07:17] Recall is again a per-class notion: for a class k, it's the correct predictions for k divided by the sum of all the true members of k. So now we operate row-wise. If we focus on the positive class, our numerator is 15, the number of true predictions for the positive class, divided by the sum of all the values along the row, that is, all the true members of that class. For positive, that gives us a recall value of 0.12, and we can repeat that for the other two rows.

[00:07:47] The bounds are zero and one, with zero the worst and one the best. The value encoded is a permissive one: we want to penalize missed true cases. We would like to make a lot of predictions about a class in order to avoid leaving any out, so to speak. And that leads into the core weakness: we can achieve high recall for a class k simply by always guessing k. Never mind the mistakes; as long as we get all the actual cases into our predictions, we're doing well by recall. You can hear in that that it's important to offset this pressure with something else, and standardly that something else is precision.
[00:08:23] The way we offset these two pressures is typically with F-scores. F-scores are a harmonic mean of the precision and recall scores. It's again a per-class notion, and it has this weighting value beta; if we want to evenly balance precision and recall, then we set beta to 1. So here's that confusion matrix again, and along this column I've given the per-class F1 values.

[00:08:48] The bounds are 0 and 1 as before, with 0 the worst and 1 the best, and you can count on the fact that the F1 score for a class will always fall between the precision and recall values for that class, because it's a kind of average: it's the harmonic mean.

[00:09:02] What's the value encoded? The best way I can say this is that we're essentially trying to answer the question: for a given class k, how much do predictions for k align with true instances of k? That is, we're aligning with both precision and recall as pressures, and then we can use the beta value to control how much weight we place on precision versus recall.

[00:09:24] What are the weaknesses of F-scores? Well, I can really think of two. The first is that there's no normalization for the size of the dataset, because of the way we use the row and column sums as denominators. And relatedly, for a given class that we decide to focus on, we actually ignore most of the data in the table. Consider that if we decide to calculate the F1 score for the positive class, we pay attention to these column values and these row values, but we completely ignore these four values here; they're just not involved in the calculation at all. As a result, the positive-class F1 score might give a distorted picture of what the model's predictions are actually like, in virtue of the fact that it leaves out so much of the data, as you can see.
[00:10:10] Now, because F-scores are a per-class notion, I think that's useful in the sense that it gives us a perspective on each one of the classes separately. But for many kinds of model evaluations we need a summary number, a single number that we can use to compare models and assess overall progress. So we're going to do some kind of averaging, and I'd like to offer you three ways that we might average these F-scores: macro-averaging, weighted averaging, and micro-averaging. As you'll see, these encode quite different values about how we want to think about the F-scores.

[00:10:42] Macro-averaging is a kind of averaging that we've done at various points throughout the quarter. It is simply the arithmetic mean of all the per-category F1 scores, so it's just the mean of the values along this column. Its bounds are zero and one, with zero the worst and one the best. What value does it encode? Well, it's the same values that we get from F-scores, plus the additional and non-trivial assumption that all of the classes are equal, regardless of size differences between them.

[00:11:13] And that kind of feeds into the weaknesses here. A classifier that does well only on the small classes might not actually do well in the real world. If you imagine, counterfactually, that for our given model we had really outstanding F1 scores for positive and negative and really low ones for neutral, that might be quite at odds with how this classifier would behave in the world, assuming that most of the examples streaming in are in the neutral category. Relatedly, a classifier that does well only on large classes might do poorly on the small but nonetheless vital classes in our data. That just reflects the fact that, very often in NLP, it's the small classes that are the most precious, the ones we care about the most, and we're not reflecting that kind of asymmetry in our values by simply taking the average of all these F-scores.
[00:12:05] Weighted-average F-scores will give a very different perspective on model performance. In this case we are again just going to take an average of the F1 scores, but now weighted by the amount of support for each one of the classes. That again has bounds zero to one, with zero the worst and one the best. The value encoded is the same as the values we get for the F-scores, but now with the added assumption that the size of the classes, the amount of support, really does matter.

[00:12:32] And that's going to feed into the weaknesses. The fundamental thing here is that large classes will dominate: just as with accuracy, the larger a class is, the more it will contribute to our overall summary number. That can lead to the kind of problematic situation where the small classes are just not relevant to the evaluation metric. That could reflect your values, because if what you really care about is the raw rate of correct predictions, you might want to weight the larger classes more heavily. But again, for many contexts in NLP we really care about how much progress we can make on the small but nonetheless important classes, and in those contexts weighted averaging is probably not the right choice.
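The two averaging schemes differ only in their weights, which a short sketch makes concrete. The per-class F1 scores and supports below are hypothetical, chosen so that a model that shines on the two small classes looks strong under macro-averaging and weak under weighted averaging.

```python
def macro_f1(f1_scores):
    # arithmetic mean: every class counts equally, regardless of size
    return sum(f1_scores) / len(f1_scores)

def weighted_f1(f1_scores, supports):
    # mean weighted by class support: big classes dominate the summary
    total = sum(supports)
    return sum(f * s for f, s in zip(f1_scores, supports)) / total

# hypothetical per-class F1 values (positive, negative, neutral)
f1s = [0.9, 0.8, 0.2]
supports = [125, 35, 1000]
```

Here the macro average is about 0.63 while the weighted average is about 0.29, so which of two models "wins" a comparison can flip depending on the averaging scheme you pick.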
[00:13:14] not the right choice the final averaging scheme that i'd like
[00:13:16] the final averaging scheme that i'd like to consider is micro averaged f-scores
[00:13:18] to consider is micro averaged f-scores this will be very similar to weighted
[00:13:20] this will be very similar to weighted averaging of f1 scores and is directly
[00:13:23] averaging of f1 scores and is directly connected to accuracy
[00:13:25] connected to accuracy the way this works is a little bit
[00:13:27] the way this works is a little bit involved we start with this core
[00:13:29] involved we start with this core confusion matrix and we're going to
[00:13:31] confusion matrix and we're going to break it down into three
[00:13:33] break it down into three smaller confusion matrices one per class
[00:13:36] smaller confusion matrices one per class so you can see this one on the left here
[00:13:38] so you can see this one on the left here is for the positive class the yeses are
[00:13:40] is for the positive class the yeses are 15 and the no's are the sum of these two
[00:13:43] 15 and the no's are the sum of these two values here along this row
[00:13:46] values here along this row the nose or the 20 which is the sum of
[00:13:48] the nose or the 20 which is the sum of these two values and then this no no
[00:13:50] these two values and then this no no category is all the remaining data in
[00:13:52] category is all the remaining data in this quadrant here
[00:13:54] this quadrant here we repeat that same procedure for the
[00:13:56] we repeat that same procedure for the negative class and for the neutral class
[00:13:59] negative class and for the neutral class and then we simply sum up those three
[00:14:01] and then we simply sum up those three smaller tables into one big yes no
[00:14:04] smaller tables into one big yes no confusion matrix and calculate the f1
[00:14:07] confusion matrix and calculate the f1 scores per category
[00:14:09] scores per category that gives us two scores here one for
[00:14:11] that gives us two scores here one for yes and one for no
[00:14:14] yes and one for no the bounds on this are zero and one with
[00:14:15] the bounds on this are zero and one with zero the worst and one the best
[00:14:18] zero the worst and one the best the value encoded is really easy to
[00:14:20] the value encoded is really easy to state
[00:14:21] Macro-averaged F1 scores for the yes category are equivalent to accuracy scores numerically, so that's identical in terms of that metric. And we have an additional problem now: we have the same kind of value reflected as we have for the weighted F scores or for accuracy, but we have brought in an additional source of uncertainty, which is that we have a number for the yes category and a number for the no category, and hence no single summary number. The convention in the literature is to focus on the yes category, but that simply brings us back to accuracy with a more involved calculation, so that's obviously not very productive. As a result, I would say the two real choices you want to make are between macro averaging and weighted averaging of your F1 scores, and again, that will come down to what your fundamental values are and what hypotheses you're pursuing.
[00:15:13] hypotheses you're pursuing the final point i want to make is that
[00:15:15] the final point i want to make is that thus far we have operated in terms of
[00:15:17] thus far we have operated in terms of the confusion matrix which involved
[00:15:19] the confusion matrix which involved imposing a threshold on probabilistic
[00:15:21] imposing a threshold on probabilistic predictions in order to create
[00:15:23] predictions in order to create categorical values that we could then
[00:15:25] categorical values that we could then compare with precision and recall and so
[00:15:27] compare with precision and recall and so forth
[00:15:28] Precision–recall curves offer a fundamentally different perspective. In this case, instead of imposing one threshold, we take every possible value predicted by our classifier as a potential threshold, and essentially create a series of confusion matrices based on that successive series of thresholds. Then we can plot the trade-off between precision, along the y-axis, and recall, along the x-axis, for all those different thresholds. That can be really illuminating in terms of helping us see how our system trades precision and recall against each other, and helping us find, based on the values we hold about our problem and our goals, what the optimal balance between precision and recall actually is. And if you do need a summary number for this entire table, average precision, which is implemented in scikit-learn, is a standard way of summarizing the entire curve with a single number, without imposing the single threshold that was so much shaping all of the previous metrics we discussed.
Lecture 056
Natural Language Generation Metrics | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=l-DERqIJjCY
---
Transcript
[00:00:05] Welcome everyone. This is part three in our series on methods and metrics. We're going to be talking about metrics for assessing natural language generation systems. We previously talked about classifier metrics, and the issues seemed relatively straightforward; as you'll see, assessment for NLG systems is considerably more difficult. Let's actually begin with those fundamental challenges. Maybe the most fundamental of all is that in natural language there is more than one effective way to say most things. The datasets we have might have one or a few good examples of how something should be said, but that's just a sample of the many ways in which we could communicate effectively, and that leaves us with fundamental open questions about what comparisons we should make and how we should assess so-called mistakes. Relatedly, there's just an open question of what we're actually trying to measure. Is it fluency, or truthfulness, or communicative effectiveness, or some blend of the three? As we think about different metrics, we might find that they capture one or a few of these and completely neglect others, and that's sure to shape the trajectory of our project and the actual goals we achieve. So we have to be really thoughtful about what we're actually trying to measure in this space.
[00:01:15] Let's begin with perplexity. I would say what perplexity has going for it is that it is at least very tightly knit to the structure of many of the models we work with in NLG. The core calculation is that, given some sequence x of length n and a probability distribution p, the perplexity of x relative to that distribution p is the product of the inverses of all the assigned probabilities, and then we take an average. There are many ways to express this calculation, and many ways to connect it with information-theoretic measures; let me defer those issues for just a second, and I'll try to build up an intuition after getting through the core calculation. So that's perplexity. Then, when we do token-level perplexity, that is, when we want to assign perplexity to individual examples, we need to normalize by the length of those examples, and we do that in log space in order to capture a kind of geometric mean, which is arguably more appropriate for comparing probability values. And if we want the perplexity for an entire corpus, we again use a geometric mean of all the token-level perplexity values, and that gives us a single quantity over an entire batch of examples.
[00:02:28] an entire batch of examples what are the properties of perplexity
[00:02:30] what are the properties of perplexity well its bounds are one to infinity with
[00:02:32] well its bounds are one to infinity with one the best so we would like to
[00:02:33] one the best so we would like to minimize
[00:02:34] minimize perplexity it is equivalent to the
[00:02:37] perplexity it is equivalent to the exponentiation of the cross entropy loss
[00:02:39] exponentiation of the cross entropy loss that's the tight connection with models
[00:02:41] that's the tight connection with models that i wanted to call out we often work
[00:02:44] that i wanted to call out we often work with language models that use across
[00:02:45] with language models that use across entropy loss and you can see that they
[00:02:47] entropy loss and you can see that they are directly optimizing for a quantity
[00:02:49] are directly optimizing for a quantity that is proportional to perplexity and
[00:02:51] that is proportional to perplexity and that can be useful as a kind of getting
[00:02:53] that can be useful as a kind of getting a direct insight into the nature of your
[00:02:55] a direct insight into the nature of your model's predictions
[00:02:57] What value does it encode? Well, I think it's simple: does the model assign high probability to the input sequences? That is, does it assign low perplexity to the input sequences? The weaknesses: there are many, actually. First, it's heavily dependent on the underlying vocabulary. To see that, imagine an edge case where we take every word in the vocabulary and map it to a single UNK token. In that case we will absolutely minimize perplexity, but our system will be useless. In that edge case you can see that I could reduce perplexity simply by changing the size of my vocabulary; that's a way that you could kind of game this metric inadvertently. As a result, we can't really make comparisons across datasets, because of course they can have different vocabularies and different intrinsic notions of perplexity. And it's even tricky to make comparisons across models; you can see that in my first weakness there. If we do compare models, we need to fix the dataset and make sure that the differences between the models are not inherently shaping the range of perplexity values that we're likely to see.
[00:04:03] Let's move on now to a family of what you might think of as n-gram-based methods for assessing NLG systems, beginning with the word error rate. The fundamental thing here will be an edit distance measure, and therefore you can see word error rate as a kind of family of measures, depending on the choice of edit distance function, which we would just plug in. The word error rate is the distance between the actual sequence and the predicted sequence, normalized by the length of the actual sequence. If we would like the word error rate for an entire corpus, it's easy to scale it up, but there's one twist here. The way it's standardly calculated, the numerator is the sum of all the distances between the actual and predicted sequences, not normalized as it was for the single-pair word error rate; the normalization that happens over the entire corpus is by the sum of the lengths of all the actual strings in the corpus. So we have one big average, as opposed to taking an average of averages.
[00:05:01] The properties of the word error rate: its bounds are zero to infinity, and we would like to minimize it, so zero is the best. The value encoded is similar to F scores: we would like to answer the question of how aligned the predicted sequence is with the actual sequence. I've invoked F scores here because, if our edit distance measure has notions of insertion and deletion, they play roles analogous to precision and recall. The weaknesses: first, we have just one reference text here. I called out before that there are often many good ways to say something, whereas here we can make only a single comparison. And also, and maybe this is more fundamental, word error rate is a very syntactic notion. Just consider comparing texts like "it was good", "it was not good", and "it was great". They're likely to have identical word error rates, even though the first two differ dramatically in their meanings, and the first and the third are actually rather similar in their meanings. That semantic notion of similarity is unlikely to be reflected in the word error rate.
[00:06:02] Let's move now to BLEU scores. This is another n-gram-based metric, but it's going to try to address the fact that we want to make comparisons against multiple human-created reference texts. It has a notion of precision in it, but it's called modified n-gram precision. Let me walk you through an example, and hopefully that will motivate it. Imagine we had a candidate that was just seven instances of the word "the". We have two reference texts, presumably written by humans: "the cat is on the mat" and "there is a cat on the mat". The modified precision takes, for the token "the", the maximum number of times that "the" occurs in any reference text, which is two, in reference one here, and divides that by the number of times "the" appears in the candidate, which is seven. That gives us 2/7 as the modified unigram precision score for this candidate.
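The clipping described above can be sketched in a few lines; this reproduces the 2/7 from the example:

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Count each candidate token only up to the maximum number of times it
    appears in any single reference (clipped counts), then divide by the
    candidate's length."""
    cand_counts = Counter(candidate)
    clipped = 0
    for token, count in cand_counts.items():
        max_ref = max(ref.count(token) for ref in references)
        clipped += min(count, max_ref)
    return clipped / len(candidate)

candidate = ["the"] * 7
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_unigram_precision(candidate, refs))  # 2/7, about 0.2857
```

Without the clipping, a degenerate candidate like this one would score a perfect unigram precision of 7/7, which is exactly the pathology modified precision blocks.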
[00:06:53] There's also a brevity penalty, which will play the role of something like recall in BLEU scoring. We have a quantity r, which is the sum of all the minimal absolute length differences between candidates and references; we have c, which is the total length of all the candidates; and then we say the brevity penalty is 1 if c is greater than r, and otherwise it's an exponential decay off of the ratio of r and c. Again, that will play kind of the role of recall. The BLEU score is then simply the product of that brevity penalty with the weighted combination of the modified n-gram precision values for each n-gram order n considered. We'd probably go one through four; that's a standard set of n-gram orders to consider. We would combine all those notions of modified n-gram precision for each n, possibly weighting them differently depending on how we want to value unigrams, bigrams, trigrams, and 4-grams.
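A simplified sentence-level sketch of the whole recipe. Note one detail glossed over in the spoken description: the standard formulation combines the n-gram precisions as a geometric mean (an exponentiated weighted sum of log precisions), and real implementations such as sacrebleu or NLTK's `bleu_score` additionally add smoothing and corpus-level pooling:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Reference-clipped n-gram precision, generalizing the unigram example."""
    cand = Counter(ngrams(candidate, n))
    if not cand:
        return 0.0
    clipped = 0
    for gram, count in cand.items():
        max_ref = max(Counter(ngrams(ref, n))[gram] for ref in references)
        clipped += min(count, max_ref)
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n=4):
    c = len(candidate)
    # r: the reference length closest to the candidate's length.
    r = min((len(ref) for ref in references), key=lambda length: abs(length - c))
    # Brevity penalty: 1 if the candidate is long enough, else exponential decay.
    bp = 1.0 if c > r else math.exp(1 - r / c)
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero precision zeroes the geometric mean
    # Uniformly weighted geometric mean of the modified precisions.
    log_avg = sum(math.log(p) for p in precisions) / max_n
    return bp * math.exp(log_avg)

cand = "the cat is on the mat".split()
refs = [cand, "there is a cat on the mat".split()]
print(bleu(cand, refs))  # 1.0: the candidate matches a reference exactly
```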
[00:07:47] So that's BLEU scoring. What are its properties? Its bounds are zero and one, with one the best, but we really have no expectation that any system will actually achieve one, because even comparisons among human translations or human-created texts will not have a BLEU score of one. The value encoded is an appropriate balance of modified precision and recall, under the guise of that brevity penalty. It's very similar to the word error rate in that sense, but it seeks to accommodate the fact that there are typically multiple suitable outputs for a given input, and that's a real strength of BLEU scoring.
[00:08:24] The weaknesses: well, this team has argued that BLEU scores just fail to correlate with human scores for translations, and that's kind of worrying, because BLEU scores were originally motivated in the context of machine translation. The issues they identify are, for example, that it's very sensitive to n-gram order in a way that human intuitions are not, and that it's insensitive to the type of the n-grams. Again, just consider comparisons like "that dog", "the dog", and "that toaster". Those will likely have very similar BLEU scores, but "that dog" and "the dog" are just inherently much more similar than "that dog" and "that toaster", in virtue of the fact that "that" versus "the" is a difference just at the level of functional vocabulary, whereas "dog" versus "toaster" is a really contentful change. And as we move into topics more closely aligned with NLU, we possibly have an even more worrying picture. This team argues that BLEU is just a fundamentally incorrect measure for assessing dialogue systems, and that could be an indicator that it's not going to be appropriate for many kinds of NLG tasks in NLU.
[00:09:31] That's just a sample of two n-gram-based metrics; I thought I'd mention a few more to give you a framework for making some comparisons. So, I mentioned the word error rate: that's fundamentally edit distance from a single reference text. BLEU, as we've seen, is modified precision plus a brevity penalty, a kind of recall notion, comparing against many reference texts. ROUGE is a recall-focused variant of BLEU that's oriented toward assessing summarization systems. METEOR is interestingly different, because it's trying to push past simple n-gram matching and capture some semantic notions: it's a unigram-based measure that does an alignment between not only exact matches of unigrams but also stemmed versions and synonyms, really trying to bring in some semantic aspects. And CIDEr is similar; it's an even more semantic notion, because it does its comparisons in vector space. It's approximately a weighted cosine similarity between TF-IDF vectors created from the corpus.
[00:10:34] Finally, in closing, I just wanted to exhort you all to think about more communication-based metrics in the context of NLU. For NLU, it's worth asking whether you can evaluate your system based on how well it actually communicates in the context of a real-world goal, as opposed to just comparing different strings that are inputs and reference texts. We've actually seen an example of that in our assignment and bake-off on color reference: we didn't really assess how well your system could reproduce the utterances that were in the corpus; rather, our fundamental notion was listener accuracy, which was keying into a communication goal. How well is your system actually able to take messages and use them to figure out what the speaker was referring to in a simple color context? For much more on that, and a perspective on a lot of these issues, I encourage you to check out this paper that was led by Ben Newman. It began as a course project for this class and grew into a really successful paper.
Lecture 057
Data Organization | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=1yLUN57_c1E
---
Transcript
[00:00:04] Welcome everyone. This is part four in our series on methods and metrics. We're going to be talking about how we organize data sets for the purposes of conducting evaluations in NLP.
[00:00:14] Let's begin with the classic train/dev/test split. This is a very common format for data sets in our field, especially for the very large publicly available ones, and it's really good in the sense that, in releasing data sets with these splits pre-defined, we do ensure some consistency across the different evaluations that people run.
[00:00:33] It does presuppose that you have a fairly large data set, because right from the get-go you are setting aside a whole lot of examples in the dev and test splits that you can't use at all to train your systems. So even though your system might benefit from those examples, you can't use them in that context; they can be used only for evaluations. You're just giving up a lot of potentially useful examples.
[00:00:56] As we've discussed many times, we're all on the honor system when it comes to that test set. It's distributed as part of the data set, but it has a privileged status: the test set can be used only once, after all system development is complete. Then you do a single evaluation on the test set and report that number, completely hands off. This is vital for our field, because it's the only way that we can even hope to get a true picture of how our systems are truly generalizing to new examples.
[00:01:25] That said, the downside of having pre-defined train/dev/test splits is that inevitably everyone is using those same dev and test sets. What that means is that, over time, as we see consistent progress on a benchmark task, we're taking that same measurement on that same test set, and it can be hard to be sure whether we're seeing true progress on the underlying task or the result of a lot of implicit lessons that people have learned about what works and what doesn't for that particular test set. That's true even if everyone is obeying that honor code and using the test set only for truly final evaluations; nonetheless, information can leak out, and we might start to mistake for true progress what is actually just progress on that particular test set. I think the only way that we can really combat this is by continually setting new benchmark tasks for ourselves, with new test sets, so that we see how systems perform in truly unseen environments.
[00:02:25] As you move around in NLP, it's common to find data sets that don't come with that predefined train/dev/test split, and that poses some methodological questions for you. This is especially true for small public data sets that you see out there. It poses a challenge for assessment: for robust comparisons, you really have to run all your models using the same assessment regime, that is, the same splits. That's especially important if the data set is small, because of course in a small data set you're probably going to get more variance across different runs, and this can make it really hard to compare outside of the experimental work that you're doing. If someone has published the results of some random 70/30 train/test split, then unless you can reconstruct exactly the splits that they used, it might be unclear whether you're doing a true apples-to-apples comparison. So that's something to keep in mind, and it does mean that, if you can, for your own experiments you might impose a split right at the start of your project. This is probably feasible if the data set is large, and what it will mean is that you have a simplified experimental setup and you have to do less hyperparameter optimization, just because there are fewer moving parts in your underlying experimental setup. It does presuppose that you have a pretty large data set, because as I said before you have to give up a whole bunch of examples to dev and test, but it will simplify other aspects of your project if it's feasible.
[00:03:51] For small data sets, though, imposing a split might leave too little data, leading to highly variable performance, and in that context, if that's the kind of behavior that you observe, you might want to move into the mode of cross-validation.
[00:04:04] So, with cross-validation in this context, we take a set of examples, say our entire data set, and we partition them into two or more train/test splits. We might do that repeatedly and then average over the results of evaluations on those splits in some way, to give a holistic summary of system performance. In that way, even as those numbers vary (they might have a lot of variance), we're still getting, in the average, we hope, a pretty reliable measure of how the system performs in general on the available data. I'm going to talk about two ways to do cross-validation, each with its own strengths and weaknesses.
[00:04:37] Let's begin with what I've called random splits here. Under the random splits regime, you take your data set and, say, k times, you shuffle it and you split it, with t percent for train and then probably the rest left out for test. On each one of those splits you conduct some kind of evaluation and get back your metrics, and then at the end of all these k evaluations you probably average those metrics in some way to give a single summary number for system performance.
[00:05:09] In general, but not always, when we do these splits we want them to be stratified, in the sense that the train and test splits should have approximately the same distribution over the classes in the underlying data. But I've been careful to say that this is not always true: there could, for example, be contexts in which you would like your test set to stress-test your system by having a very different distribution, maybe an even distribution, or one that's heavily skewed towards some of the smaller but more important classes. That will pose a challenge for train/test regimes, because the system's experience at train time will be different, in this high-level distributional sense, from what it sees at test time, but that of course might be part of what you're trying to pursue as part of your overall hypothesis.
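The difference between stratified and unstratified splitting can be seen directly in scikit-learn; the imbalanced toy labels below are a hypothetical illustration:

```python
from collections import Counter
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Hypothetical imbalanced data set: 90 examples of class 0, 10 of class 1.
X = [[float(i)] for i in range(100)]
y = [0] * 90 + [1] * 10

# Stratified splitting preserves the 90/10 class ratio in the test split.
sss = StratifiedShuffleSplit(n_splits=1, test_size=20, random_state=0)
train_idx, test_idx = next(sss.split(X, y))
test_counts = Counter(y[i] for i in test_idx)
print(test_counts[0], test_counts[1])  # 18 2

# A plain ShuffleSplit makes no such guarantee: a rare class can be over-
# or under-represented in any given test split just by chance.
ss = ShuffleSplit(n_splits=1, test_size=20, random_state=0)
plain_train_idx, plain_test_idx = next(ss.split(X))
```

For the deliberately skewed test distributions mentioned above, you would instead construct the test indices yourself rather than relying on either splitter.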
[00:05:56] The trade-offs for this kind of cross-validation: the good is that you can create as many splits as you want without having this impact the ratio of training to testing examples, because k times we're just going to do a random split, and the ratio can consistently be, independent of k, 70% train and 30% test, or 50/50, or whatever we decide we want; that's independent of the number of splits that we set. The bad of this, of course, is that there's no guarantee that every example will be used the same number of times for training and for testing, and for small data sets this could of course be a concern, because you might be introducing unwanted correlations across the splits, for example never having certain hard examples be part of your test set just as a matter of chance. So that's something to keep in mind, but of course for very large data sets it's very unlikely that you'll be susceptible to the bad part of this, and then you do get a lot of the benefits of the freedom of being able to run lots of experiments with a fixed train/test ratio.
[00:06:59] And of course, as usual, scikit-learn has a lot of tools to help you with this. I've just given some classic examples here from the model_selection package: you might import ShuffleSplit or StratifiedShuffleSplit, and of course train_test_split is a useful utility for very quickly and flexibly creating splits of your data. I make heavy use of these throughout my own code.
[00:07:22] of these throughout my own code the second regime for cross-validation
[00:07:24] the second regime for cross-validation that i'd like to discuss is k-folds
[00:07:26] that i'd like to discuss is k-folds cross-validation here the method is
[00:07:27] cross-validation here the method is slightly different we're going to take
[00:07:29] slightly different we're going to take our data set and split it into three
[00:07:31] our data set and split it into three folds in this case for three-fold
[00:07:33] folds in this case for three-fold cross-validation you could of course
[00:07:35] cross-validation you could of course pick any fold number that you wanted
[00:07:37] pick any fold number that you wanted and then given this three-fold
[00:07:39] and then given this three-fold cross-validation we're going to conduct
[00:07:40] cross-validation we're going to conduct three experiments one where fold one is
[00:07:43] three experiments one where fold one is used for testing
[00:07:44] used for testing and two and three are merged together
[00:07:46] and two and three are merged together for training
[00:07:47] for training a second experiment where we hold out
[00:07:49] a second experiment where we hold out fold two for testing and the union of
[00:07:52] fold two for testing and the union of one and three is used for training and
[00:07:54] one and three is used for training and then finally a third experiment where
[00:07:55] then finally a third experiment where fold three is used for testing and folds
[00:07:58] fold three is used for testing and folds one and two are concatenated for the
[00:08:01] one and two are concatenated for the train set
[00:08:03] train set the trade-offs here are slightly
[00:08:04] the trade-offs here are slightly different from the trade-offs for random
[00:08:06] different from the trade-offs for random splits so the good of this is that every
[00:08:09] splits so the good of this is that every example
[00:08:10] example appears in a train set exactly k minus
[00:08:12] appears in a train set exactly k minus one times and in a test set exactly once
[00:08:15] one times and in a test set exactly once we have that guaranteed in virtue of the
[00:08:17] we have that guaranteed in virtue of the fact that we use a single split over
[00:08:19] fact that we use a single split over here to conduct our three experimental
[00:08:22] here to conduct our three experimental paradigms
[00:08:23] paradigms the bad of this of course can be really
[00:08:25] the bad of this of course can be really difficult the size of k is determining
[00:08:28] difficult the size of k is determining the size of the trained test split right
[00:08:30] the size of the trained test split right just consider that for three-fold
[00:08:32] just consider that for three-fold cross-validation we're going to use 67
[00:08:34] cross-validation we're going to use 67 of the data for training
[00:08:36] of the data for training and 33 for testing
[00:08:38] and 33 for testing but if three experiments is not enough
[00:08:40] but if three experiments is not enough if we want 10 folds the result of that
[00:08:42] if we want 10 folds the result of that will be that we use 90 of our data for
[00:08:45] will be that we use 90 of our data for trading and 10 for testing and the
[00:08:48] trading and 10 for testing and the bottom line is that those are very
[00:08:49] bottom line is that those are very different experimental scenarios from
[00:08:51] different experimental scenarios from the point of view of the amount of
[00:08:53] the point of view of the amount of training data that your system has
[00:08:55] training data that your system has and probably the variance that you see
[00:08:57] and probably the variance that you see in testing because of the way you're
[00:08:59] in testing because of the way you're changing the size of the test set
[00:09:01] changing the size of the test set whereas for the random splits that we
[00:09:03] whereas for the random splits that we just discussed we have an independence
[00:09:05] just discussed we have an independence of the number of folds and then the
[00:09:07] of the number of folds and then the percentage of train test examples that
[00:09:09] percentage of train test examples that we're going to have and that can be very
[00:09:11] we're going to have and that can be very freeing especially for large data sets
[00:09:13] freeing especially for large data sets where the value up here in the good is
[00:09:16] where the value up here in the good is really less pressing
[00:09:19] really less pressing and again scikit-learn has lots of tools
[00:09:21] and again scikit-learn has lots of tools for this i've actually just given a
[00:09:23] for this i've actually just given a sample of them here you have k-fold
[00:09:25] sample of them here you have k-fold stratified k-fold and then cross-val
[00:09:28] stratified k-fold and then cross-val score is a nice raptor utility that will
[00:09:30] score is a nice raptor utility that will again give you flexible access to lots
[00:09:32] again give you flexible access to lots of different ways of conceptualizing
[00:09:35] of different ways of conceptualizing cross validation
Lecture 058
Model Evaluation | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=TxTblROT9lY
---
Transcript
[00:00:04] Welcome everyone. This is part five in our series on methods and metrics. We're going to be talking about essential selected topics in model evaluation in our field.
[00:00:14] Here's our overview. I'd like to start by talking about baselines and their role in experimental comparisons. Then we'll discuss hyperparameter optimization, both the process and the motivations, as well as compromises that you might have to make due to resource constraints and other constraints. We'll touch briefly on classifier comparison, which is a topic we covered in the sentiment analysis unit. And then we'll close with two topics that are really pressing for deep learning models: assessing models without convergence, and the role of random parameter initialization in shaping experimental results.
[00:00:51] So let's begin with baselines. The fundamental insight here is that, in our field, evaluation numbers can never be understood properly in isolation. Let's consider two extreme cases. Suppose your system gets 0.95 F1; you might feel like you can declare victory at that point, but it will be natural for people who are consuming your results to ask: is the task too easy? Is it really an achievement that you got 0.95, or would even simpler systems have achieved something similar?
[00:01:21] At the other end of the spectrum, suppose your system gets 0.6 F1. You might think that means you haven't gotten traction, but we should ask two questions: first, what do humans get, as a kind of upper bound, and also, what would a random classifier get? If your 0.6 is really different from the random classifier, and human performance is kind of low, we might then see that this 0.6 F1 is a real achievement.
[00:01:45] That kind of shows you that baselines are just crucial for strong experiments in our field, so defining baselines should not be an afterthought but rather central to how you define your overall hypothesis. Baselines are really important for building a persuasive case, and they can be used to illuminate specific aspects of the problem that you're tackling and specific virtues of your proposed system. What this really comes down to is that, right from the start, you might be saying, for example: here's a baseline model, here's my proposed modification of it, and the way we test the hypothesis is by comparing the performance of those two systems. In that context, you can see that the baseline is playing a crucial role in quantifying the extent to which your hypothesis is true, and therefore careful model comparisons at that level are going to be really fundamental to successful pursuit of the hypothesis.
[00:02:38] When in doubt, you could include random baselines in your results table. They're very easy to set up and can illuminate what it's like if we're just making random predictions. And here I'm showing you that scikit-learn kind of has you covered on this point: they have two classes, DummyClassifier and DummyRegressor, each with a wide range of different ways that they can make random guesses based on the data. I would encourage you to use these classes, because it will make it easy for you to fit the random baselines into your overall experimental pipeline, which will reduce the amount of code that you have to write and possibly avoid bugs that might come from implementing these baselines yourself. So: strongly encouraged.
[00:03:16] encouraged and kind of at the other end of the
[00:03:17] and kind of at the other end of the spectrum you might think for your task
[00:03:19] spectrum you might think for your task whether there are tasks specific
[00:03:21] whether there are tasks specific baselines that you should be considering
[00:03:23] baselines that you should be considering because they might reveal something
[00:03:24] because they might reveal something about the data set or the problem or the
[00:03:26] about the data set or the problem or the way people are going about modeling the
[00:03:28] way people are going about modeling the problem we saw an example of this before
[00:03:31] problem we saw an example of this before in the context of natural language
[00:03:32] in the context of natural language inference we saw that hypothesis only
[00:03:35] inference we saw that hypothesis only baselines tended to make predictions
[00:03:37] baselines tended to make predictions that were as good as 0.65 to 0.70 f1
[00:03:41] that were as good as 0.65 to 0.70 f1 which is substantially better than the
[00:03:43] which is substantially better than the baseline random chance would which would
[00:03:45] baseline random chance would which would be at about 0.33
[00:03:47] be at about 0.33 and that's revealing to us that when we
[00:03:49] and that's revealing to us that when we measure performance we should really be
[00:03:51] measure performance we should really be thinking about gains above that
[00:03:53] thinking about gains above that hypothesis only baseline comparisons
[00:03:55] hypothesis only baseline comparisons against random chance are going to
[00:03:57] against random chance are going to vastly overstate the extent to which we
[00:03:59] vastly overstate the extent to which we have made meaningful progress on those
[00:04:01] have made meaningful progress on those data sets
[00:04:03] The story of the Story Cloze task is somewhat similar. Here the task is to distinguish between a coherent and an incoherent ending for a story, and people observed that systems that looked only at the ending options were able to do really well: there is some bias in coherent and incoherent continuations that makes them pretty good evidence for making this classification decision. Again, you might think that reveals a fundamental problem with the dataset, and that might be true, but another perspective is simply that when we do comparisons and think about model performance, it should be with this as the baseline and not random guessing.
[00:04:43] Okay, let's talk about hyperparameter optimization. We discussed this in our unit on sentiment analysis and walked through some of the rationale; let me quickly reiterate the full case. First, hyperparameter optimization might be crucial for obtaining the best version of your model that you can, which might be your fundamental goal. For probably any modern model you're looking at, there is a wide range of hyperparameters, and we know that different settings of them lead to very different outcomes, so it's in your best interest to do hyperparameter search to put your model in the very best light.
[00:05:16] the very best light we also talked at length about how this
[00:05:18] we also talked at length about how this is a crucial step in conducting fair
[00:05:20] is a crucial step in conducting fair comparisons between models it's really
[00:05:23] comparisons between models it's really important that when you conduct a
[00:05:24] important that when you conduct a comparison you not put one model in its
[00:05:27] comparison you not put one model in its best light with its best hyper parameter
[00:05:29] best light with its best hyper parameter settings and have all the other models
[00:05:31] settings and have all the other models be kind of randomly chosen or even
[00:05:33] be kind of randomly chosen or even poorly chosen hyper parameter settings
[00:05:35] poorly chosen hyper parameter settings because that would lead to unfair
[00:05:37] because that would lead to unfair comparisons and exaggerate differences
[00:05:39] comparisons and exaggerate differences between the models what we want to do is
[00:05:41] between the models what we want to do is compare the models all with their best
[00:05:43] compare the models all with their best possible hyper parameter settings and
[00:05:45] possible hyper parameter settings and that implies doing extensive search to
[00:05:48] that implies doing extensive search to find those settings
[00:05:50] find those settings and the third motivation you might have
[00:05:51] and the third motivation you might have is just to understand the stability of
[00:05:53] is just to understand the stability of your architecture we might want to know
[00:05:55] your architecture we might want to know for some large space of hyper parameters
[00:05:57] for some large space of hyper parameters which ones really matter for final
[00:05:59] which ones really matter for final performance maybe which ones lead to
[00:06:01] performance maybe which ones lead to really degenerate solutions and which
[00:06:03] really degenerate solutions and which space of hyper parameters overall
[00:06:05] space of hyper parameters overall perform the best so that we have more
[00:06:07] perform the best so that we have more than just a single set of parameters
[00:06:09] than just a single set of parameters that work well but maybe real insights
[00:06:11] that work well but maybe real insights into the overall settings of the models
[00:06:13] into the overall settings of the models that are really good
[00:06:15] that are really good there's one more rule that i need to
[00:06:17] there's one more rule that i need to reiterate here
[00:06:18] reiterate here all hyperparameter tuning must be done
[00:06:21] all hyperparameter tuning must be done only on train and development data it is
[00:06:23] only on train and development data it is a sin in our field to do any kind of
[00:06:26] a sin in our field to do any kind of hyper-perimeter tuning on a test set
[00:06:28] hyper-perimeter tuning on a test set all that tuning should happen outside of
[00:06:30] all that tuning should happen outside of the test set and then as usual you get
[00:06:32] the test set and then as usual you get one run on the test set with your chosen
[00:06:34] one run on the test set with your chosen parameters and that is the number that
[00:06:36] parameters and that is the number that you report as performance on the test
[00:06:38] you report as performance on the test data
[00:06:39] data that's the only way that we can really
[00:06:42] that's the only way that we can really get a look at how these systems behave
[00:06:44] get a look at how these systems behave on completely unseen data so this is
[00:06:46] on completely unseen data so this is really crucial for
[00:06:47] really crucial for understanding progress in our field
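As a minimal illustration of that discipline (synthetic data and an invented hyperparameter grid, sketched in scikit-learn): every tuning decision consults only the train and dev splits, and the test set is touched exactly once, at the very end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
# Carve out a test set first; it stays untouched during tuning.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

grid = [0.01, 0.1, 1.0, 10.0]  # invented grid over the regularization strength C
best_C, best_dev = None, -1.0
for C in grid:  # tuning touches only train and dev
    score = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train).score(X_dev, y_dev)
    if score > best_dev:
        best_C, best_dev = C, score

# One run on the test set with the chosen setting: the number you report.
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
print(best_C, round(final, 3))
```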
[00:06:51] Now, hyperparameter optimization, as you can imagine, can get very expensive. Let's review that and then talk about some compromises. The ideal for hyperparameter optimization is this: you identify a large set of values for your model; you create a list of all the combinations of those values, which will be the cross product of all the values of the hyperparameters you identified; then, for each of the settings, you cross-validate it on the available training data; you choose the settings that did the best at step three, train on all the training data using those settings, and finally evaluate on the test data. That is the ideal, and let's just think about how that's actually going to work. Suppose for our example that we have one hyperparameter with five values, and a second hyperparameter with ten values; then the cross product is going to give us 50 total settings for those hyperparameters. Suppose we add a third hyperparameter with two values; now the number of settings has jumped up to 100. If we want to do five-fold cross-validation to select the optimal parameters, then we are talking about doing 500 different experiments.
[00:08:01] doing 500 different experiments that's probably perfectly fine if you're
[00:08:03] that's probably perfectly fine if you're dealing with a small linear model with
[00:08:05] dealing with a small linear model with some hand built features but if you are
[00:08:07] some hand built features but if you are fitting a large transformer based model
[00:08:09] fitting a large transformer based model where each experiment takes you up to
[00:08:11] where each experiment takes you up to one day
[00:08:12] one day this is going to be prohibitively
[00:08:14] this is going to be prohibitively expensive in terms of time or compute
[00:08:16] expensive in terms of time or compute resources and that's going to compel us
[00:08:18] resources and that's going to compel us to make some compromises this is the
[00:08:20] to make some compromises this is the bottom line here the above picture that
[00:08:23] bottom line here the above picture that ideal is untenable as a set of laws for
[00:08:26] ideal is untenable as a set of laws for our scientific community if we adopted
[00:08:28] our scientific community if we adopted it then complex models trained on large
[00:08:30] it then complex models trained on large data sets would end up disfavored and
[00:08:33] data sets would end up disfavored and only the very wealthy would be able to
[00:08:35] only the very wealthy would be able to participate and just to give you a
[00:08:37] participate and just to give you a glimpse of just how expensive this could
[00:08:39] glimpse of just how expensive this could get here's a quotation from this nice
[00:08:40] get here's a quotation from this nice paper on nlp the machine learning for
[00:08:43] paper on nlp the machine learning for healthcare in their supplementary
[00:08:45] healthcare in their supplementary materials they report that performance
[00:08:47] materials they report that performance on all of the above neural networks was
[00:08:49] on all of the above neural networks was tuned automatically using google vizier
[00:08:52] tuned automatically using google vizier with a total of over 200 000 gpu hours
[00:08:56] with a total of over 200 000 gpu hours for me as a private citizen that could
[00:08:58] for me as a private citizen that could easily cost a half a million dollars
[00:09:01] easily cost a half a million dollars just for the process of hyper parameter
[00:09:03] just for the process of hyper parameter optimization and that's what i mean by
[00:09:05] optimization and that's what i mean by this being kind of fundamentally
[00:09:07] this being kind of fundamentally untenable for us
[00:09:09] untenable for us so what should we do in response we need
[00:09:11] so what should we do in response we need a pragmatic response here here are some
[00:09:13] a pragmatic response here here are some steps that you take to alleviate the
[00:09:15] steps that you take to alleviate the problem in what i view as kind of
[00:09:17] problem in what i view as kind of descending order of attractiveness so
[00:09:20] descending order of attractiveness so starting with the best option you could
[00:09:22] starting with the best option you could do some random sampling and maybe guided
[00:09:24] do some random sampling and maybe guided sampling to explore a large space of
[00:09:27] sampling to explore a large space of hyper parameters on a fixed
[00:09:28] hyper parameters on a fixed computational budget
[00:09:31] You could do search based on just a few epochs of training: rather than allowing your model to run for many epochs, which could take a whole day, you might select hyperparameters based on one or two epochs, on the assumption that settings that are good at the start will remain good, and settings that are bad at the start will remain bad. That's a heuristic assumption, but it seems reasonable; you could possibly bolster it with some learning curves and so forth, and it could vastly cut down on the amount you have to spend in this search process.
[00:10:02] you have to spend in this search process you could also search based on subsets
[00:10:04] you could also search based on subsets of the data this would be another kind
[00:10:06] of the data this would be another kind of compromise however because a lot of
[00:10:08] of compromise however because a lot of hyper parameters are dependent on data
[00:10:10] hyper parameters are dependent on data set size i think of regularization terms
[00:10:13] set size i think of regularization terms this might be riskier than the version
[00:10:16] this might be riskier than the version in two there where we just train for a
[00:10:17] in two there where we just train for a few epochs
[00:10:20] few epochs you also might do some heuristic search
[00:10:22] you also might do some heuristic search maybe by defining which hyper parameters
[00:10:24] maybe by defining which hyper parameters matter less and then set those by hand
[00:10:27] matter less and then set those by hand based on this heuristic search and then
[00:10:28] based on this heuristic search and then you might just describe that process in
[00:10:30] you might just describe that process in the paper that you know via a few
[00:10:32] the paper that you know via a few observations you made some guesses about
[00:10:34] observations you made some guesses about parameters that you could fix and
[00:10:36] parameters that you could fix and therefore explored a smaller subset of
[00:10:38] therefore explored a smaller subset of the space that you might have liked to
[00:10:39] the space that you might have liked to explore
[00:10:40] explore again i think if you make the case and
[00:10:42] again i think if you make the case and you're clear about this readers will be
[00:10:44] you're clear about this readers will be receptive because we're aware of the
[00:10:45] receptive because we're aware of the costs
[00:10:47] costs you could also find optimal hyper
[00:10:49] you could also find optimal hyper parameters via a single split of your
[00:10:51] parameters via a single split of your data and use them for all subsequent
[00:10:52] data and use them for all subsequent splits
[00:10:54] splits that would be justified if the splits
[00:10:55] that would be justified if the splits are very similar and your model
[00:10:57] are very similar and your model performance is very stable and that
[00:10:59] performance is very stable and that would reduce all that cross validation
[00:11:01] would reduce all that cross validation that did cause the number of experiments
[00:11:03] that did cause the number of experiments we had to run to jump up by a large
[00:11:05] we had to run to jump up by a large amount
[00:11:07] amount and finally you might adopt others
[00:11:09] and finally you might adopt others choices
[00:11:10] choices now the skeptic will complain that these
[00:11:12] now the skeptic will complain that these findings don't translate to new data
[00:11:14] findings don't translate to new data sets but it could be the only option
[00:11:16] sets but it could be the only option that you just observe for example that
[00:11:18] that you just observe for example that for some very large model the original
[00:11:20] for some very large model the original authors use settings x y and z and you
[00:11:22] authors use settings x y and z and you might simply adopt them even knowing
[00:11:25] might simply adopt them even knowing that your data set or your test might
[00:11:27] that your data set or your test might call for different optimal settings
[00:11:30] call for different optimal settings it isn't the best but if it's the only
[00:11:32] it isn't the best but if it's the only thing that you can afford
[00:11:34] thing that you can afford it's certainly a reasonable case to make
[00:11:38] Finally, some tools for hyperparameter search. As usual, scikit-learn has a bunch of great tools for this: grid search, randomized search, and halving grid search. Grid search will be the most expensive, randomized search the least expensive, and halving grid search will help you strategically navigate through the space of hyperparameters. If you want to go even further in that direction, the scikit-optimize package offers a bunch of tools for doing model-based, performance-driven exploration of a space of hyperparameters, and that could be very effective indeed.
[00:12:13] All right, let's talk briefly about classifier comparison. It's a topic we've reviewed before, but I'll just briefly recap. The scenario is this: suppose you've assessed two classifier models; their performance is probably different to some degree numerically. What can be done to establish whether those models are different in some meaningful sense? As we've discussed, I think the guidance from the literature is that, first, we could look for practical differences: if you just observe that one model makes 10,000 more highly important predictions than another, then that might be sufficient to make the case that it's the better model. For differences that are narrower, the guidance is that we might use confidence intervals on repeated runs, or the Wilcoxon signed-rank test, to get a single summary statistic of whether or not the different runs are truly different in their means and variance. You could use McNemar's test if you can only afford to run one experiment, whereas the Wilcoxon test and confidence intervals will require you to run 10 to 20 different experiments, which again could be prohibitively expensive. In those situations you might fall back to McNemar's test, because it's less expensive and arguably better than nothing, especially in scenarios where it's hard to tell whether there are practical differences between the systems.
[00:13:32] Finally, let's talk about two topics that seem especially pressing in the context of large-scale deep learning models. The first is assessing models without convergence. When working with linear models, convergence issues rarely arise, because the models tend to converge quickly based on whatever threshold you've set, and convergence implies roughly maximal performance in a wide range of cases. With neural networks, however, convergence issues really take center stage. The models rarely converge, even by liberal thresholds you might set; they converge at different rates between runs, so it's hard to predict; and their performance on the test data is often heavily dependent on these differences. Sometimes a model with a low final error turns out to be great, and sometimes it turns out to be worse than one that finished with a higher error. Who really knows what's going on? Our only fallback in these situations is to do experiments and observe what seems to work best.
[00:14:33] observe what seems to work the best so i think a very natural and easy to
[00:14:36] so i think a very natural and easy to implement response to this that proves
[00:14:37] implement response to this that proves highly effective is what i'm calling
[00:14:39] highly effective is what i'm calling here incremental dev set testing this is
[00:14:42] here incremental dev set testing this is just the idea that as training proceeds
[00:14:45] just the idea that as training proceeds we will regularly collect information
[00:14:47] we will regularly collect information about performance on some held out dev
[00:14:49] about performance on some held out dev set as part of the training process for
[00:14:52] set as part of the training process for example at every 100th iteration you
[00:14:54] example at every 100th iteration you could make predictions on that dev set
[00:14:56] could make predictions on that dev set and store those predictions for some
[00:14:58] and store those predictions for some kind of assessment
[00:15:00] kind of assessment all the pi torch models for our course
[00:15:03] all the pi torch models for our course have an early stopping parameter that
[00:15:04] have an early stopping parameter that will allow you to conduct experiments in
[00:15:06] will allow you to conduct experiments in this way and keep hold of what seemed to
[00:15:09] this way and keep hold of what seemed to be the best model performance wise and
[00:15:11] be the best model performance wise and then report that
[00:15:13] then report that based on the stopping criteria that you
[00:15:15] based on the stopping criteria that you set up and with lock heuristically that
[00:15:17] set up and with lock heuristically that will give you the best model in the
[00:15:19] will give you the best model in the fewest epochs
[00:15:21] fewest epochs the early stopping parameter has a bunch
[00:15:23] the early stopping parameter has a bunch of different other settings that you can
[00:15:25] of different other settings that you can use to control exactly how it behaves
[00:15:27] use to control exactly how it behaves which might be important for particular
[00:15:29] which might be important for particular model structures or data sets
[00:15:33] Here's a bit of motivation for early stopping. You might be thinking: why not just let my model run to convergence if I possibly can? In the context of these large, very difficult optimization processes, that could lead you really far astray. Here is a picture of a deep learning model, and you can see its error going down very quickly over many iterations; it looks like you might want to iterate out even to 80 epochs of training. However, if you look at performance on that held-out dev set, you see that this model actually very quickly reached its peak of performance, and all that remaining training was either wasting time or eroding the performance that you saw early on in the process. Since dev set performance is our real goal here, this is exactly why you might want to do some kind of dev set testing with early stopping.
[00:16:26] The final thing I want to say here is that all of this might lead us to get out of the mode of assuming that we should always be reporting one number to summarize our models. We're dealing with very powerful models; in the limit, they might be able to learn very complicated things, and we might want to ask different questions, like: how quickly can they learn, and how effectively, and how robustly? That might imply that what we really want is not summary tables of statistics, but rather full learning curves with confidence intervals. This is a picture from a paper that I was involved with, but I think it's illuminating to see a by-category breakdown of how the model is performing, in addition to the overall average. You can see that while this red model is arguably much better than the yellow and the gray overall, it's kind of hard to distinguish it globally from this blue model; but for various of the subcategories you do see some differences, whereas for others you see that they're kind of indistinguishable. It's a very rich picture. You can also see that early on, for some of these categories, some of these models are really differentiated (they learn more efficiently), whereas by the time you've run out to 100,000 epochs, many of the model distinctions have disappeared. That's the kind of rich picture that can give us a sense for how different values and different goals might guide different choices about which model to use and how to optimize, and I would just love it if our field got into the habit of reporting this very full picture, as opposed to reducing everything to a single number.
[00:17:58] the final topic is the role of random parameter initialization. this is kind of yet another hyperparameter that's in the background and that's much more difficult to think about.
[00:18:07] most deep learning models have their parameters initialized randomly, or at least many of those parameters are initialized randomly. this is clearly meaningful for the non-convex optimization problems that we're posing, but even simple models can also be impacted if you're dealing with very small data sets with very large feature spaces.
[00:18:27] in this classic paper here, these authors observe that different initializations for neural sequence models doing named entity recognition led to statistically significantly different results. that is, one and the same model with a different random seed performed in ways that looked significantly different on these data sets. and a number of recent systems actually turned out to be indistinguishable in terms of their raw performance once this source of variation was taken into account. that's just a powerful example of how much a random seed can shape final performance in the context of models like this.
[00:19:04] relatedly, at the other end of the spectrum, you can see catastrophic failure as a result of unlucky initialization. some settings are great and some can be miserable failures, and we don't really know ahead of time which will be which. that means we just have to be really attentive to how we're initializing these systems in a wide range of settings.
[00:19:22] you'll notice that in the evaluation methods notebook that i've distributed as a companion to this lecture, i fit a simple feed-forward network, a very small one, on the classic xor problem, which is one of the original motivating problems for using deep learning models at all. what you see is that it succeeds about eight out of ten times, where the only thing we're changing across these models is the way they are randomly initialized.
[00:19:48] that again shows you that this can powerfully shape final performance for our systems, and probably what we need to do is think about this as yet another hyperparameter that we need to tune and optimize along with all the rest.
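the xor experiment can be sketched as follows. this is a minimal pure-python stand-in for the notebook's setup, not the notebook code itself: a tiny 2-2-1 sigmoid network trained by gradient descent, where the only thing that varies across runs is the random seed (the hidden size, learning rate, and epoch count here are illustrative assumptions):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_xor(seed, epochs=5000, lr=1.0):
    """Train a 2-2-1 sigmoid network on XOR from a seed-determined
    random initialization; return True iff all four inputs end up
    classified correctly at a 0.5 threshold."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # hidden weights
    b1 = [rng.uniform(-1, 1) for _ in range(2)]                      # hidden biases
    w2 = [rng.uniform(-1, 1) for _ in range(2)]                      # output weights
    b2 = rng.uniform(-1, 1)                                          # output bias
    X = [(0, 0), (0, 1), (1, 0), (1, 1)]
    Y = [0, 1, 1, 0]

    def forward(x):
        h = [sigmoid(W1[j][0] * x[0] + W1[j][1] * x[1] + b1[j]) for j in range(2)]
        o = sigmoid(w2[0] * h[0] + w2[1] * h[1] + b2)
        return h, o

    for _ in range(epochs):
        for x, y in zip(X, Y):
            h, o = forward(x)
            d_o = (o - y) * o * (1 - o)  # gradient of squared error at the output
            for j in range(2):
                d_h = d_o * w2[j] * h[j] * (1 - h[j])  # backprop to hidden unit j
                w2[j] -= lr * d_o * h[j]
                W1[j][0] -= lr * d_h * x[0]
                W1[j][1] -= lr * d_h * x[1]
                b1[j] -= lr * d_h
            b2 -= lr * d_o
    return all((forward(x)[1] > 0.5) == bool(y) for x, y in zip(X, Y))

# identical architecture and training regime; only the seed changes
successes = sum(train_xor(seed) for seed in range(10))
```

with settings like these, some seeds solve xor while others land in a poor local optimum and fail; the exact success rate depends on the hyperparameters, which is precisely the seed sensitivity the lecture describes.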
Lecture 059
Presenting Your Work: Final Papers | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=yNaDky5E4Wg
---
Transcript
[00:00:05] hello everyone, welcome to the first screencast in our series on presenting your research. the purpose of this series is really to help you do outstanding scholarship in the field of nlp, and i'm also going to try to demystify publishing in the field of nlp. to kick it off, i'd like to focus on the final papers that you're writing specifically for this course.
[00:00:23] here are some practical details. these links take you to essential things about writing final papers specifically for cs224u. the first link is possibly the most important: it just enumerates the requirements for the final paper. please do review it to make sure that you don't get points off for something small that could easily have been corrected, just because you didn't conform to our requirements.
[00:00:44] the next two links are much more about helping you with the substance. the projects file has lots of frequently asked questions and other information about writing final papers for this course, and it also expands out to publishing in the field of nlp, so it has a lot of useful resources when it comes to trying to report your research out to the community.
[00:01:03] and then this third link here links to some excellent past final projects. many of them became publications, typically after a bunch of additional work after the end of the quarter, but it is exciting that a lot of these really excellent published papers began in this course. it's very inspiring.
[00:01:20] for your projects, i just want to review a really important point for me that concerns how we'll evaluate your work. this is from the methods unit for the course, but i want to repeat it here just to emphasize it: we will never evaluate a project based on how good the results are.
[00:01:37] we do recognize that in our field, as in all scientific fields, publications tend to do this, because they have additional constraints on space, and that leads them, as a cultural fact, to favor positive evidence for new developments over negative results. but we of course are not subject to such constraints, so we can do the right and good thing scientifically: evaluating positive results, negative results, and everything in between.
[00:02:04] this has real consequences for how we do evaluation. we're going to evaluate your project based on the appropriateness of the metrics that you chose, the strength of the methods that you used, and maybe most importantly, the extent to which your paper is open and clear-sighted about the limitations of its findings. those are the things that really matter to us scientifically.
[00:02:25] it is a consequence of this policy that you could have a paper that reported state-of-the-art results, but if it's just not a clear and substantive paper, it might not get very good marks from us. conversely, if all of your hypotheses fell apart and it turned out that all your evidence pointed away from them being true, but you nonetheless wrote a paper that was clear about those findings and helped push the field forward by steering us away from those hypotheses, that of course could earn top marks, and we'd be very happy to help you report those results out to the rest of the field, because it is important for us to know about these negative findings, so that we know where to invest our energy as scholars.
[00:03:07] here's a detail from the requirements for the final paper: the authorship statement. this is just a section where you explain how the individual authors on your team, and anyone else who helped out, contributed to the final project. you're free to say whatever you like in these sections; if you would like a model, here's a link to the pnas guidelines, which gives some details of typical statements.
[00:03:33] the rationale is really just that we think this is an important aspect of scholarship in general. it's not yet pervasive in the field of nlp, whereas it is in other fields, but we would like it to be more widespread, because it just seems like a healthy form of disclosure. that's the real rationale. i want to emphasize that only in really extreme cases, and after discussion with all the team members, would we consider giving separate grades to the team based on what was in this statement. that is really not the intent; the intent, rather, is this rationale of just disclosing who did what as part of the project. it's really not about evaluation.
[00:04:10] i also want to emphasize that we have a policy on multiple submissions for this course. it's nuanced and subjective; here's a link to it, and here are some notes on the rationale for it. first, the policy mirrors the policy on multiple submission to conferences: you can't take the same paper, submit it to two different venues with minor modifications, and expect to get two publications out of it. the same thing holds for us when we think about requirements for final papers for this course.
[00:04:40] this is designed to ensure that your project is a substantial new effort. it does mean that you can't merely submit an incremental advancement over another project that you did. we are trying to push back against the pattern where people would take final projects from previous courses, add a couple of new models, and submit those as entirely new papers. that's just unfair to the people who are starting from scratch, and it's really not the sort of work that we would say is up to the level of a final project for a course like this.
[00:05:12] other courses at stanford might have different policies, but that fact alone is not going to lead us to change our policy, because we do think that this is equitable and also reflects values that are pervasive in our field, as you can see from the policies on submission to conferences.
[00:05:29] if any of these policies seem relevant to your work, for example if you are taking a previous course project and developing it in lots of fresh and new ways, start the discussion with your mentor as early as possible to make sure that they're in the loop about what you're doing. we don't want any surprises when you submit your final paper, or after that, when it comes to this policy. so just make sure everyone is in the know, and i predict that things will go fine.
[00:05:55] predict that things will go fine to close a brief note about impact
[00:05:57] to close a brief note about impact statements for now an impact statement
[00:05:59] statements for now an impact statement is an optional section for your final
[00:06:01] is an optional section for your final paper absolutely not required but this
[00:06:03] paper absolutely not required but this has been on my mind a lot lately i think
[00:06:05] has been on my mind a lot lately i think it's really healthy that the field is
[00:06:06] it's really healthy that the field is moving toward having authors include
[00:06:08] moving toward having authors include impact statements and so i thought i
[00:06:10] impact statements and so i thought i would exhort you all to consider having
[00:06:12] would exhort you all to consider having that as part of your final paper as well
[00:06:14] that as part of your final paper as well it does not count against your length
[00:06:15] it does not count against your length limits and it's up to you exactly what
[00:06:18] limits and it's up to you exactly what you would disclose as part of this
[00:06:20] you would disclose as part of this statement there are some examples of
[00:06:21] statement there are some examples of things that you might include you could
[00:06:23] things that you might include you could try to enumerate both the benefits and
[00:06:25] try to enumerate both the benefits and the risks of your research to
[00:06:27] the risks of your research to individuals to society to the world
[00:06:30] individuals to society to the world specifically for the risk you could talk
[00:06:32] specifically for the risk you could talk about costs again to the participants to
[00:06:34] about costs again to the participants to society to the planet where for example
[00:06:37] society to the planet where for example participant costs would be if you had
[00:06:39] participant costs would be if you had human annotators doing a really
[00:06:41] human annotators doing a really difficult
[00:06:42] difficult or kind of negative annotation project
[00:06:45] or kind of negative annotation project you might mention that they paid a
[00:06:46] you might mention that they paid a certain cost and think about whether the
[00:06:49] certain cost and think about whether the costs were worthwhile you could also
[00:06:51] costs were worthwhile you could also think about cost to society and that
[00:06:52] think about cost to society and that would really probably turn on sort of
[00:06:55] would really probably turn on sort of misapplication of your ideas in ways
[00:06:57] misapplication of your ideas in ways that might have more harm than good and
[00:07:00] that might have more harm than good and of course if you trained a really large
[00:07:02] of course if you trained a really large language model or did a really a lot of
[00:07:03] language model or did a really a lot of experiments you could think about the
[00:07:05] experiments you could think about the cost to the planet in terms of energy
[00:07:07] cost to the planet in terms of energy expenditures and so forth
[00:07:09] expenditures and so forth just by way of getting us all to think
[00:07:12] just by way of getting us all to think about the fact that our research does
[00:07:14] about the fact that our research does have costs and that we should all the
[00:07:16] have costs and that we should all the time be thinking about the cost benefit
[00:07:18] time be thinking about the cost benefit analysis when it comes to the work that
[00:07:19] analysis when it comes to the work that we do
[00:07:20] we do and these disclosures are part of
[00:07:22] and these disclosures are part of helping us all have that in mind
[00:07:25] helping us all have that in mind and finally i think it might be really
[00:07:27] and finally i think it might be really inspiring for you to think about
[00:07:28] inspiring for you to think about responsibly use of your data models and
[00:07:31] responsibly use of your data models and findings never mind really evil actors
[00:07:33] findings never mind really evil actors there are likely to be people out there
[00:07:35] there are likely to be people out there who are well-meaning and would like to
[00:07:37] who are well-meaning and would like to apply your ideas but they might be
[00:07:39] apply your ideas but they might be unsure of the limits per se unsure of
[00:07:42] unsure of the limits per se unsure of precisely how to do that responsibly so
[00:07:44] precisely how to do that responsibly so guidance that you could offer about
[00:07:46] guidance that you could offer about where your ideas work and where they
[00:07:48] where your ideas work and where they don't or where your data are relevant
[00:07:51] don't or where your data are relevant and where they're irrelevant could
[00:07:52] and where they're irrelevant could really help someone who is trying to
[00:07:54] really help someone who is trying to just make responsible use of your ideas
[00:07:56] just make responsible use of your ideas you could think about them as part of
[00:07:58] you could think about them as part of crafting this impact statement
[00:08:02] crafting this impact statement for other resources you know i think
[00:08:03] for other resources you know i think it's really great to go through the
[00:08:04] it's really great to go through the exercise of doing a data sheet and a
[00:08:07] exercise of doing a data sheet and a model card a data sheet is a disclosure
[00:08:09] model card a data sheet is a disclosure about a data set that you created or
[00:08:11] about a data set that you created or used and a model card is a similar sort
[00:08:13] used and a model card is a similar sort of structured document for models that
[00:08:15] of structured document for models that you develop and release out into the
[00:08:17] you develop and release out into the world
[00:08:18] world they're both pretty long documents so
[00:08:20] they're both pretty long documents so it's a lot of work to do one in full but
[00:08:22] it's a lot of work to do one in full but it's very rewarding in the sense that it
[00:08:24] it's very rewarding in the sense that it helps you confront some hard truths
[00:08:27] helps you confront some hard truths about the work that you did and
[00:08:29] about the work that you did and articulate the limits of the work that
[00:08:31] articulate the limits of the work that you did all these things are really
[00:08:33] you did all these things are really helpful for your scholarship
[00:08:35] helpful for your scholarship and of course these things are helpful
[00:08:36] and of course these things are helpful when it comes to other people consuming
[00:08:38] when it comes to other people consuming your ideas so that's highly encouraged
[00:08:40] your ideas so that's highly encouraged and you could take bits and pieces from
[00:08:42] and you could take bits and pieces from those structured documents and have them
[00:08:44] those structured documents and have them inform maybe a shorter impact statement
[00:08:46] inform maybe a shorter impact statement that you wrote
[00:08:48] that you wrote and for even more guidance on this you
[00:08:49] and for even more guidance on this you could check out this survey of nurip's
[00:08:51] could check out this survey of nurip's impact statements it has a lot of
[00:08:53] impact statements it has a lot of information about the kinds of things
[00:08:54] information about the kinds of things people are disclosing in these
[00:08:56] people are disclosing in these statements and that too could help you
[00:08:58] statements and that too could help you kind of figure out what you want to say
[00:09:00] kind of figure out what you want to say and what might be relevant to your
[00:09:01] and what might be relevant to your audience
[00:09:03] audience so again this is entirely optional but i
[00:09:05] so again this is entirely optional but i hope this is inspiring and interesting
[00:09:07] hope this is inspiring and interesting for you as a new dimension
[00:09:09] for you as a new dimension when it comes to reporting on the work
[00:09:11] when it comes to reporting on the work that you did
Lecture 060
Writing NLP papers | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=DZNwO-p5PGY
---
Transcript
[00:00:04] hello everyone, welcome to part two in our series on presenting your research. we're going to be talking about writing papers in our field.
[00:00:10] to start, let's look at the outline of a typical nlp paper. by and large, these are either four- or eight-page papers in a two-column format that you get from the style sheets; those lengths don't include the references. and there are a bunch of conventions for how the papers are typically organized. you have your title and abstract on page one, and usually an intro section that fits on that first page, maybe a little bit onto the second. in place two, you discuss the related work or prior literature or background that's needed to contextualize the work that you're doing. then there's typically a data section, followed by a section on a model, which could be thought of as the core proposal section of the paper. then there'll be some methods related to the experiments, a reporting of the results of the experiments, some analysis of what the experimental results mean, and then possibly a short conclusion.
[00:01:03] it's not set in stone that you have to follow these conventions, but if you do follow them, i think it will be easier on your readers and also easier on you as a writer, because you can slot your ideas into this familiar format.
[00:01:16] ideas into this familiar format let's look at those sections in a little
[00:01:17] let's look at those sections in a little bit of uh of detail so starting with the
[00:01:19] bit of uh of detail so starting with the intro the ideal intro to my mind really
[00:01:22] intro the ideal intro to my mind really tells the full story of your paper at a
[00:01:25] tells the full story of your paper at a high level we don't need all the details
[00:01:27] high level we don't need all the details but it is very helpful to know from
[00:01:29] but it is very helpful to know from beginning to end what the paper
[00:01:31] beginning to end what the paper accomplishes and good intros provide all
[00:01:34] accomplishes and good intros provide all of that information and really tell the
[00:01:35] of that information and really tell the reader precisely what they'll learn as
[00:01:38] reader precisely what they'll learn as they go through the rest of the paper
[00:01:41] they go through the rest of the paper in place two as i said is the discussion
[00:01:43] in place two as i said is the discussion of background material or related work
[00:01:45] of background material or related work or prior literature this is an
[00:01:47] or prior literature this is an opportunity for you to contextualize
[00:01:50] opportunity for you to contextualize your work and provide insights into the
[00:01:52] your work and provide insights into the major themes of the literature as a
[00:01:54] major themes of the literature as a whole what you should really be thinking
[00:01:56] whole what you should really be thinking about doing is using each paper or each
[00:01:58] about doing is using each paper or each theme that you identify as a chance to
[00:02:01] theme that you identify as a chance to kind of contextualize your ideas and
[00:02:03] kind of contextualize your ideas and especially articulate what's special
[00:02:06] especially articulate what's special about the contribution that you're
[00:02:07] about the contribution that you're making so this kind of sets the stage
[00:02:10] making so this kind of sets the stage for the reader
[00:02:12] for the reader the data section this could vary a lot
[00:02:14] the data section this could vary a lot this could be very detailed if you're
[00:02:16] this could be very detailed if you're offering a new data set or using a data
[00:02:18] offering a new data set or using a debia set in some unfamiliar way the
[00:02:20] set in some unfamiliar way the community's not used to but of course if
[00:02:22] community's not used to but of course if you're just adopting some data off the
[00:02:24] you're just adopting some data off the shelf then this section might be pretty
[00:02:26] shelf then this section might be pretty short
[00:02:28] short then you get to the heart of your
[00:02:29] then you get to the heart of your proposal your model you want to flesh
[00:02:32] proposal your model you want to flesh out your own approach and really help us
[00:02:34] out your own approach and really help us understand your core contribution
[00:02:37] understand your core contribution then we turn to supporting your ideas
[00:02:39] then we turn to supporting your ideas with some experimental evidence you'll
[00:02:41] with some experimental evidence you'll report the methods your experimental
[00:02:43] report the methods your experimental approach
[00:02:44] approach including descriptions of the metrics
[00:02:46] including descriptions of the metrics and again that will be long or short
[00:02:47] and again that will be long or short depending on whether the metrics are
[00:02:49] depending on whether the metrics are familiar or unfamiliar you want to
[00:02:51] familiar or unfamiliar you want to describe your baseline models and
[00:02:53] describe your baseline models and anything else that's relevant to kind of
[00:02:55] anything else that's relevant to kind of understanding precisely what's going to
[00:02:56] understanding precisely what's going to happen in your experiments
[00:02:59] happen in your experiments i will say that for details about hyper
[00:03:01] i will say that for details about hyper parameters and optimization choices and
[00:03:03] parameters and optimization choices and so forth you can probably move those to
[00:03:05] so forth you can probably move those to an appendix unless they're really
[00:03:07] an appendix unless they're really central to the argument what you want to
[00:03:08] central to the argument what you want to offer here are kind of the crucial
[00:03:10] offer here are kind of the crucial pieces that will help the reader
[00:03:12] pieces that will help the reader understand precisely what you did for
[00:03:14] understand precisely what you did for your experiments
[00:03:17] your experiments then we get our results this could be a
[00:03:18] then we get our results this could be a no-nonsense report of what happened it's
[00:03:20] no-nonsense report of what happened it's probably mainly going to be supported by
[00:03:22] probably mainly going to be supported by figures and tables that report a summary
[00:03:24] figures and tables that report a summary of your core findings according to your
[00:03:27] of your core findings according to your data models and metrics
[00:03:29] data models and metrics and then things open up a bit you have
[00:03:31] and then things open up a bit you have an analysis section i think this is
[00:03:32] an analysis section i think this is really important you should articulate
[00:03:34] really important you should articulate for the reader what your results mean
[00:03:37] for the reader what your results mean what they don't mean where they can be
[00:03:39] what they don't mean where they can be improved where their limits are and so
[00:03:40] improved where their limits are and so forth right these sections vary a lot
[00:03:43] forth right these sections vary a lot depending on the nature of the paper and
[00:03:44] depending on the nature of the paper and the findings but i think they're always
[00:03:47] the findings but i think they're always important and they can be very rewarding
[00:03:49] important and they can be very rewarding it is intimidating because this is
[00:03:51] it is intimidating because this is awfully open-ended but i'm hoping that
[00:03:53] awfully open-ended but i'm hoping that the previous unit on analysis methods in
[00:03:55] the previous unit on analysis methods in our field offers some really good general
[00:03:58] our field offers some really good general purpose tools and techniques for doing
[00:04:00] purpose tools and techniques for doing rich analyses of this sort and really
[00:04:02] rich analyses of this sort and really helping us understand precisely what you
[00:04:04] helping us understand precisely what you accomplished
[00:04:07] accomplished now
[00:04:07] now this as i said is not set in stone
[00:04:10] this as i said is not set in stone and different projects will call for
[00:04:11] and different projects will call for different variants on it and one really
[00:04:13] different variants on it and one really prominent variant that you see is that
[00:04:15] prominent variant that you see is that if you have multiple experiments with
[00:04:17] if you have multiple experiments with multiple data sets you might want to
[00:04:19] multiple data sets you might want to repeat that methods results analysis
[00:04:21] repeat that methods results analysis rhythm across all of your experiments to
[00:04:24] rhythm across all of your experiments to give them kind of separate sections in
[00:04:26] give them kind of separate sections in your paper but again it really depends
[00:04:28] your paper but again it really depends on what you think the most natural way
[00:04:30] on what you think the most natural way to express your ideas is these things
[00:04:32] to express your ideas is these things aren't set in stone they're just
[00:04:33] aren't set in stone they're just conventions that help us as readers and
[00:04:36] conventions that help us as readers and as authors
[00:04:38] as authors then finally you have a conclusion this
[00:04:39] then finally you have a conclusion this is probably a quick summary of what the
[00:04:41] is probably a quick summary of what the paper did
[00:04:42] paper did and then an opportunity for you to chart
[00:04:44] and then an opportunity for you to chart out future directions that you or others
[00:04:47] out future directions that you or others might pursue so it's a chance to be more
[00:04:49] might pursue so it's a chance to be more outward looking and expansive
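[the conventional layout walked through above can be sketched as a minimal LaTeX skeleton; the section names are illustrative conventions from this lecture, not prescribed requirements:]

```latex
\documentclass{article}
\begin{document}

\section{Introduction}  % full story of the paper at a high level
\section{Related Work}  % contextualize; what's special about your contribution
\section{Data}          % long for a new or unfamiliar data set, short for off-the-shelf data
\section{Model}         % the heart of your proposal
\section{Methods}       % metrics and baselines; hyperparameters can move to an appendix
\section{Results}       % no-nonsense report, mostly figures and tables
\section{Analysis}      % what the results mean, what they don't, where their limits are
\section{Conclusion}    % quick summary plus future directions

\end{document}
```

[for multiple experiments, the Methods/Results/Analysis rhythm can be repeated per experiment, as noted in the lecture.]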
[00:04:53] let me close the screencast with some
[00:04:55] let me close the screencast with some general advice on scientific writing
[00:04:57] general advice on scientific writing that i think can be helpful kind of in
[00:04:58] that i think can be helpful kind of in the background as you think about
[00:05:00] the background as you think about expressing your ideas
[00:05:02] expressing your ideas first i just want to review this really
[00:05:04] first i just want to review this really nice piece from stuart shieber where he
[00:05:06] nice piece from stuart shieber where he advocates for what he calls the rational
[00:05:08] advocates for what he calls the rational reconstruction approach to scientific
[00:05:11] reconstruction approach to scientific writing and to build up to that he
[00:05:12] writing and to build up to that he offers two contrasting styles
[00:05:14] offers two contrasting styles that you might think about
[00:05:16] that you might think about the first is what he calls the
[00:05:17] the first is what he calls the continental style in which one
[00:05:20] continental style in which one states the solution with as little
[00:05:22] states the solution with as little introduction or motivation as possible
[00:05:24] introduction or motivation as possible sometimes not even saying what the
[00:05:25] sometimes not even saying what the problem was
[00:05:27] problem was he says readers will have no clue as to
[00:05:29] he says readers will have no clue as to whether you're right or not without
[00:05:31] whether you're right or not without incredible efforts in close reading of
[00:05:33] incredible efforts in close reading of the paper but at least they'll think
[00:05:35] the paper but at least they'll think you're a genius
[00:05:38] you're a genius at the other end of the extreme here you
[00:05:39] at the other end of the extreme here you have what he calls the historical style
[00:05:41] have what he calls the historical style and this is a whole history of false
[00:05:43] and this is a whole history of false starts wrong attempts near misses
[00:05:45] starts wrong attempts near misses redefinitions of the problem
[00:05:48] redefinitions of the problem this is a kind of genuine history of
[00:05:49] this is a kind of genuine history of maybe the struggles that you endured as
[00:05:52] maybe the struggles that you endured as you built up to the final product for
[00:05:54] you built up to the final product for your paper and shieber says this is much
[00:05:57] your paper and shieber says this is much better than the continental style
[00:05:58] better than the continental style because the careful reader can probably
[00:06:00] because the careful reader can probably follow the line of reasoning that the
[00:06:02] follow the line of reasoning that the author went through and used this as
[00:06:04] author went through and used this as motivation but the reader will probably
[00:06:06] motivation but the reader will probably think you're a bit addle-headed we don't
[00:06:08] think you're a bit addle-headed we don't need to hear about every dead end and
[00:06:10] need to hear about every dead end and every false start what we would like
[00:06:12] every false start what we would like rather is what shieber calls the rational
[00:06:15] rather is what shieber calls the rational reconstruction you don't present the
[00:06:17] reconstruction you don't present the actual history that you went through but
[00:06:19] actual history that you went through but rather an idealized history that
[00:06:21] rather an idealized history that perfectly motivates each step in the
[00:06:23] perfectly motivates each step in the solution you might selectively choose
[00:06:25] solution you might selectively choose models that you abandoned as a way of
[00:06:28] models that you abandoned as a way of helping the reader understand how you
[00:06:30] helping the reader understand how you built toward your actual core set of
[00:06:32] built toward your actual core set of methods and findings and results
[00:06:35] methods and findings and results so it's going to be a kind of
[00:06:36] so it's going to be a kind of streamlined version of that historical
[00:06:38] streamlined version of that historical style
[00:06:40] style the goal in pursuing the rational
[00:06:41] the goal in pursuing the rational reconstruction style is not to convince
[00:06:44] reconstruction style is not to convince the reader that you're brilliant or
[00:06:45] the reader that you're brilliant or addle-headed for that matter but that your
[00:06:47] addle-headed for that matter but that your solution is trivial
[00:06:49] solution is trivial shieber says it takes a certain strength
[00:06:51] shieber says it takes a certain strength of character to take that as one's goal
[00:06:53] of character to take that as one's goal right the goal
[00:06:55] right the goal of writing a really excellent paper is
[00:06:57] of writing a really excellent paper is that the reader comes away thinking that
[00:06:59] that the reader comes away thinking that was clear and obvious that even i could
[00:07:01] was clear and obvious that even i could have done it that's an act of genuine
[00:07:03] have done it that's an act of genuine communication
[00:07:04] communication and it does take a strength of character
[00:07:06] and it does take a strength of character but in the end this is what we should
[00:07:08] but in the end this is what we should all be striving for this kind of really
[00:07:10] all be striving for this kind of really clear and open communication
[00:07:14] this is also a nice document from david
[00:07:16] this is also a nice document from david goss who has some hints on mathematical
[00:07:18] goss who has some hints on mathematical style there's a bunch of low-level
[00:07:19] style there's a bunch of low-level details in there especially related to
[00:07:21] details in there especially related to presenting very formal work the piece
[00:07:24] presenting very formal work the piece that i wanted to pull out is just have
[00:07:25] that i wanted to pull out is just have mercy on the reader this is again
[00:07:27] mercy on the reader this is again recalling the rational reconstruction
[00:05:29] recalling the rational reconstruction approach that shieber advocated for where
[00:05:31] approach that shieber advocated for where you're really thinking about what it's
[00:07:33] you're really thinking about what it's like to be a reader encountering the
[00:07:35] like to be a reader encountering the ideas for the first time and genuinely
[00:07:38] ideas for the first time and genuinely trying to understand what you
[00:07:39] trying to understand what you accomplished you have to really think
[00:07:41] accomplished you have to really think about what it's like to be in that
[00:07:42] about what it's like to be in that position in order to have a successful
[00:07:45] position in order to have a successful and clear paper
[00:07:48] and clear paper i also really like this piece from the
[00:07:50] i also really like this piece from the novelist cormac mccarthy which he
[00:07:52] novelist cormac mccarthy which he published in nature it's full of great
[00:07:54] published in nature it's full of great advice the one piece that i wanted to
[00:07:56] advice the one piece that i wanted to highlight is this mccarthy says decide
[00:07:59] highlight is this mccarthy says decide on your paper's theme and two or three
[00:08:01] on your paper's theme and two or three points you want every reader to remember
[00:08:04] points you want every reader to remember this theme and these points form the
[00:08:06] this theme and these points form the central thread that runs through your
[00:08:08] central thread that runs through your piece the words sentences paragraphs and
[00:08:10] piece the words sentences paragraphs and sections are the needlework that holds
[00:08:12] sections are the needlework that holds it together if something isn't needed to
[00:08:14] it together if something isn't needed to help the reader understand the main
[00:08:16] help the reader understand the main theme omit it
[00:08:17] theme omit it this is helpful to me because i think it
[00:08:20] this is helpful to me because i think it not only results in a better paper but
[00:08:21] not only results in a better paper but it will also be easier for you to write
[00:08:24] it will also be easier for you to write your paper because the themes you choose
[00:08:26] your paper because the themes you choose will determine what to include and
[00:08:28] will determine what to include and exclude and resolve a lot of low-level
[00:08:30] exclude and resolve a lot of low-level questions about your narrative
[00:08:33] questions about your narrative and conversely i've often found that
[00:08:34] and conversely i've often found that when i'm really struggling to write a
[00:08:36] when i'm really struggling to write a paper it's because i haven't figured out
[00:08:38] paper it's because i haven't figured out what these core themes are and i'm kind
[00:08:40] what these core themes are and i'm kind of casting about unsure of what's
[00:08:42] of casting about unsure of what's relevant and what's irrelevant and if
[00:08:44] relevant and what's irrelevant and if you step back and really figure out what
[00:08:46] you step back and really figure out what you're trying to communicate then the
[00:08:48] you're trying to communicate then the act of writing
[00:08:49] act of writing kind of all falls into place
[00:08:53] kind of all falls into place and then the final bit of advice that i
[00:08:54] and then the final bit of advice that i wanted to offer which i'm going to
[00:08:55] wanted to offer which i'm going to return to when we talk about
[00:08:57] return to when we talk about presenting work at conferences this
[00:09:00] presenting work at conferences this comes from patrick blackburn it's about
[00:09:01] comes from patrick blackburn it's about talks but it really extends to any kind
[00:09:03] talks but it really extends to any kind of communication in science his
[00:09:06] of communication in science his fundamental insight he asks
[00:09:08] fundamental insight he asks where do good talks and i think good
[00:09:10] where do good talks and i think good papers where do they come from and he
[00:09:11] papers where do they come from and he says
[00:09:12] says honesty
[00:09:14] honesty a good talk or a good paper should never
[00:09:16] a good talk or a good paper should never stray far from simple honest
[00:09:18] stray far from simple honest communication and you can hear this in
[00:09:20] communication and you can hear this in the way that we talk about evaluating
[00:09:22] the way that we talk about evaluating your work that fundamentally for us
[00:09:25] your work that fundamentally for us we're looking for papers that offer open
[00:09:27] we're looking for papers that offer open clear honest communication about what
[00:09:30] clear honest communication about what happened and what it means and that's
[00:09:32] happened and what it means and that's like really a fundamental value and i
[00:09:34] like really a fundamental value and i think it's inspiring to think about this
[00:09:36] think it's inspiring to think about this as your kind of guiding light when you
[00:09:38] as your kind of guiding light when you report your scientific results to the
[00:09:40] report your scientific results to the community
Lecture 061
NLP Conference Submissions | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=vb7IN-C7fHs
---
Transcript
[00:00:05] welcome everyone this is part three in
[00:00:06] welcome everyone this is part three in our series on presenting your research
[00:00:08] our series on presenting your research we're going to be talking about the
[00:00:09] we're going to be talking about the sometimes thrilling and sometimes
[00:00:11] sometimes thrilling and sometimes agonizing process of submitting your
[00:00:13] agonizing process of submitting your work for publication at an nlp
[00:00:15] work for publication at an nlp conference
[00:00:16] conference to start i want to review what's known
[00:00:18] to start i want to review what's known as the anonymity period for acl
[00:00:20] as the anonymity period for acl conferences
[00:00:21] conferences uh all of the acl conferences have
[00:00:23] uh all of the acl conferences have adopted a uniform policy that submitted
[00:00:26] adopted a uniform policy that submitted papers cannot be uploaded to
[00:00:27] papers cannot be uploaded to repositories like arxiv or made public
[00:00:30] repositories like arxiv or made public in other ways starting one month from
[00:00:32] in other ways starting one month from the submission deadline and extending
[00:00:33] the submission deadline and extending through the time when decisions go out
[00:00:36] through the time when decisions go out so you should be aware of this and for
[00:00:38] so you should be aware of this and for specific conferences check their sites
[00:00:40] specific conferences check their sites for the precise date when the embargo
[00:00:42] for the precise date when the embargo goes into effect so that you're sure
[00:00:44] goes into effect so that you're sure you're playing by the rules
[00:00:46] you're playing by the rules the rationale the policy is an attempt
[00:00:48] the rationale the policy is an attempt to balance the benefits of free and fast
[00:00:50] to balance the benefits of free and fast distribution of new ideas against the
[00:00:52] distribution of new ideas against the benefits of double blind peer review the
[00:00:54] benefits of double blind peer review the idea is that we want to avoid
[00:00:57] idea is that we want to avoid corrupting reviewers just at the moment
[00:00:59] corrupting reviewers just at the moment they sit down to begin their work of
[00:01:00] they sit down to begin their work of reviewing by having them see on
[00:01:03] reviewing by having them see on arxiv or via twitter an announcement
[00:01:05] arxiv or via twitter an announcement about the paper they're reviewing that
[00:01:07] about the paper they're reviewing that reveals the authors and the origins and
[00:01:08] reveals the authors and the origins and so forth because we know that that
[00:01:10] so forth because we know that that public announcement would influence
[00:01:12] public announcement would influence their decision-making process
[00:01:14] their decision-making process so to kind of preserve that
[00:01:17] so to kind of preserve that but balance that against free
[00:01:18] but balance that against free dissemination of ideas we have this
[00:01:21] dissemination of ideas we have this embargo period around the period when we
[00:01:23] embargo period around the period when we know reviewing will happen
[00:01:26] know reviewing will happen for more on the policy and its rationale
[00:01:27] for more on the policy and its rationale you can follow this link here but i
[00:01:29] you can follow this link here but i would say that fundamentally it's an
[00:01:31] would say that fundamentally it's an attempt to balance these two pressures a
[00:01:33] attempt to balance these two pressures a kind of pragmatic approach to balancing
[00:01:35] kind of pragmatic approach to balancing them
[00:01:36] them okay now let's dive into the actual
[00:01:38] okay now let's dive into the actual process of submitting work for
[00:01:40] process of submitting work for publication and having it go through
[00:01:41] publication and having it go through review at one of these conferences so to
[00:01:43] review at one of these conferences so to start let's suppose you've submitted
[00:01:44] start let's suppose you've submitted your paper when you do that you'll
[00:01:46] your paper when you do that you'll select some area keywords that will help
[00:01:49] select some area keywords that will help determine which committee gets your
[00:01:51] determine which committee gets your paper this is already a really important
[00:01:54] paper this is already a really important step you'll be choosing from a bunch of
[00:01:56] step you'll be choosing from a bunch of keywords that signal different areas of
[00:01:58] keywords that signal different areas of nlp and what you're doing at that point
[00:02:00] nlp and what you're doing at that point is probably routing your paper to
[00:02:02] is probably routing your paper to certain sets of reviewers and certain
[00:02:04] certain sets of reviewers and certain area chairs and so forth and in doing
[00:02:07] area chairs and so forth and in doing that you're creating expectations about
[00:02:09] that you're creating expectations about the kind of contribution that you're
[00:02:11] the kind of contribution that you're making
[00:02:12] making so if you're unsure about this process i
[00:02:14] so if you're unsure about this process i would encourage you to recruit an expert
[00:02:16] would encourage you to recruit an expert in the nlp reviewing process to help you
[00:02:18] in the nlp reviewing process to help you make these keyword selections here's an
[00:02:20] make these keyword selections here's an example of how this could be important
[00:02:22] example of how this could be important suppose that your paper is fundamentally
[00:02:24] suppose that your paper is fundamentally a new machine learning contribution but
[00:02:26] a new machine learning contribution but it reports some experiments that involve
[00:02:28] it reports some experiments that involve kind of topics in computational social
[00:02:30] kind of topics in computational social science it might be really a mistake to
[00:02:33] science it might be really a mistake to choose the computational social science
[00:02:36] choose the computational social science keyword at this point because if your
[00:02:38] keyword at this point because if your paper ends up with reviewers who have
[00:02:41] paper ends up with reviewers who have expectations that you'll be making some
[00:02:43] expectations that you'll be making some new and fundamental contribution to that
[00:02:45] new and fundamental contribution to that area and what they're looking at is a
[00:02:47] area and what they're looking at is a machine learning contribution that
[00:02:50] machine learning contribution that mismatch in their expectations might
[00:02:52] mismatch in their expectations might lead them to have a negative perception
[00:02:53] lead them to have a negative perception of the work
[00:02:54] of the work and the reverse of course holds as well
[00:02:56] and the reverse of course holds as well if you have a fundamentally new
[00:02:58] if you have a fundamentally new computational social sciences
[00:03:00] computational social sciences contribution but it incidentally makes
[00:03:02] contribution but it incidentally makes use of some machine learning apparatus i
[00:03:04] use of some machine learning apparatus i think it would be a mistake to choose
[00:03:06] think it would be a mistake to choose machine learning as a keyword precisely
[00:03:08] machine learning as a keyword precisely because of that mismatch and
[00:03:09] because of that mismatch and expectations that would result in
[00:03:11] expectations that would result in reviewers minds when they sat down and
[00:03:13] reviewers minds when they sat down and started reviewing your paper so think
[00:03:15] started reviewing your paper so think strategically about this stage you can
[00:03:18] strategically about this stage you can see why it's important at this next
[00:03:19] see why it's important at this next stage here
[00:03:20] stage here reviewers
[00:03:21] reviewers when they begin their work they're going
[00:03:23] when they begin their work they're going to first scan a long list of titles and
[00:03:25] to first scan a long list of titles and abstracts and they'll make bids on which
[00:03:27] abstracts and they'll make bids on which ones they want to do at this point they
[00:03:29] ones they want to do at this point they signal yes or maybe or no or maybe they
[00:03:31] signal yes or maybe or no or maybe they indicate a conflict of interests
[00:03:33] indicate a conflict of interests indicating that they can't review the
[00:03:34] indicating that they can't review the paper at all
[00:03:36] paper at all when they do this the title is probably
[00:03:38] when they do this the title is probably the primary factor in bidding decisions
[00:03:40] the primary factor in bidding decisions they probably have access to the
[00:03:42] they probably have access to the abstract at this stage but they might be
[00:03:44] abstract at this stage but they might be looking at a list of over 200 different
[00:03:46] looking at a list of over 200 different contributions and it's probably just too
[00:03:48] contributions and it's probably just too much to ask that they would read all of
[00:03:50] much to ask that they would read all of these abstracts at this stage so they're
[00:03:52] these abstracts at this stage so they're probably scanning the title and using
[00:03:54] probably scanning the title and using that as an indication about what kind of
[00:03:56] that as an indication about what kind of bids they want to make so again you
[00:03:58] bids they want to make so again you might think strategically about your
[00:04:00] might think strategically about your title and the role it will play at this
[00:04:02] title and the role it will play at this early stage in the process
[00:04:05] early stage in the process after that the program chairs assign
[00:04:07] after that the program chairs assign reviewers their papers partly based on
[00:04:09] reviewers their papers partly based on the bidding but maybe partly based on
[00:04:11] the bidding but maybe partly based on other considerations of workloads and so
[00:04:13] other considerations of workloads and so forth we don't know precisely how that
[00:04:16] forth we don't know precisely how that process will happen but by some
[00:04:17] process will happen but by some mechanism
[00:04:19] mechanism your paper will be assigned probably
[00:04:20] your paper will be assigned probably three different reviewers
[00:04:23] three different reviewers reviewers read the papers write comments
[00:04:26] reviewers read the papers write comments and supply ratings over the course of a
[00:04:27] and supply ratings over the course of a few months
[00:04:29] few months at the very end of that process authors
[00:04:31] at the very end of that process authors are typically allowed to respond briefly
[00:04:33] are typically allowed to respond briefly to the reviews
[00:04:35] to the reviews and then the program chair or the area
[00:04:37] and then the program chair or the area chair might seek to stimulate some
[00:04:38] chair might seek to stimulate some discussion among the reviewers about
[00:04:40] discussion among the reviewers about conflicts between their reviews or maybe
[00:04:43] conflicts between their reviews or maybe places where the author response says
[00:04:44] places where the author response says the reviewers are incorrect or misguided
[00:04:47] the reviewers are incorrect or misguided or something we hope that that's a
[00:04:48] or something we hope that that's a lively and rich discussion about the
[00:04:50] lively and rich discussion about the paper
[00:04:51] paper that's led by open-minded people who are
[00:04:53] that's led by open-minded people who are just trying to arrive at the best
[00:04:54] just trying to arrive at the best possible recommendation for your paper
[00:04:57] possible recommendation for your paper that's what we hope at this stage in the
[00:04:59] that's what we hope at this stage in the process
[00:05:01] Then finally, at the very end, the program committee is going to do some magic to arrive at the final program based on all of this input. Of course the reviews and ratings will be a major factor, but there might be other considerations that they bring in at this stage, in terms of constructing a diverse and interesting program for their conference. At this stage you might get a meta-review that provides some insight into the final decision-making process, although those vary in how much they actually illuminate the behind-the-scenes process that led to the particular recommendation made by the reviewers and area chair.
[00:05:37] In terms of the work that the actual reviewers are doing, I would say that the current ACL setup is kind of oriented around structured text for the reviews, as opposed to providing a lot of metadata via ratings. So first, they'll probably be asked to just indicate what the paper is about, what contributions it makes, and what its main strengths and weaknesses are. This is a kind of check that they actually understand what's in the paper and can articulate it, and it gives a first indication of their assessment. Next, you have reasons to accept and reasons to reject. Then there could be a section for additional questions and feedback for the authors: maybe the reviewers can indicate missing references, and maybe there's also a catch-all section for typos, grammar, style, presentation improvements, and so forth. Then of course you get two really important ratings: the overall recommendation, and maybe an assessment of the reviewer's confidence in their overall evaluation. And then finally, there could be a section for confidential information that the reviewers want to communicate directly to the program committee; that would be hidden from the authors as well as the other reviewers.
[00:06:45] well as the other reviewers so stepping back the most important
[00:06:47] so stepping back the most important pieces of this reviewing form are
[00:06:49] pieces of this reviewing form are obviously the overall recommendation
[00:06:51] obviously the overall recommendation possibly balanced against reviewer
[00:06:53] possibly balanced against reviewer confidence
[00:06:54] confidence and the reasons to accept and reasons to
[00:06:56] and the reasons to accept and reasons to reject you can count on the two texts
[00:06:58] reject you can count on the two texts that they supply
[00:06:59] that they supply under two and three here as being really
[00:07:02] under two and three here as being really important to shaping the discussion that
[00:07:04] important to shaping the discussion that happens and the overall recommendation
[00:07:06] happens and the overall recommendation that gets made
[00:07:10] There's an author response period for at least the major ACL conferences. This is a chance for authors to submit short responses to the reviews. This is a rather uncertain business along many dimensions, so let me just offer some thoughts. First, many people are cynical about author responses, since they've observed that reviewers rarely change their scores afterwards, and I think that is an important consideration. However, it might just be bad signaling not to submit a response at all. It could incidentally indicate to the program committee that you've kind of silently opted out of the process, so its mere absence could reduce your chances of getting accepted, I believe.
[00:07:50] getting accepted i believe more positively for conferences that
[00:07:53] more positively for conferences that have area chairs and i believe all the
[00:07:55] have area chairs and i believe all the current major acl conferences do the
[00:07:58] current major acl conferences do the author response could be really
[00:07:59] author response could be really important an area chair is someone who's
[00:08:01] important an area chair is someone who's tasked with stimulating discussion and
[00:08:03] tasked with stimulating discussion and might writing meta reviews for a small
[00:08:06] might writing meta reviews for a small number of papers maybe five to twenty
[00:08:08] number of papers maybe five to twenty depending on the conference volume
[00:08:10] depending on the conference volume and for those people the author response
[00:08:12] and for those people the author response might have a major impact i've played
[00:08:15] might have a major impact i've played the role of area chair many times and
[00:08:17] the role of area chair many times and the author responses are always valuable
[00:08:19] the author responses are always valuable to me it's another text alongside the
[00:08:21] to me it's another text alongside the reviews that the um reviewers provided
[00:08:24] reviews that the um reviewers provided and it helps me understand places of
[00:08:26] and it helps me understand places of conflict places where the um authors
[00:08:29] conflict places where the um authors differ in their perspective from the
[00:08:30] differ in their perspective from the reviewers and so forth it is always
[00:08:32] reviewers and so forth it is always extremely valuable evidence to me
[00:08:34] extremely valuable evidence to me and i think that holds for many area
[00:08:36] and i think that holds for many area chairs and for this reason alone you
[00:08:39] chairs and for this reason alone you might think about submitting a detailed
[00:08:41] might think about submitting a detailed author response with these area chairs
[00:08:43] author response with these area chairs in mind as your primary readers
[00:08:47] NLP conferences, for better or worse, have very complex rules about what you can and can't say in these author responses. Sometimes you can't report any new results; sometimes you have to be very circumspect about what kind of results you have, and so forth. If you have questions about what you can do in a particular case, seek out an expert at Stanford for advice on how to interpret the precise rules and what kind of leeway you actually have in saying what you think is important to say. I think all of these restrictions are kind of unfortunate. When I play the role of area chair or reviewer, I would simply like to have access to all the information that I can possibly obtain, and so I would like these author responses to offer as much information as the authors feel is important; then I can use that to balance all the evidence and make a final recommendation. So it's unfortunate, from my perspective, that these have restrictions at all, but they're often there, and it's important to figure out how to navigate them.
[00:09:42] to figure out how to navigate them when you construct an author response
[00:09:44] when you construct an author response always be polite you can be firm and
[00:09:47] always be polite you can be firm and direct but you'll want to do that
[00:09:48] direct but you'll want to do that strategically to signal what you feel
[00:09:50] strategically to signal what you feel most strongly about but fundamentally
[00:09:52] most strongly about but fundamentally you should never say things like this
[00:09:55] you should never say things like this your inattentiveness is embarrassing
[00:09:57] your inattentiveness is embarrassing section 6 does what you say we didn't do
[00:10:00] section 6 does what you say we didn't do you might privately say that to your
[00:10:01] you might privately say that to your co-authors as a kind of cathartic act of
[00:10:04] co-authors as a kind of cathartic act of venting about how bad your reviews were
[00:10:06] venting about how bad your reviews were but you should never put it in an author
[00:10:08] but you should never put it in an author response rather you should do things
[00:10:10] response rather you should do things that are more like thank you the
[00:10:12] that are more like thank you the information you're requesting is in
[00:10:14] information you're requesting is in section six we will make this more
[00:10:16] section six we will make this more prominent in our revision fundamentally
[00:10:18] prominent in our revision fundamentally here i think it's just important to be
[00:10:20] here i think it's just important to be polite in these professional contexts
[00:10:22] polite in these professional contexts and it's a way to remind yourself that
[00:10:24] and it's a way to remind yourself that these reviewers did make an investment
[00:10:26] these reviewers did make an investment of their own time and intellectual
[00:10:28] of their own time and intellectual energy into your work and we want to be
[00:10:30] energy into your work and we want to be respectful and aware of that investment
[00:10:32] respectful and aware of that investment that they made
[00:10:36] Presentation types and venues: there are lots of them. So first, there's the fundamental distinction: you might have either oral presentations or poster presentations, and cross-cutting that, you might think about submitting to a workshop or to a main conference. Here's a whole bunch of venues, and what I've done here is organize them in a soft way. On the left here I have NLP conferences and workshops. I've put what we consider the most prestigious three at the top here: ACL, NAACL, and EMNLP, and then some other large ones just below that. Of course the prestige order could really change here, so who knows what the next years will bring. For example, the AACL is brand new, and of course the number of people doing outstanding work in our area throughout Asia is enormous, and so it's very easy for me to imagine that AACL becomes at least as prominent as, maybe even more prominent than, some of these other ones up here over the next few years.
[00:11:31] Then we have some smaller and older conferences down here, and then of course at the bottom I put workshops. All of these major conferences have workshop series attached to them, and workshops can be a great initial outlet for work you do, especially for a course like this. So I would encourage you, for the major conferences, to scan their program of workshops, and if you find one that's topically aligned with what you're working on, consider submitting to the workshop. It will probably be less competitive, so you have better chances of getting in, but I would say the more important thing is that it's a chance for you to actually connect with a community of people who are working on precisely the topic that you're working on, and that can be intellectually really exciting.
[00:12:12] In the middle here I have some conferences that run the spectrum from linguistics through the World Wide Web through knowledge graphs and more core topics in artificial intelligence generally, and those can be really good outlets for work that has an NLP aspect to it. Then over here on the right are very prestigious machine learning conferences, and it's kind of the same story: all of these conferences are welcoming of work that involves natural language processing, but you might just have to think about how you're going to precisely connect with these specific audiences and their specific concerns.
[00:12:51] Here's my personal assessment of NLP reviewing at present. First, I think the focus on conference papers, as opposed to journals, has been really good for NLP. It fits with and encourages the very rapid pace of our field, and I think we all benefit from that rapid pace overall. Before about 2010, the reviewing in the field was admirably good and rigorous in comparison with other fields. It really was impressive how many deep and insightful reviews you would get when you submitted to one of the main conferences in the field. Lately, though, the explosive growth of the field has, I think by consensus, reduced the general quality of reviewing, and the field is still grappling with this. Of course, there was always a kind of lottery aspect to whether your paper would be accepted, and you should keep in mind that luck is a real element in publication and throughout science. But that lottery aspect has been amplified as the field has grown, and that's affecting the main conferences in ways that we all have to figure out. Again, this is kind of useful: keep in mind that a rejection does not necessarily mean that your work was of low quality. It could mean that you just had really bad luck throughout a kind of chaotic reviewing process.
[00:14:06] chaotic reviewing process i also want to say that i think it's
[00:14:08] i also want to say that i think it's unhealthy to force every paper to be
[00:14:10] unhealthy to force every paper to be four or eight pages the reason that
[00:14:12] four or eight pages the reason that happens is that there are two kinds of
[00:14:14] happens is that there are two kinds of submission typically a short paper which
[00:14:16] submission typically a short paper which has max length four and a long paper
[00:14:18] has max length four and a long paper which has max length eight and we all as
[00:14:21] which has max length eight and we all as submitters to one or another of those
[00:14:22] submitters to one or another of those tracks feel a kind of signaling pressure
[00:14:25] tracks feel a kind of signaling pressure to maximize the available space so there
[00:14:27] to maximize the available space so there are like no six page papers in the field
[00:14:30] are like no six page papers in the field and there was really no room to have
[00:14:32] and there was really no room to have papers that are longer than eight pages
[00:14:34] papers that are longer than eight pages and that's unfortunate because sometimes
[00:14:36] and that's unfortunate because sometimes you just need a different length than
[00:14:37] you just need a different length than one of these to express your ideas in
[00:14:40] one of these to express your ideas in the best possible way
[00:14:42] the best possible way so that's been unhealthy i will say
[00:14:43] so that's been unhealthy i will say though that this is alleviated somewhat
[00:14:45] though that this is alleviated somewhat by the increased use of appendices and
[00:14:48] by the increased use of appendices and supplementary materials to express a lot
[00:14:50] supplementary materials to express a lot of details that don't need to be part of
[00:14:52] of details that don't need to be part of the main narrative and that is in effect
[00:14:54] the main narrative and that is in effect via a back door allowing papers to be of
[00:14:56] via a back door allowing papers to be of more variable length
[00:14:59] The biggest failing, to my mind, of the conference reviewing process is that there's no revise-and-resubmit, as you standardly get from journals. There's no chance for authors to appeal to an editor and interact with an editor in the way that you get from really top-quality journals, and that just introduces inefficiency into the system, and it's a missed opportunity for intellectual engagement that could really benefit the work. I do want to say, though, that there's hope: the Transactions of the ACL, or TACL, is a journal that follows the standard ACL conference model fairly closely but allows for journal-style interaction with an editor. I'm conflicted here, because I've been a TACL action editor for a very long time, but I do that work because I think TACL is wonderful. It allows for the best aspects of our field in terms of fast pace, while also introducing some healthy aspects of the journal reviewing process. TACL papers are a little bit longer, at 10 pages, so there's still some of the problematic influence of fixed page limits that I described under point 4 here, but at least they are a bit longer, and overall I think it's just a healthier rhythm for evaluating work in these scientific contexts. So think about TACL as an outlet for your work as well.
[00:16:12] Let's focus on two more specific topics here, starting with titles. As I said before, a title is an important ingredient in the fate of your paper in conference reviewing. Jokey titles can be risky, and I've linked to some evidence for that down here in this paper. More importantly, it's important to calibrate the scope of your contribution in your title. It's a very common complaint from reviewers that the title is an overreach, that it claims a broader spectrum of the scientific space than the paper actually delivers on, and you want to avoid that by getting a tight alignment between title and paper. Then also consider the kinds of reviewers you're likely to attract during that bidding process. The choices you make in your title will definitely influence that process, and so you can again think strategically about who you want to pull in as a reviewer because of their interests and expertise and so forth.
[00:17:08] I would also say that it's worthwhile avoiding special fonts and formatting if possible, just because they make it harder to consistently reproduce your title. That said, I am kind of charmed by the recent move to have emoji in titles. Jokey is risky, but this can be fun and make your paper a bit memorable, even if it is harder to copy and paste the title in various contexts.
[00:17:32] Abstracts. Abstracts are also incredibly important because, after the title, they create the first and lasting impression. So here's a suggestion about how to think about writing abstracts; it's a difficult form. First, have your opening be a broad overview, a glimpse at the central problem in its context. Next, the middle takes concepts mentioned in the opening and elaborates upon them, probably by connecting with specific experiments and results from your paper. And then finally, at the very end, establish links between your proposal and broader theoretical concerns, so that the reviewer, on finishing the abstract, has an answer to the question: does the abstract offer a substantial and original proposal?
[00:18:14] So, in a little more detail, here's a kind of meta-abstract. The opening sentence situates you, dear reader: "Our approach seeks to address the following central issue ...", and you spell it out. "The techniques we use are as follows ...", and you spell it out. "Our experiments are these ...", with details. And then finally: "Overall, we find that our approach has the following properties ..., and the significance of this is ...". If you just fill in all those ellipsis dots, I think you're pretty close to an effective abstract, and that's at least a starting point for you when you think about maybe being more creative and fitting this better to the specifics of your ideas.
[00:18:52] style sheets this is small but important
[00:18:54] style sheets this is small but important this is the way that you could avoid the
[00:18:56] this is the way that you could avoid the dreaded desk reject
[00:18:58] dreaded desk reject pay close attention to the details of
[00:19:00] pay close attention to the details of the style sheet for your conference and
[00:19:02] the style sheet for your conference and any other requirements included in the
[00:19:04] any other requirements included in the call for papers it is worth your while
[00:19:06] call for papers it is worth your while to read those documents carefully and
[00:19:08] to read those documents carefully and make sure you've checked all the boxes
[00:19:10] make sure you've checked all the boxes in nlp infractions are the most likely
[00:19:13] in nlp infractions are the most likely cause of the dreaded desk reject that
[00:19:15] cause of the dreaded desk reject that is rejection without review which is of
[00:19:18] is rejection without review which is of course so disheartening because you made
[00:19:20] course so disheartening because you made a big push to get your paper submitted
[00:19:22] a big push to get your paper submitted and then shortly thereafter you get this
[00:19:24] and then shortly thereafter you get this desk reject in the mail and there's no
[00:19:26] desk reject in the mail and there's no recourse that's pretty much the end of
[00:19:28] recourse that's pretty much the end of the process for you and you can avoid it
[00:19:30] the process for you and you can avoid it by just attending to what they've said
[00:19:32] by just attending to what they've said in their style sheets and other
[00:19:34] in their style sheets and other guidelines
[00:19:36] guidelines and then finally we have this charming
[00:19:38] and then finally we have this charming notion of a camera-ready version this
[00:19:41] notion of a camera-ready version this refers to old-fashioned technology for
[00:19:43] refers to old-fashioned technology for publishing papers where you actually
[00:19:44] publishing papers where you actually needed to take pictures of documents in
[00:19:46] needed to take pictures of documents in order to publish them those days are
[00:19:48] order to publish them those days are long gone but the terminology remains
[00:19:51] long gone but the terminology remains for most nlp conferences as part of this
[00:19:54] for most nlp conferences as part of this you get an additional page upon
[00:19:55] you get an additional page upon acceptance this is presumably to respond
[00:19:58] acceptance this is presumably to respond to requests made by reviewers although
[00:20:00] to requests made by reviewers although in practice you're free to use the space
[00:20:02] in practice you're free to use the space however you like
[00:20:03] however you like in general we find that the extra page
[00:20:06] in general we find that the extra page is mainly used for fixing passages that
[00:20:09] is mainly used for fixing passages that we made overly terse at reviewing time
[00:20:11] we made overly terse at reviewing time in order to get into the page limits and
[00:20:14] in order to get into the page limits and now we're kind of unpacking them to make
[00:20:15] now we're kind of unpacking them to make them more readable this is obviously an
[00:20:17] them more readable this is obviously an inefficient process but it's the reality
[00:20:20] inefficient process but it's the reality at present
[00:20:23] at present you could of course use your uh
[00:20:25] you could of course use your uh additional page to improve your results
[00:20:27] additional page to improve your results in all sorts of ways it's entirely up to
[00:20:29] in all sorts of ways it's entirely up to you but a word of caution here
[00:20:32] you but a word of caution here if you have very substantially new ideas
[00:20:34] if you have very substantially new ideas and results it might be better for you
[00:20:37] and results it might be better for you to save those for a follow-up paper as
[00:20:40] to save those for a follow-up paper as opposed to trying to pack them into work
[00:20:42] opposed to trying to pack them into work that's already been accepted right
[00:20:44] that's already been accepted right instead of just making more work for
[00:20:45] instead of just making more work for yourself as part of this first
[00:20:46] yourself as part of this first contribution it might be that you're on
[00:20:49] contribution it might be that you're on the exciting path toward a follow-up
[00:20:51] the exciting path toward a follow-up contribution that's going to be its
[00:20:53] contribution that's going to be its entirely own new paper
[00:20:55] entirely own new paper so think about that and balance these
[00:20:56] so think about that and balance these pressures
[00:20:58] pressures and then good luck with your camera
[00:20:59] and then good luck with your camera ready submissions
Lecture 062
Giving Talks | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=GGx7klcahzY
---
Transcript
[00:00:05] welcome everyone this is part four in
[00:00:06] welcome everyone this is part four in our series on presenting your research
[00:00:08] our series on presenting your research we're going to be talking about the
[00:00:09] we're going to be talking about the possibly thrilling and possibly
[00:00:11] possibly thrilling and possibly nerve-wracking process of giving a
[00:00:13] nerve-wracking process of giving a conference talk in our field
[00:00:15] conference talk in our field let's start with the basic structure of
[00:00:17] let's start with the basic structure of a talk this is pretty easy
[00:00:19] a talk this is pretty easy it's probably going to mirror the
[00:00:20] it's probably going to mirror the structure of papers in our fields but
[00:00:22] structure of papers in our fields but the thing to keep in mind is that the
[00:00:24] the thing to keep in mind is that the talk narrative has to be dramatically
[00:00:26] talk narrative has to be dramatically simpler
[00:00:27] simpler the beginning should start just like a
[00:00:29] the beginning should start just like a paper does and just like an abstract
[00:00:31] paper does and just like an abstract does you need to answer questions that
[00:00:33] does you need to answer questions that involve context what problem are you
[00:00:35] involve context what problem are you solving
[00:00:36] solving why is the problem important what's been
[00:00:38] why is the problem important what's been tried before and why wasn't it a full
[00:00:40] tried before and why wasn't it a full solution these things will contextualize
[00:00:42] solution these things will contextualize your results and set your audience up to
[00:00:44] your results and set your audience up to be prepared for the contribution that
[00:00:46] be prepared for the contribution that you're about to offer
[00:00:47] you're about to offer that will happen in the middle you'll
[00:00:49] that will happen in the middle you'll give concrete details about what data
[00:00:51] give concrete details about what data you used and then crucially what
[00:00:53] you used and then crucially what approach you took and what its details
[00:00:54] approach you took and what its details are up to the level of detail you can
[00:00:57] are up to the level of detail you can supply in a talk like this and also
[00:00:59] supply in a talk like this and also information about your metrics how do
[00:01:01] information about your metrics how do you evaluate success these concrete
[00:01:02] you evaluate success these concrete details will constitute your proposal
[00:01:05] details will constitute your proposal and they're really crucial
[00:01:07] and they're really crucial and then in the final part of the talk
[00:01:08] and then in the final part of the talk you'll offer results in the form of
[00:01:10] you'll offer results in the form of tables and graphs and so forth and you
[00:01:12] tables and graphs and so forth and you might review some aspects of the
[00:01:14] might review some aspects of the analysis from your paper what worked and
[00:01:16] analysis from your paper what worked and what didn't
[00:01:17] what didn't what work still needs to be done and
[00:01:19] what work still needs to be done and things like that and then crucially at
[00:01:21] things like that and then crucially at the very end you should be sure to
[00:01:23] the very end you should be sure to articulate
[00:01:24] articulate what you achieved in the work and why
[00:01:26] what you achieved in the work and why it's significant the idea is to leave
[00:01:28] it's significant the idea is to leave the audience with information that will
[00:01:30] the audience with information that will compel them
[00:01:32] compel them to take the time and energy to read your
[00:01:34] to take the time and energy to read your actual paper for the full details on
[00:01:36] actual paper for the full details on your contribution and then that way the
[00:01:38] your contribution and then that way the talk will serve as an effective
[00:01:40] talk will serve as an effective advertisement for the underlying project
[00:01:44] now you can read lots of advice on the
[00:01:46] now you can read lots of advice on the internet about how to give effective
[00:01:48] internet about how to give effective talks in various contexts and i would
[00:01:50] talks in various contexts and i would encourage you to seek it out because
[00:01:51] encourage you to seek it out because some of it might really align with your
[00:01:53] some of it might really align with your style you should keep in mind though
[00:01:55] style you should keep in mind though that styles differ and contexts differ
[00:01:57] that styles differ and contexts differ and so this will be a process of
[00:01:59] and so this will be a process of figuring out what advice is really
[00:02:01] figuring out what advice is really suitable for you
[00:02:02] suitable for you the one thing that i can say with
[00:02:04] the one thing that i can say with confidence is that patrick blackburn's
[00:02:06] confidence is that patrick blackburn's fundamental insight will apply no matter
[00:02:08] fundamental insight will apply no matter what the context and what the style
[00:02:11] what the context and what the style patrick blackburn asks where do good
[00:02:13] patrick blackburn asks where do good talks come from and his answer is
[00:02:15] talks come from and his answer is honesty he says a good talk should never
[00:02:17] honesty he says a good talk should never stray far from simple honest
[00:02:19] stray far from simple honest communication and if you abide by that
[00:02:22] communication and if you abide by that and you're introspective about where
[00:02:23] and you're introspective about where you've achieved open honest
[00:02:25] you've achieved open honest communication and where you fell short
[00:02:27] communication and where you fell short in various talks that you give and
[00:02:28] in various talks that you give and you're willing to learn from that
[00:02:30] you're willing to learn from that process you will definitely find your
[00:02:32] process you will definitely find your style and become an effective
[00:02:34] style and become an effective communicator about scientific ideas
[00:02:36] communicator about scientific ideas according to what works best for you
[00:02:41] a note about powerpoint you can find
[00:02:43] a note about powerpoint you can find lots of think pieces about how
[00:02:45] lots of think pieces about how powerpoint and related slide
[00:02:47] powerpoint and related slide technologies are inherently kind of evil
[00:02:50] technologies are inherently kind of evil i think they can be used to confuse and
[00:02:52] i think they can be used to confuse and deceive but they also have lots of good
[00:02:54] deceive but they also have lots of good aspects to them and it's not an accident
[00:02:56] aspects to them and it's not an accident that slides are pretty pervasive when
[00:02:58] that slides are pretty pervasive when giving talks in our field and it's just
[00:03:00] giving talks in our field and it's just a matter again of finding your style and
[00:03:02] a matter again of finding your style and thinking about how to use these slides
[00:03:04] thinking about how to use these slides for open honest communication
[00:03:07] for open honest communication in that vein and again this is a matter
[00:03:09] in that vein and again this is a matter of personal style i thought i would
[00:03:10] of personal style i thought i would mention two kind of schools of thought
[00:03:12] mention two kind of schools of thought when it comes to slide design the
[00:03:14] when it comes to slide design the minimalist and the comparative
[00:03:17] minimalist and the comparative so the minimalist would probably just
[00:03:18] so the minimalist would probably just have two words on this slide
[00:03:20] have two words on this slide minimalist and comparative and the rest
[00:03:21] minimalist and comparative and the rest would be delivered via a talk track the
[00:03:24] would be delivered via a talk track the idea behind the minimalist approach is
[00:03:25] idea behind the minimalist approach is slides should be as spare as possible
[00:03:28] slides should be as spare as possible the audience should spend most of their
[00:03:30] the audience should spend most of their time listening to you and
[00:03:32] time listening to you and looking at you
[00:03:33] looking at you and individual slides don't stay up for
[00:03:35] and individual slides don't stay up for very long or get used in more than one
[00:03:37] very long or get used in more than one way they're kind of punctuation for your
[00:03:39] way they're kind of punctuation for your narrative talk track
[00:03:41] narrative talk track by contrast the comparative approach
[00:03:44] by contrast the comparative approach would be you have lots of details on
[00:03:46] would be you have lots of details on your slides slides should be as full as
[00:03:48] your slides slides should be as full as possible without sacrificing clarity
[00:03:50] possible without sacrificing clarity your talk should make it easy for people
[00:03:52] your talk should make it easy for people to spend time studying your slides you
[00:03:54] to spend time studying your slides you have to think about how your narrative
[00:03:56] have to think about how your narrative is going to align with the very detailed
[00:03:57] is going to align with the very detailed slides and individual slides might stay
[00:04:00] slides and individual slides might stay up for a long time and get used to make
[00:04:02] up for a long time and get used to make multiple comparisons and establish
[00:04:04] multiple comparisons and establish numerous connections
[00:04:06] numerous connections i want to emphasize again that this is
[00:04:08] i want to emphasize again that this is really a personal matter the minimalist
[00:04:10] really a personal matter the minimalist view seems right for telling a story
[00:04:12] view seems right for telling a story it's often the best mode when time is of
[00:04:14] it's often the best mode when time is of the essence and the audience is mainly
[00:04:15] the essence and the audience is mainly there to learn about what your paper
[00:04:17] there to learn about what your paper contains
[00:04:18] contains whereas the comparative view seems right
[00:04:20] whereas the comparative view seems right for teaching right it's the closest that
[00:04:22] for teaching right it's the closest that the slides can come to full
[00:04:24] the slides can come to full well-organized chalkboards and things like
[00:04:25] organized chalkboards and things like that where a lot of information might
[00:04:27] that where a lot of information might stay up for a very long time
[00:04:30] stay up for a very long time fundamentally though this is a matter of
[00:04:31] fundamentally though this is a matter of style find the version that works for
[00:04:34] style find the version that works for you for the context you're in and i'll
[00:04:36] you for the context you're in and i'll just say again as long as you think long
[00:04:38] just say again as long as you think long and hard about what it would be like to
[00:04:40] and hard about what it would be like to listen to your talk that is the open
[00:04:42] listen to your talk that is the open communication part and you adjust
[00:04:44] communication part and you adjust accordingly i'm sure that you'll shine
[00:04:46] accordingly i'm sure that you'll shine no matter what approach you choose
[00:04:49] i really like slides when it comes to
[00:04:52] i really like slides when it comes to using them to guide audience attention
[00:04:54] using them to guide audience attention and help people follow the narrative of
[00:04:56] and help people follow the narrative of your talk track one fundamental thing
[00:04:58] your talk track one fundamental thing that you can do for that is make heavy
[00:05:00] that you can do for that is make heavy use of overlays overlays might allow you
[00:05:02] use of overlays overlays might allow you to fill a slide with information in that
[00:05:04] to fill a slide with information in that comparative mode while still
[00:05:06] comparative mode while still keeping the audience with you as you
[00:05:08] keeping the audience with you as you make individual points you can also use
[00:05:11] make individual points you can also use color systematically on slides to create
[00:05:13] color systematically on slides to create distinctions and highlight different
[00:05:14] distinctions and highlight different pieces of information if you use it
[00:05:16] pieces of information if you use it consistently then people will figure out
[00:05:18] consistently then people will figure out that you're using color for one concept
[00:05:21] that you're using color for one concept and that will really help them key into
[00:05:22] and that will really help them key into the structure of your ideas pick an
[00:05:24] the structure of your ideas pick an accessible color palette and then this
[00:05:27] accessible color palette and then this can really be your friend when it comes
[00:05:28] can really be your friend when it comes to communicating with an audience
[00:05:30] to communicating with an audience you can also use size to draw attention
[00:05:32] you can also use size to draw attention to things and boxes and arrows and other
[00:05:35] to things and boxes and arrows and other devices to help people navigate
[00:05:37] devices to help people navigate especially complex information displayed
[00:05:39] especially complex information displayed on your slides this is incredibly useful
[00:05:42] on your slides this is incredibly useful when you are for example displaying a
[00:05:43] when you are for example displaying a figure of results to have boxes as
[00:05:46] figure of results to have boxes as overlays on the individual comparisons
[00:05:49] overlays on the individual comparisons and results that you want to highlight
[00:05:51] and results that you want to highlight same thing for a model diagram you could
[00:05:53] same thing for a model diagram you could show the whole model diagram and then
[00:05:55] show the whole model diagram and then use boxes to highlight different
[00:05:57] use boxes to highlight different pieces of information in the diagram as
[00:05:59] pieces of information in the diagram as you talk about them in your narrative
[00:06:01] you talk about them in your narrative talk track and that can be incredibly
[00:06:03] talk track and that can be incredibly valuable when it comes to helping people
[00:06:05] valuable when it comes to helping people navigate what would otherwise be a very
[00:06:07] navigate what would otherwise be a very complicated looking slide
[00:06:11] of course you could offer the
[00:06:12] of course you could offer the information that i just delivered in the
[00:06:14] information that i just delivered in the more minimalist version this would be
[00:06:15] more minimalist version this would be like overlays color
[00:06:18] like overlays color size boxes and arrows and so forth
[00:06:21] size boxes and arrows and so forth for particular styles that might be
[00:06:23] for particular styles that might be exactly the right mode to talk about
[00:06:24] exactly the right mode to talk about guiding audience attention you can
[00:06:26] guiding audience attention you can probably see that it's not really my
[00:06:27] probably see that it's not really my style but it's certainly a valid style
[00:06:30] style but it's certainly a valid style it can be very effective
[00:06:32] it can be very effective some more mundane things
[00:06:34] some more mundane things turn off any notifications that might
[00:06:36] turn off any notifications that might appear on your screen if you're up in
[00:06:38] appear on your screen if you're up in front of an audience of hundreds of
[00:06:40] front of an audience of hundreds of people and we see a notification about a
[00:06:42] people and we see a notification about a friend's email well it will certainly be
[00:06:44] friend's email well it will certainly be an entertaining thing for your audience
[00:06:46] an entertaining thing for your audience to have seen but it might not be
[00:06:47] to have seen but it might not be something that you wanted to be part of
[00:06:48] something that you wanted to be part of your talk
[00:06:50] your talk make sure your computer is out of power
[00:06:52] make sure your computer is out of power save mode so that the screen doesn't
[00:06:54] save mode so that the screen doesn't shut off while you're talking
[00:06:56] shut off while you're talking projectors can be finicky and even one
[00:06:58] projectors can be finicky and even one time of you losing your screen could
[00:06:59] time of you losing your screen could cause you to lose the projector and burn
[00:07:01] cause you to lose the projector and burn through a bunch of the time that you
[00:07:02] through a bunch of the time that you have allotted for your talk and that
[00:07:04] have allotted for your talk and that could be really sad
[00:07:06] could be really sad shut down running applications that
[00:07:08] shut down running applications that might tax your computer or otherwise get
[00:07:09] might tax your computer or otherwise get in your way again with notifications and
[00:07:11] in your way again with notifications and things like that
[00:07:13] things like that make sure your desktop is clear of files
[00:07:15] make sure your desktop is clear of files and notes that you wouldn't want the
[00:07:16] and notes that you wouldn't want the world to see you know in this day and
[00:07:18] world to see you know in this day and age your desktop might flash for a
[00:07:19] age your desktop might flash for a second before the slides come up and for
[00:07:21] second before the slides come up and for all you know this talk is going to
[00:07:23] all you know this talk is going to end up on youtube for the whole world to
[00:07:25] end up on youtube for the whole world to see so think about the privacy aspects
[00:07:27] see so think about the privacy aspects of this
[00:07:29] of this if you're using powerpoint or keynote or
[00:07:31] if you're using powerpoint or keynote or google slides or something like that
[00:07:32] google slides or something like that create a pdf as a backup if your program
[00:07:35] create a pdf as a backup if your program failed or the internet failed you might
[00:07:38] failed or the internet failed you might not have access to your primary version
[00:07:40] not have access to your primary version having a pdf backup will certainly be
[00:07:42] having a pdf backup will certainly be helpful
[00:07:43] helpful and be prepared for the worst case what
[00:07:45] and be prepared for the worst case what if the projector fails you might really
[00:07:48] if the projector fails you might really be glad that you're prepared to give the
[00:07:50] be glad that you're prepared to give the talk without any slides imagine that
[00:07:52] talk without any slides imagine that scenario the audience will be on your
[00:07:54] scenario the audience will be on your side in the presence of such a failure
[00:07:56] side in the presence of such a failure of technology and it could be really a
[00:07:58] of technology and it could be really a chance for you to shine and the one
[00:08:00] chance for you to shine and the one thing i'll say is that if you're
[00:08:01] thing i'll say is that if you're prepared genuinely prepared to give your
[00:08:03] prepared genuinely prepared to give your talk without slides the resulting talk
[00:08:06] talk without slides the resulting talk will be better because the narrative
[00:08:07] will be better because the narrative part of your talk will be so much
[00:08:09] part of your talk will be so much stronger
[00:08:12] finally the discussion period which
[00:08:14] finally the discussion period which could be an exciting discussion period
[00:08:16] could be an exciting discussion period or the most dreaded part of this whole
[00:08:17] or the most dreaded part of this whole process
[00:08:19] process it's an important part of your
[00:08:20] it's an important part of your presentation though
[00:08:22] presentation though it should be a chance for the audience
[00:08:24] it should be a chance for the audience to gain a deeper understanding of your
[00:08:26] to gain a deeper understanding of your ideas and when the discussion period
[00:08:28] ideas and when the discussion period actually has that aim it's really a joy
[00:08:30] actually has that aim it's really a joy and it feels like you and your audience
[00:08:32] and it feels like you and your audience are moving forward together
[00:08:35] are moving forward together sometimes other things happen though you
[00:08:37] sometimes other things happen though you could get a hostile question or a
[00:08:38] could get a hostile question or a confused questioner or something even
[00:08:41] confused questioner or something even more chaotic be ready for this and just
[00:08:43] more chaotic be ready for this and just try to remain on an even keel no matter
[00:08:46] try to remain on an even keel no matter what happens
[00:08:48] what happens when you get questions after each one
[00:08:50] when you get questions after each one take a pause for a second before
[00:08:52] take a pause for a second before answering this will serve two functions
[00:08:54] answering this will serve two functions first you'll make sure that the person
[00:08:56] first you'll make sure that the person has actually finished asking their
[00:08:57] has actually finished asking their question which i think is socially
[00:08:59] question which i think is socially useful and second it will just make you
[00:09:01] useful and second it will just make you appear more deliberative which is good
[00:09:03] appear more deliberative which is good so even if you know exactly what answer
[00:09:04] so even if you know exactly what answer you want to give taking the pause will
[00:09:06] you want to give taking the pause will create a good impression on your
[00:09:08] create a good impression on your audience
[00:09:10] audience avoid where possible saying i have no
[00:09:12] avoid where possible saying i have no idea in response to things and leaving
[00:09:14] idea in response to things and leaving it at that
[00:09:15] it at that if you truly are floored you might say i
[00:09:17] if you truly are floored you might say i have no idea but let's think about the
[00:09:19] have no idea but let's think about the following other considerations there
[00:09:21] following other considerations there might be cases where you actually just
[00:09:23] might be cases where you actually just want to say i have no idea and leave it
[00:09:24] want to say i have no idea and leave it at that but i think that that should be
[00:09:26] at that but i think that that should be used very sparingly
[00:09:29] used very sparingly most questions won't make total sense to
[00:09:31] most questions won't make total sense to you you have to remember that your
[00:09:33] you you have to remember that your questioner doesn't know the work as well
[00:09:34] questioner doesn't know the work as well as you do they might have lost track of
[00:09:36] as you do they might have lost track of some of the details or gotten distracted
[00:09:38] some of the details or gotten distracted for a moment the question might not
[00:09:41] for a moment the question might not completely make sense you'll feel
[00:09:43] completely make sense you'll feel victorious though
[00:09:44] victorious though if you can warp every question you get
[00:09:47] if you can warp every question you get into one that makes sense and leaves
[00:09:49] into one that makes sense and leaves everyone with the impression that the
[00:09:50] everyone with the impression that the questioner raised an important issue
[00:09:52] questioner raised an important issue again that's another way that you can
[00:09:54] again that's another way that you can create a collective feeling that the
[00:09:56] create a collective feeling that the discussion was productive and as a group
[00:09:58] discussion was productive and as a group kind of move the ideas forward as part
[00:10:00] kind of move the ideas forward as part of this discussion period that's when
[00:10:02] of this discussion period that's when this is really exciting and you should
[00:10:04] this is really exciting and you should do everything you can to strive for such
[00:10:07] do everything you can to strive for such moments knowing though that things could
[00:10:09] moments knowing though that things could go really awry and don't try to
[00:10:11] go really awry and don't try to internalize those too much either this
[00:10:13] internalize those too much either this is a messy process but fundamentally i
[00:10:15] is a messy process but fundamentally i think it can be quite rewarding in the
[00:10:17] think it can be quite rewarding in the end
Lecture 063
Conclusion | Stanford CS224U Natural Language Understanding | Spring 2021
Source: https://www.youtube.com/watch?v=s0GH2pPnJMk
---
Transcript
[00:00:03] hello, we hope you found this course to be enjoyable and rewarding. now that we're wrapping up, your thoughts might be turning to how you can build on what you've learned to conduct original research, to develop new technologies, and so much more.
[00:00:15] the material should position you really well to do this. the models for word representations that we discussed are likely to be valuable components for any task you take on. our relation extraction unit focused on powerful techniques for distant supervision, which is a really common mode for applied problems. and the natural language inference unit is representative of the kind of opportunities and challenges one faces when building deep learning systems with really large datasets.
[00:00:40] and of course the other lectures highlight more diverse application areas and help reveal how even complex cutting-edge models are actually usually made up of familiar modular components.
[00:00:51] and of course by now you've done a lot of work with our notebooks, you've designed three original systems and entered them into bake-offs, and you've completed an original project. this is an unusually high level of hands-on work, and the practical skills you've acquired should serve you well in many domains.
[00:01:06] the field of nlu continues to progress rapidly, and you're now extremely well positioned to follow those changes and even help to shape them. for our part, we'll continue to update and improve our nlu course materials and share them widely, and we hope to see you in a future stanford class soon.
Lecture INDEX.md
CS224U – Natural Language Understanding
Playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rPt5D0zs3YhbWSZA8Q_DyiJ
Total Videos: 63
Transcripts Downloaded: 63
Failed/No Captions: 0
---
Lectures
1. Introduction and Welcome | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=rha64cQRLs8](https://www.youtube.com/watch?v=rha64cQRLs8)
- Transcript: [001_rha64cQRLs8.md](001_rha64cQRLs8.md)
2. Course Overview | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=2w_qYPxuzeA](https://www.youtube.com/watch?v=2w_qYPxuzeA)
- Transcript: [002_2w_qYPxuzeA.md](002_2w_qYPxuzeA.md)
3. Homework 1: Word Relatedness | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=egEzcwbej1E](https://www.youtube.com/watch?v=egEzcwbej1E)
- Transcript: [003_egEzcwbej1E.md](003_egEzcwbej1E.md)
4. High-level Goals & Guiding Hypotheses | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=RiQgRJKqEhE](https://www.youtube.com/watch?v=RiQgRJKqEhE)
- Transcript: [004_RiQgRJKqEhE.md](004_RiQgRJKqEhE.md)
5. Matrix Designs | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=ladnEW0ntEM](https://www.youtube.com/watch?v=ladnEW0ntEM)
- Transcript: [005_ladnEW0ntEM.md](005_ladnEW0ntEM.md)
6. Vector Comparison | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=eKvbYOc2rOs](https://www.youtube.com/watch?v=eKvbYOc2rOs)
- Transcript: [006_eKvbYOc2rOs.md](006_eKvbYOc2rOs.md)
7. Basic Reweighting | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=dv559tVBQRk](https://www.youtube.com/watch?v=dv559tVBQRk)
- Transcript: [007_dv559tVBQRk.md](007_dv559tVBQRk.md)
8. Dimensionality Reduction | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=5Bx5UhrJbJI](https://www.youtube.com/watch?v=5Bx5UhrJbJI)
- Transcript: [008_5Bx5UhrJbJI.md](008_5Bx5UhrJbJI.md)
9. Retrofitting | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=2dVdZ4GPQIk](https://www.youtube.com/watch?v=2dVdZ4GPQIk)
- Transcript: [009_2dVdZ4GPQIk.md](009_2dVdZ4GPQIk.md)
10. Static Representations | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=K7wM6FUV0ds](https://www.youtube.com/watch?v=K7wM6FUV0ds)
- Transcript: [010_K7wM6FUV0ds.md](010_K7wM6FUV0ds.md)
11. Homework 2: Sentiment Analysis | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=e5zRhwc-SqI](https://www.youtube.com/watch?v=e5zRhwc-SqI)
- Transcript: [011_e5zRhwc-SqI.md](011_e5zRhwc-SqI.md)
12. Sentiment Analysis | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=sRw3Dtjhlk0](https://www.youtube.com/watch?v=sRw3Dtjhlk0)
- Transcript: [012_sRw3Dtjhlk0.md](012_sRw3Dtjhlk0.md)
13. General Practical Tips | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=qt-TU_f0HDw](https://www.youtube.com/watch?v=qt-TU_f0HDw)
- Transcript: [013_qt-TU_f0HDw.md](013_qt-TU_f0HDw.md)
14. Stanford Sentiment Treebank | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=DxnXVbHGeBg](https://www.youtube.com/watch?v=DxnXVbHGeBg)
- Transcript: [014_DxnXVbHGeBg.md](014_DxnXVbHGeBg.md)
15. DynaSent | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=o-UFgavFlQg](https://www.youtube.com/watch?v=o-UFgavFlQg)
- Transcript: [015_o-UFgavFlQg.md](015_o-UFgavFlQg.md)
16. sst.py | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=_T5q5fIfzww](https://www.youtube.com/watch?v=_T5q5fIfzww)
- Transcript: [016__T5q5fIfzww.md](016__T5q5fIfzww.md)
17. Hyperparameter Search | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=sO3gWU7y9Ws](https://www.youtube.com/watch?v=sO3gWU7y9Ws)
- Transcript: [017_sO3gWU7y9Ws.md](017_sO3gWU7y9Ws.md)
18. Feature Representation | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=L9ajfq6PJBI](https://www.youtube.com/watch?v=L9ajfq6PJBI)
- Transcript: [018_L9ajfq6PJBI.md](018_L9ajfq6PJBI.md)
19. RNN Classifiers | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=7n9zQ169b8Q](https://www.youtube.com/watch?v=7n9zQ169b8Q)
- Transcript: [019_7n9zQ169b8Q.md](019_7n9zQ169b8Q.md)
20. Contextual Representation Models | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=ZrmEcrmmXCg](https://www.youtube.com/watch?v=ZrmEcrmmXCg)
- Transcript: [020_ZrmEcrmmXCg.md](020_ZrmEcrmmXCg.md)
21. Transformers | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=Nsc0Yluf2yc](https://www.youtube.com/watch?v=Nsc0Yluf2yc)
- Transcript: [021_Nsc0Yluf2yc.md](021_Nsc0Yluf2yc.md)
22. BERT | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=TKcSSwKNg7w](https://www.youtube.com/watch?v=TKcSSwKNg7w)
- Transcript: [022_TKcSSwKNg7w.md](022_TKcSSwKNg7w.md)
23. RoBERTa | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=EZMOBbu_5b8](https://www.youtube.com/watch?v=EZMOBbu_5b8)
- Transcript: [023_EZMOBbu_5b8.md](023_EZMOBbu_5b8.md)
24. ELECTRA | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=6NSRLEiqsoE](https://www.youtube.com/watch?v=6NSRLEiqsoE)
- Transcript: [024_6NSRLEiqsoE.md](024_6NSRLEiqsoE.md)
25. Practical Fine-tuning | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=Ns0JHUXyLE0](https://www.youtube.com/watch?v=Ns0JHUXyLE0)
- Transcript: [025_Ns0JHUXyLE0.md](025_Ns0JHUXyLE0.md)
26. Homework 3: Colors | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=6_R00t5Iyrg](https://www.youtube.com/watch?v=6_R00t5Iyrg)
- Transcript: [026_6_R00t5Iyrg.md](026_6_R00t5Iyrg.md)
27. Grounded Language Understanding | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=OW7aDflHdG0](https://www.youtube.com/watch?v=OW7aDflHdG0)
- Transcript: [027_OW7aDflHdG0.md](027_OW7aDflHdG0.md)
28. Speakers | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=-s5B_7_oeiU](https://www.youtube.com/watch?v=-s5B_7_oeiU)
- Transcript: [028_-s5B_7_oeiU.md](028_-s5B_7_oeiU.md)
29. Listeners | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=xrsc0IOLFSY](https://www.youtube.com/watch?v=xrsc0IOLFSY)
- Transcript: [029_xrsc0IOLFSY.md](029_xrsc0IOLFSY.md)
30. Varieties of contextual grounding | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=3CTttlN8l4o](https://www.youtube.com/watch?v=3CTttlN8l4o)
- Transcript: [030_3CTttlN8l4o.md](030_3CTttlN8l4o.md)
31. The Rational Speech Acts Model | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=pkT0g7utr70](https://www.youtube.com/watch?v=pkT0g7utr70)
- Transcript: [031_pkT0g7utr70.md](031_pkT0g7utr70.md)
32. Neural RSA | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=aTEX9C2JBsE](https://www.youtube.com/watch?v=aTEX9C2JBsE)
- Transcript: [032_aTEX9C2JBsE.md](032_aTEX9C2JBsE.md)
33. Natural Language Inference | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=6-NV9lzm8qw](https://www.youtube.com/watch?v=6-NV9lzm8qw)
- Transcript: [033_6-NV9lzm8qw.md](033_6-NV9lzm8qw.md)
34. SNLI, MultiNLI, and Adversarial NLI | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=NAMNv4M2j3g](https://www.youtube.com/watch?v=NAMNv4M2j3g)
- Transcript: [034_NAMNv4M2j3g.md](034_NAMNv4M2j3g.md)
35. Adversarial Testing | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=qLuAeFdbass](https://www.youtube.com/watch?v=qLuAeFdbass)
- Transcript: [035_qLuAeFdbass.md](035_qLuAeFdbass.md)
36. Modeling Strategies | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=T-ryhSTeXpM](https://www.youtube.com/watch?v=T-ryhSTeXpM)
- Transcript: [036_T-ryhSTeXpM.md](036_T-ryhSTeXpM.md)
37. Attention | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=vJYhPL6U3h4](https://www.youtube.com/watch?v=vJYhPL6U3h4)
- Transcript: [037_vJYhPL6U3h4.md](037_vJYhPL6U3h4.md)
38. NLU and Information Retrieval | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=Bn6RNrwwiI0](https://www.youtube.com/watch?v=Bn6RNrwwiI0)
- Transcript: [038_Bn6RNrwwiI0.md](038_Bn6RNrwwiI0.md)
39. Classical IR | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=e8zKKDMAze8](https://www.youtube.com/watch?v=e8zKKDMAze8)
- Transcript: [039_e8zKKDMAze8.md](039_e8zKKDMAze8.md)
40. Neural IR, part 1 | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=XfYNqwWpoGY](https://www.youtube.com/watch?v=XfYNqwWpoGY)
- Transcript: [040_XfYNqwWpoGY.md](040_XfYNqwWpoGY.md)
41. Neural IR, part 2 | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=IWgjCIguAoA](https://www.youtube.com/watch?v=IWgjCIguAoA)
- Transcript: [041_IWgjCIguAoA.md](041_IWgjCIguAoA.md)
42. Neural IR, part 3 | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=KQMuiO59rGM](https://www.youtube.com/watch?v=KQMuiO59rGM)
- Transcript: [042_KQMuiO59rGM.md](042_KQMuiO59rGM.md)
43. Relation Extraction | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=4AjieiJ1CXo](https://www.youtube.com/watch?v=4AjieiJ1CXo)
- Transcript: [043_4AjieiJ1CXo.md](043_4AjieiJ1CXo.md)
44. Data Resources | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=g4KCti_rZA4](https://www.youtube.com/watch?v=g4KCti_rZA4)
- Transcript: [044_g4KCti_rZA4.md](044_g4KCti_rZA4.md)
45. Problem Formulation | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=JLHL5jAHODs](https://www.youtube.com/watch?v=JLHL5jAHODs)
- Transcript: [045_JLHL5jAHODs.md](045_JLHL5jAHODs.md)
46. Evaluation | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=JIBcv-grQIc](https://www.youtube.com/watch?v=JIBcv-grQIc)
- Transcript: [046_JIBcv-grQIc.md](046_JIBcv-grQIc.md)
47. Simple Baselines | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=70FS4wUJjWQ](https://www.youtube.com/watch?v=70FS4wUJjWQ)
- Transcript: [047_70FS4wUJjWQ.md](047_70FS4wUJjWQ.md)
48. Directions to Explore | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=OZ1inhh7AgA](https://www.youtube.com/watch?v=OZ1inhh7AgA)
- Transcript: [048_OZ1inhh7AgA.md](048_OZ1inhh7AgA.md)
49. Overview of Analysis Methods in NLP | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=rSO_vOynrEw](https://www.youtube.com/watch?v=rSO_vOynrEw)
- Transcript: [049_rSO_vOynrEw.md](049_rSO_vOynrEw.md)
50. Adversarial Testing | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=BilI8LkiAsU](https://www.youtube.com/watch?v=BilI8LkiAsU)
- Transcript: [050_BilI8LkiAsU.md](050_BilI8LkiAsU.md)
51. Adversarial Training (and Testing) | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=mnKQHwfp384](https://www.youtube.com/watch?v=mnKQHwfp384)
- Transcript: [051_mnKQHwfp384.md](051_mnKQHwfp384.md)
52. Probing | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=ElDtkhqv5ZE](https://www.youtube.com/watch?v=ElDtkhqv5ZE)
- Transcript: [052_ElDtkhqv5ZE.md](052_ElDtkhqv5ZE.md)
53. Feature Attribution | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=RFE6xdfJvag](https://www.youtube.com/watch?v=RFE6xdfJvag)
- Transcript: [053_RFE6xdfJvag.md](053_RFE6xdfJvag.md)
54. Overview of Methods and Metrics | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=r9ohMetEMfQ](https://www.youtube.com/watch?v=r9ohMetEMfQ)
- Transcript: [054_r9ohMetEMfQ.md](054_r9ohMetEMfQ.md)
55. Classifier Metrics | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=0RW-aV93Rns](https://www.youtube.com/watch?v=0RW-aV93Rns)
- Transcript: [055_0RW-aV93Rns.md](055_0RW-aV93Rns.md)
56. Natural Language Generation Metrics | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=l-DERqIJjCY](https://www.youtube.com/watch?v=l-DERqIJjCY)
- Transcript: [056_l-DERqIJjCY.md](056_l-DERqIJjCY.md)
57. Data Organization | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=1yLUN57_c1E](https://www.youtube.com/watch?v=1yLUN57_c1E)
- Transcript: [057_1yLUN57_c1E.md](057_1yLUN57_c1E.md)
58. Model Evaluation | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=TxTblROT9lY](https://www.youtube.com/watch?v=TxTblROT9lY)
- Transcript: [058_TxTblROT9lY.md](058_TxTblROT9lY.md)
59. Presenting Your Work: Final Papers | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=yNaDky5E4Wg](https://www.youtube.com/watch?v=yNaDky5E4Wg)
- Transcript: [059_yNaDky5E4Wg.md](059_yNaDky5E4Wg.md)
60. Writing NLP papers | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=DZNwO-p5PGY](https://www.youtube.com/watch?v=DZNwO-p5PGY)
- Transcript: [060_DZNwO-p5PGY.md](060_DZNwO-p5PGY.md)
61. NLP Conference Submissions | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=vb7IN-C7fHs](https://www.youtube.com/watch?v=vb7IN-C7fHs)
- Transcript: [061_vb7IN-C7fHs.md](061_vb7IN-C7fHs.md)
62. Giving Talks | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=GGx7klcahzY](https://www.youtube.com/watch?v=GGx7klcahzY)
- Transcript: [062_GGx7klcahzY.md](062_GGx7klcahzY.md)
63. Conclusion | Stanford CS224U Natural Language Understanding | Spring 2021
- Video: [https://www.youtube.com/watch?v=s0GH2pPnJMk](https://www.youtube.com/watch?v=s0GH2pPnJMk)
- Transcript: [063_s0GH2pPnJMk.md](063_s0GH2pPnJMk.md)