================================================================================
LECTURE 001
================================================================================
Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 1: Introduction
Source: https://www.youtube.com/watch?v=2fq9wYslV0A

--- Transcript

[00:00:05] This is CS231N, and I'm Professor Fei-Fei Li from the Computer Science department. I will be co-teaching this quarter with Professor Ehsan Adeli and my graduate student Zay; you'll meet them, as well as our wonderful TA team, later. So I just want to get started. This is what excites me: AI has become such an interdisciplinary field. What you're going to learn in this class is of course very technical, it's about computer vision and deep learning, but I really do hope that you take it to whichever discipline you work in and are passionate about, and apply it. We hear a lot about the field of AI. So how do we position computer vision and the scope of this class?
[00:01:00] If you consider AI as this big bubble, computer vision is very much an integral part of AI. Some of you have heard me say that not only is vision part of intelligence, it's a cornerstone of intelligence. Unlocking the mystery of visual intelligence is unlocking the mystery of intelligence. But one of the most important mathematical tools for solving AI is machine learning, or what some people call statistical machine learning, and that is exactly what we will be talking about. Within the field of machine learning, in the past ten-plus years we have seen a major revolution called deep learning, and I'll explain a little bit of what deep learning is. Deep learning is a set of algorithmic techniques built around a family of algorithms called neural networks.
[00:02:01] So if you ask me to pinpoint the scope of this class: we'll not be able to cover the entirety of computer vision, and we'll not be able to cover the entirety of machine learning or deep learning, but we're going to cover the core intersection of these two fields. And of course, just like the entirety of AI, computer vision is becoming more and more an interdisciplinary field. A lot of the techniques we use, as well as the problems we work on, intersect with many other fields, like natural language processing, speech recognition, and robotics. And AI as a whole is a field that intersects with mathematics, neuroscience, computer science, psychology, physics, and biology, and with many application areas, from medicine to law to education and business, and so on.
[00:02:54] So here is what you will get in this first lecture: I'll give a very brief history of computer vision and deep learning, and then Professor Adeli will go over the overview of this course, lay the groundwork for how the course is set up, and explain what our expectations are. So, you know, the history of vision did not begin when you were born, or when humanity was born. The history of vision began 540 million years ago. You might ask: what happened 540 million years ago? Why are we pinpointing a relatively specific date in evolution? Well, it's because many fossil studies have shown us there is a mysterious period called the Cambrian explosion, spanning about 10 million years of evolution. During that time, which is a very short period for evolution, we see an explosion of animal species in the fossil record.
[00:04:01] Which means that before the Cambrian explosion, life on Earth was pretty chill. It was actually all in the water: there were no animals on land yet, and animals just floated around. So what caused this explosion in animal speciation? There were many theories, from climate to the chemical composition of the ocean water. But one of the most compelling theories was the onset of eyes: the first animals, trilobites, gained photosensitive cells. The eyes we're talking about were not sophisticated lenses and retinas and nerve cells; it was literally a very simple pinhole, and that pinhole collected light. Once you collect light, life is completely different. Without senses, life is just metabolism: it's very passive, and you come and go.
[00:05:05] With senses, you become an integral part of the environment, one you might want to change, one you want to survive in. Some animals or plants become your dinner, and you become someone else's dinner. So evolutionary forces drive intelligence to evolve, because of the onset of senses: the onset of vision, along with haptics, or tactile sensing, which are the two oldest senses for animals. So that entire course of 540 million years of evolution of vision is the evolution of intelligence. Vision, as one of the primary senses of animals, drove the development of the nervous system and the development of intelligence. Almost all animals on Earth today that we know of have vision, or use vision as one of their primary senses. Humans are especially visual animals.
[00:06:11] More than half of our cortical cells are involved in visual processing, and we have a very complex and convoluted visual system. So this is what excited me to enter the field of vision, and I hope it excites you. Now let's fast forward from the Cambrian explosion to human civilization. Humans do innovate: not only do we see, we want to build machines that see. Here are a couple of drawings by, of course, Leonardo da Vinci, who was just forever curious about everything. He studied the camera obscura, thinking about how to make seeing machines. In fact, even way before him, in ancient Greece and in ancient China, we have documents of thinkers and philosophers thinking about how to project objects through pinholes to create images of those objects.
[00:07:24] And of course, in our modern life, cameras have truly exploded. But cameras are not enough for seeing, just like eyes are not enough for seeing. These are apparatus. We need to understand how visual intelligence happens, and that's really the crux of this course. So let's talk a little bit about the history that brought us to this intersection of deep learning and computer vision. Let me go back to the 1950s. In the 1950s, a set of critically important experiments happened in neuroscience: the study of the visual pathways of mammals, especially the seminal work by Hubel and Wiesel. They placed electrodes into live, anesthetized cats and studied the receptive fields of neurons in the primary visual cortex. What they learned, to their surprise, were two very important things.
[00:08:37] One is that the neurons responsible for seeing in the primary visual cortex have their own individual receptive fields. A receptive field means that for every neuron, there is a part of space that it actually sees. It's not all of space, and it's not very big; it tends to be a very confined patch of space, and within that patch the neuron sees specialized, simple patterns. When you're measuring from the early part of the visual pathway, and by and large in the primary visual cortex, which is around here in the back of the head, not near your eyes, those patterns are oriented edges, or moving oriented edges. So some neurons will be seeing an edge like this, some will be seeing an edge like this, or like this. And that's how vision, the computation in the brain, begins. The second thing they learned is that the visual pathway is hierarchical.
[00:09:44] As you move along the visual pathway, the neurons feed into other neurons, and the neurons in the higher layers, or deeper layers, of the visual hierarchy have more complex receptive fields. So if you begin with oriented edges, you might feed into a corner receptor, and that might feed into an object receptor. I'm overly simplifying, but the concept is that neurons feed into each other and create this big network of computation. Of course, most of you sitting here are already thinking that the way I've been describing this will have a profound impact on modeling, on the neural network modeling of visual algorithms. Let's keep going. That's the year 1959; these are very early studies of seeing.
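[Editor's note] The oriented-edge receptive fields described above can be sketched as tiny convolution filters, loosely analogous to what the first layer of a modern vision network learns. This is a minimal illustration, not anything from the lecture: the Sobel kernels are a classic 1980s-style edge detector chosen here as a stand-in, and the toy image is made up.

```python
# Sketch: oriented "edge detectors" as small convolution kernels
# (Sobel operators, used here as an illustrative assumption).
import numpy as np

# Each kernel responds strongly to edges of one orientation,
# loosely like one of the oriented receptive fields described above.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)  # vertical edges
sobel_y = sobel_x.T                            # horizontal edges

def convolve2d(image, kernel):
    """Valid-mode 2D convolution (strictly, cross-correlation): slide
    the kernel over the image and take elementwise products, no padding."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 image: dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 5))
img[:, 3:] = 1.0

gx = convolve2d(img, sobel_x)  # strong response along the vertical edge
gy = convolve2d(img, sobel_y)  # zero everywhere: no horizontal edges here
```

A bank of such kernels at different rotations is a crude model of a population of orientation-tuned neurons; feeding their outputs into further layers is what gives the more complex receptive fields of the deeper stages of the hierarchy.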
[00:10:50] By the way, about twenty-something years later, Hubel and Wiesel won the Nobel Prize in Medicine for uncovering these principles of visual processing. Another milestone in the early history of computer vision was the first PhD thesis in computer vision. Most people attribute it to Larry Roberts, who in 1963 wrote the first such thesis, studying shape. Shape is a very characteristic representation of the world, and the idea is: can we take a shape like this and understand its surfaces, its corners, and its features, the way humans intuitively do?
[00:12:39] So an entire PhD thesis was devoted to this, and that's the beginning of computer vision. Around that time, in 1966, an MIT professor created a summer project at MIT and asked to hire a few undergrads, very smart ones, to study vision. The goal was pretty much to solve computer vision, to solve vision, in one summer. Of course, just like in the rest of the history of AI, we tend to be over-optimistic about what we can do in a short period of time, so vision did not get solved that summer. In fact, it has blossomed into an incredible computer science field: our annual conferences now have more than 10,000 attendees every year. But the 1960s, between Larry Roberts's PhD thesis and this summer project, is what we in our field consider the beginning of the field of computer vision.
[00:12:54] A seminal book was written in the 1970s by David Marr, who unfortunately died too early. He wanted to study vision systematically and to consider how visual processing happens. Even though it is not explicitly stated, there is a lot of inspiration in it from neuroscience and cognitive science. He was thinking: if you take an input image, how do we visually process and understand that image? Maybe the first layer is more like edges, just like we saw; he calls that the primal sketch. Then there is a 2½-D sketch, which separates the different depths of the objects in the image. So the ball is the foreground object, and the floor here is the background.
[00:13:54] So he builds this 2½-D sketch, and then finally, David Marr believes, the grand holy grail, the victory of solving vision, is to know the entire, full 3D representation. And that is actually the hardest thing about vision. Let me digress for twenty seconds, because if you think about vision, for all animals it's an ill-posed problem. Ever since the early trilobites that collected light underwater, light, the world as photons, has been projected onto a more or less 2D surface. At that time it was just some patch in the animal; right now, for us, it's a retina. But the actual world is 3D. So recovering 3D information, the entire 3D world, from 2D images is the fundamental problem that nature had to solve and that computer vision has to solve, and mathematically it's an ill-posed problem. So what did nature do?
[00:15:17] Anybody have a wild guess? Yes, nature. The trick that nature used is to develop multiple eyes, mostly two; some animals have more than two. And then you triangulate information. But two eyes are not enough: you actually have to understand correspondences and all that. We'll touch on some of these topics, but there are other computer vision classes that Stanford offers that specifically cover 3D vision. The point is, it's a very hard problem, and we have to solve it. Nature has solved it; humans have solved it, but not to extreme precision. In fact, humans are not that precise: I roughly know the 3D shapes around me, but I don't have geometric precision for all of them. So that's one thing to consider, to appreciate how hard this problem is.
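[Editor's note] The two-eye triangulation trick can be made concrete with the textbook rectified-stereo geometry: a nearby point shifts more between the two views than a distant one, and that shift (the disparity) determines depth. The focal length, baseline, and disparity values below are made-up numbers for illustration, not from the lecture.

```python
# Sketch: depth from two views by triangulation (rectified stereo).
# For two parallel cameras, Z = f * B / d, where f is the focal length
# in pixels, B the distance between the cameras in meters, and d the
# horizontal shift (disparity, in pixels) of the point between images.

def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth (meters) of a point seen by both cameras."""
    if disparity_px <= 0:
        raise ValueError("zero disparity means the point is at infinity")
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: a roughly eye-like 6 cm baseline.
near = depth_from_disparity(focal_px=700.0, baseline_m=0.06, disparity_px=42.0)  # about 1 m
far = depth_from_disparity(focal_px=700.0, baseline_m=0.06, disparity_px=3.0)    # about 14 m
```

This is only the easy part of the problem: before you can apply the formula, you have to know which pixel in the left image corresponds to which pixel in the right image, and that correspondence problem is where most of the difficulty lives.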
[00:16:13] Another thing that is very different between computer vision and language is actually something philosophically subtle. Language doesn't exist in nature: you cannot point to something and say "there's language." Language is a purely generated thing, and I don't even know what word to use; it comes out of our brains. It's generated. It's 1D. It's sequential. This actually has a profound implication for the latest wave of generative AI algorithms: this is why LLMs, which are outside the scope of this class, are so powerful, because we can model language that way. But vision is not generated. There is actually a physical world out there, respecting the laws of physics and materials and all that. So vision has very different tasks.
[00:17:14] So I just want you to appreciate the difference between language and vision, and, frankly, to appreciate how nature solved this problem. Okay, let's keep going: the 1970s. The early pioneers of computer vision, without data, without much in the way of powerful computers, and without the mathematical advances we have today, were already beginning to attack some of the harder problems of computer vision, for example, the recognition of objects. Here at Stanford, one of the pioneering works is called generalized cylinders, by Rodney Brooks and Tom Binford. And ironically, Rodney Brooks is on campus today, somewhere over there, giving a talk at a robotics conference. He went on to become one of the greatest roboticists of our time and co-founded the company behind the Roomba and many other robots.
[00:18:16] And then, not very far from us, in another part of Palo Alto, researchers worked on similarly compositional models of the human body and of objects. Then, in the 1980s, digital photos started to appear, or at least photos that people could digitize a little, and there was some great work on edge detection. You look at all this and it probably feels a little disappointing, right? It seems kind of trivial to get some sketches and edges, and not really going anywhere, if that's how vision worked at the time. And in fact, you're not so wrong. That was around the time, before many of you were born, that we entered the AI winter. The field entered an AI winter because the enthusiasm, and hence the funding, for AI research had really dwindled. A lot of things didn't deliver.
Computer vision didn't deliver, expert systems didn't deliver, robotics didn't deliver. But under the hood of this winter, a lot of research started to grow in fields like computer vision, NLP, and robotics. [00:19:39] So let's also look at another strand of research that had profound implications for computer vision: cognitive science and neuroscience continued to blossom. What is really important, especially for the field of computer vision, is that cognitive science and neuroscience started to point us to the north-star problems we should work on. For example, psychologists told us there is something special about seeing nature, seeing the real world. This is a study by Biederman, who showed that the detection of bicycles in two images differs depending on whether the images are scrambled or not. Think about it from the photons' point of view.
These two bicycles land in the same location on your retina, but somehow the rest of the image affects whether the viewer sees the target object. So something is telling us that seeing the entire forest, the entire world, shapes the way we see objects. It also tells us that visual processing is very fast. [00:20:50] Here's another, more direct measure of how fast we detect objects. This is an early-1970s experiment in which people are shown a video, and the subject's task is to detect the human in one of the frames. I suppose every one of you has seen that human in one of the frames. But think about how remarkable your eyes, or your brain, are: you had never seen this video, and I didn't tell you in which frame the target object would appear.
I did not tell you what the target object would look like, where it would be, its pose, or anything like that. Yet you have no problem detecting the humans. On top of that, these frames are played at 10 hertz, which means you see every frame for only 100 milliseconds. That is how remarkable our visual system is. [00:21:48] In fact, Simon Thorpe, another cognitive neuroscientist, measured this speed. You hook people up with EEG caps, show them hundreds of complex natural images, ask the subjects to categorize them as containing animals versus not containing animals, and measure the brain waves. It turned out that within 150 milliseconds of seeing a photo, your brain already carries a differential signal that categorizes the image.
You might not be so impressed, because compared to today's GPUs and modern chips, 150 milliseconds is orders of magnitude slower. But you have to admire our wetware: our brains and neurons don't work as fast as transistors, and 150 milliseconds is actually really fast. It's only a few hops through the brain in terms of neural processing. So yet again, this tells us humans are really good at seeing objects and categorizing them. In fact, not only are we good at seeing and categorizing objects, we even developed specialized brain areas with expert ability at recognizing faces, or places, or body parts. [00:23:15] These are discoveries by MIT neurophysiologists in the 1990s and the early 21st century. So all these studies tell us that we should not just be studying character shapes or the sketches of images.
We really should go after important, fundamental problems that drive visual intelligence. And one of those problems, the one everything has been pointing us to, is object recognition: object recognition in natural settings. There are a lot of objects out there in the world, and studying this was going to be part of unlocking visual intelligence. And that's what we did as a field. [00:24:03] We started by looking at how we can separate foreground objects from background, which was called recognition by grouping, in the 1990s. Keep in mind, we were still in the AI winter, but research was happening and progressing. Then there were studies of features, some of you might still remember SIFT features and matching, and when I entered grad school, the most exciting thing was face detection.
I remember that in my first year of grad school this paper was published, and five years later the first digital cameras used this paper's algorithm to deliver automatic face focus, thanks to face detection. So things started to work and to be taken up by industry. [00:24:58] And then, around the early 21st century, a very important thing happened: the internet happened. When the internet happened, data started to proliferate, and the combination of digital cameras and the internet began to give the field of computer vision some data to work with. In those early days we were working with thousands, or tens of thousands, of images to study the visual recognition problem, the object recognition problem. So you've got datasets like the PASCAL Visual Object Classes challenge or Caltech 101. I'm going to pause here.
This is where the first thread of computer vision started to progress, and you might be wondering why I'm pausing: because I'm going to come back and talk about deep learning. [00:26:00] So while the field of vision was progressing, from neurophysiology to computer vision, to cognitive neuroscience, to computer vision again, a separate effort was going on in parallel that eventually became deep learning. It started from early studies of neural networks, things like the perceptron, and people like Rumelhart started to work on them; and of course Jeff Hinton, in his early days, started to work with small numbers of artificial neurons, looking at how they could process information and learn.
And you've heard of great minds like Marvin Minsky and his colleagues working on different aspects of these perceptrons. But Marvin Minsky also said that perceptrons cannot learn the XOR logic function, and that caused a bit of a setback for neural networks. Well, things continued to progress despite the setback. [00:27:17] One of the most important works before the first inflection point is the Neocognitron, by Fukushima in Japan. Fukushima hand-designed a neural network that looks like this. It has about five or six layers, and he designed the different functions across the layers, which you will learn more about, more or less inspired by the visual pathway I was describing.
Remember the cat experiment, going from simple receptive fields to more complicated receptive fields? He was doing that here: the early layers have simple functions, and the later layers have more complex functions. For the simple ones he used the convolution function, and the more complex ones pooled the information from the convolution layers. So the Neocognitron was really an engineering feat, because every parameter was hand-designed; there are hundreds of parameters he had to meticulously put together so that this small neural network could recognize digits or letters. [00:28:36] The real breakthrough came around that time, in 1986: a learning rule. That learning rule is called backpropagation.
It's going to be one of our first classes. Rumelhart, Jeff Hinton, and their colleagues took the neural network architecture and introduced an error-correcting objective function: if you put in some input and know what the correct output is, you take the difference between what the network outputs and the actual correct answer, then propagate that information back so that you can improve the parameters along the network. That propagation from the output back through the entire network is called backpropagation. It follows basic calculus chain rules, and it was a watershed moment for neural network algorithms.
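The error-correcting loop just described can be sketched in a few lines of NumPy. This is my own illustrative code, not anything from the lecture: a tiny two-layer network trained by backpropagation, and, as a nod to Minsky's objection above, the target is XOR, which a single-layer perceptron cannot learn but one hidden layer can.

```python
import numpy as np

# Illustrative sketch (not from the lecture): backpropagation on a tiny
# two-layer network. The target is XOR, the function Minsky noted a
# single-layer perceptron cannot learn; a hidden layer makes it learnable.

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)   # input  -> hidden
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(10_000):
    # Forward pass: compute the network's output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Error-correcting objective: the difference between the output and the
    # correct answer, propagated backward with the chain rule.
    d_out = (out - y) * out * (1 - out)          # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)           # gradient at the hidden layer
    # Improve every parameter along the network.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print((out > 0.5).astype(int).ravel())           # thresholded XOR predictions
```

The hidden-unit count, learning rate, and squared-error objective here are arbitrary choices for the sketch; the point is only the forward pass, the error at the output, and the chain-rule propagation back to every weight.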
Of course, we were still smack in the middle of the AI winter; all this work was happening without public fanfare, but in the world of research these were very important milestones. [00:30:01] One of the earliest applications of neural networks with backpropagation is Yann LeCun's convolutional neural network, built in the 1990s when he was working at Bell Labs. What he did was create a slightly bigger network, about seven layers, and make it good enough, with great engineering, to recognize letters; it was actually shipped to some US post offices and banks to read digits and letters. So that was an application of early neural networks. Jeff Hinton and Yann LeCun continued to work on neural networks.
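To make the two layer types behind the Neocognitron and LeCun's network concrete, here is a rough sketch of my own (made-up image and filter, max pooling rather than any specific historical subsampling scheme): a convolution pass that slides a small filter over an image, followed by a pooling pass that summarizes each local neighborhood.

```python
import numpy as np

# Rough sketch (my own illustration) of the two layer types described above:
# a convolution layer slides a small filter over the image, and a pooling
# layer summarizes each local neighborhood of the resulting feature map.

def conv2d(image, kernel):
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * size:(i + 1) * size,
                             j * size:(j + 1) * size].max()
    return out

image = np.zeros((8, 8))
image[:, 4] = 1.0                          # a vertical stroke
edge_filter = np.array([[-1.0, 1.0]])      # made-up filter that fires on vertical edges
features = max_pool(conv2d(image, edge_filter))
print(features.shape)                      # a smaller, more abstract map: (4, 3)
```

Stacking such stages, simple filters early, pooled combinations later, is the "simple to complex receptive field" idea from the visual pathway, realized in code.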
It didn't go very far, because despite these improvements and tweaks, neural networks more or less stalled. They collected a big dataset of digits and letters, and digit and letter recognition was kind of quasi-solved. But if you put the system to work on the kind of digital photos the neuroscientists were using, recognizing cats and dogs and microwaves and chairs and flowers, it just didn't work. [00:31:18] A huge part of the problem was the lack of data. And lack of data is not just an inconvenience; it's actually a mathematical problem, because these are high-capacity algorithms that need to be driven by lots of data in order to learn to generalize. There are deep mathematical principles behind these rules of generalization and model overfitting.
And data was underappreciated, overlooked, because most people were just looking at the architectures. They did not realize that data is a first-class citizen of machine learning and deep learning. [00:32:05] So this is part of the work that my lab, my students and I, did in the early 2000s: we recognized this importance of data. We hypothesized that the whole field was missing this, underappreciating the importance of data. So we went out and collected a huge dataset called ImageNet, which has 15 million images after cleaning a billion images, and these 15 million images were sorted across 22,000 categories of objects.
We actually studied a lot of the cognitive science and psychology literature to appreciate that 22,000 categories is roughly on the order of the number of categories humans learn to recognize in the early years of their lives. [00:33:01] We then open-sourced this dataset and created an ImageNet challenge called the Large Scale Visual Recognition Challenge. We curated a subset of ImageNet, a million-plus images and a thousand object classes, and ran an international object recognition challenge for many years. We asked researchers to participate, and their goal was to create algorithms, it didn't matter which kind of algorithms.
Then we would test each algorithm's ability to recognize photos and see if it could call out these thousand object classes as correctly as possible. And here are the errors: the first year we ran this competition, the best-performing algorithm's error was nearly 30%, which is pretty abysmal, because humans can perform at, say, under 3% error. 2011 wasn't that exciting either, but something happened in 2012, the most exciting year. [00:34:14] That year, Jeff Hinton and his students participated in this challenge using a convolutional neural network, and they reduced the error by almost half, truly showing the power of deep learning algorithms. The participating algorithm in the 2012 ImageNet challenge was called AlexNet.
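For concreteness, the headline metric of the ImageNet challenge was top-5 error: an image counts as correct when the true class appears among the model's five highest-scoring classes. Here is a small sketch of that scoring (my own code, with random made-up scores, not the challenge's evaluation server):

```python
import numpy as np

# Sketch (my own) of the ImageNet challenge's headline metric, top-5 error:
# an image is correct if the true class is among the model's five
# highest-scoring classes. The scores below are random noise, for illustration.

def top5_error(scores, labels):
    """scores: (n_images, n_classes); labels: (n_images,) true class ids."""
    top5 = np.argsort(scores, axis=1)[:, -5:]       # five best guesses per image
    hits = (top5 == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(2000, 1000))              # 2000 images, 1000 classes
labels = rng.integers(0, 1000, size=2000)
print(top5_error(scores, labels))                   # random guessing: about 0.995
```

With 1,000 classes, random guessing lands the true class in the top five about 0.5% of the time, which is why the drop from roughly 30% error to roughly half of that in 2012 was such a striking signal.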
And the funny thing is, if you look at AlexNet, it's not that different from Fukushima's Neocognitron of 32 years earlier. But two major things happened in between. One is that backpropagation happened: a principled, mathematically rigorous learning rule, so that you never have to hand-tune parameters, and that was a major theoretical breakthrough. [00:35:10] The other breakthrough was data: the recognition of data, and the understanding that data drives these high-capacity models, which would eventually have trillions of parameters but at that time had millions, was critical for setting off deep learning, for making this work.
And really, many people consider the year 2012, and the AlexNet algorithm that won the ImageNet challenge, the historical moment of the birth, or rebirth, of modern AI, the birth of the deep learning revolution. [00:35:56] And of course, the reason many of you are here is that since then we have been in an era of deep learning explosion. If you look at computer vision's main annual research conference, CVPR, the number of papers has exploded; arXiv papers have exploded; and many new algorithms have since been invented to participate in the ImageNet challenge in the following years. We're going to study some of these algorithms, but the point is that some of them, beyond AlexNet, have had a profound impact on the progress of the field of computer vision and on its applications.
So a lot of things have happened, and we're going to cover some of them. Not only did the field of computer vision make major progress in creating algorithms to recognize everyday objects like cats, dogs, and chairs; very quickly after the 2012 ImageNet moment, we got algorithms that can recognize much more complicated images, retrieve images, do multiple-object detection, and do image segmentation. These are all different tasks in visual recognition that you'll find yourself getting familiar with throughout this course, because vision is not just calling out cats and dogs; there is so much in the nuanced ability of visual recognition. [00:37:51] And of course, vision is not just static images, so there is work in video classification and human activity recognition.
[00:38:02] I'm showing you this overview; you will learn some of these. You don't have to understand exactly what's going on here, but I want you to appreciate the variety of vision tasks. Medical imaging: those of you who come from a medical field know that medicine, whether radiology, pathology, or other specialties, is deeply visual, and this has a profound impact. Scientific discovery, too: even the seminal picture you probably remember, the first photograph of a black hole, uses a lot of computer vision and computational photography techniques.
[00:38:50] Of course, computer vision has also contributed a lot to applications in sustainability and the environment. And we have made a lot of progress in image captioning right after the 2012 ImageNet moment; this is actually work by Andrej Karpathy when he was my student, his thesis work. Then we also worked on relationship understanding: visual intelligence is not only about seeing what's in the pixels, you also see what's beyond the pixels, including relationships between objects. And also style transfer: Justin Johnson, who will come to guest lecture in this course, will tell you all about his seminal work in style transfer.
[00:39:50] And of course, in the generative AI era we get these really incredible results, like face generation, and the very early days of image generation with DALL-E. I think this is the early DALL-E; of course, Midjourney and everything since have gone beyond these avocado and peach chairs. But really, we are squarely in the most exciting modern era of the AI explosion. The three converging forces of computation, algorithms, and data have taken this field to a whole different level, where we're now totally out of the AI winter. I would say we're in an AI global warming period, and I don't see any of this slowing down.
[00:40:50] For both good and bad reasons. And just a word, because we are in Silicon Valley, in the very Huang building, in the NVIDIA lecture hall: we also cannot ignore the progress of hardware and the role it has played. Here is the FLOPS-per-dollar graph for NVIDIA's GPUs. Before 2020 the progress was steady, but as soon as deep learning started to drive these GPUs and chips, the gigaflops just completely took off. By any measure, we are on an accelerating curve of lots of compute as well as lots of AI. And these are different graphs showing conference attendance, startups, and enterprise applications in AI; across not just computer vision but also NLP and other fields, they have just exploded. Okay.
[00:42:06] So quickly, last but not least: it's been exciting, and there have been a lot of successes, but there is still a lot to be done in computer vision. This problem is still not totally solved, and with great tools come great consequences as well, right? Computer vision can do a lot of good, but it can also do harm. For example, human bias. Every large AI algorithm today is driven by data, and data is an artifact of human activities on earth and in history. A lot of that data carries our biases, and this gets carried into AI systems. We have seen a lot of face recognition algorithms exhibit the same kinds of bias that humans have. And we have to recognize that we can also use AI to impact human lives: some for the good, think about medical imaging, but some are questionable.
[00:43:10] What if AI were solely behind deciding your job, or deciding your financial loans? So again: is it totally bad? Is it totally good? These are very complicated issues. This is also why I always get so excited when students from H&S or the law school, the education school, or the business school attend my class, because not all AI issues are engineering issues; we have a lot of human factors and societal issues to solve. I'm also particularly excited by AI's uses in medicine and healthcare. This is something really dear to my heart: Professor Adeli and Zay, who are also co-instructors of this course, and I work on AI for the aging population as well as for patients, trying to use computer vision to deliver care to people. So this is a good use. And also, even in terms of technology, human vision is remarkable.
[00:44:11] I want you to come out of not only today's class but this entire course appreciating that, despite how much computer vision can do, there is just so much more nuance, subtlety, richness, complexity, and also emotion in human vision. Look at these kids studying whatever their curiosity leads them to, or the humor in this image. There is still a lot that computer vision cannot do, and I hope that continues to entice you to study computer vision. At this point, I'm going to give the podium to Professor Adeli to go over the rest of the class. Thank you.

[00:44:50] Awesome. Thank you, Fei-Fei. A great start to the quarter, and I hope my microphone is working right now. Okay, good, I'm seeing some nodding of heads. All right.
[00:45:14] So, very excited to be here with you all, and I'm hoping that you will have a fun and challenging course with the amazing list of co-instructors that we have, and great TAs. In this class we are going to cover a wide variety of topics around computer vision and the use of deep learning in this space, organized into four different topics. We'll start with deep learning basics, and let's actually start with a simple question: what is computer vision, really? At its core, it's about enabling machines to see and understand images. The most fundamental task in this space is image classification: you give the model an image, say of a cat, and the model should output the label "cat", and that's it.
[00:46:28] But this deceptively simple task is the foundation for much more complex applications, from self-driving to medical diagnosis and so on. So how do we teach a machine to do this? One of the simplest approaches is to use linear classification, as you can see in this slide. Imagine each image in our dataset is shown as a dot in that space, and each axis shows some feature derived from the image itself. Here we are showing a 2D space for simplicity, but the task of a linear classifier is to find the hyperplane, the linear function, that separates the two classes, say cats from dogs. But we all know that these linear models often only go so far; they struggle when the data isn't cleanly separable by a straight line. So the question is: what's next?
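As an aside, the linear classifier just described fits in a few lines. This is only an illustrative sketch, not the course's assignment code: the 2D "cat"/"dog" features below are made up, and the perceptron-style update is just one of several ways to fit the separating hyperplane.

```python
# Sketch of a 2D linear classifier trained with perceptron-style updates.
# Toy data and feature values are invented for illustration.

def train_linear(data, epochs=20, lr=0.1):
    """data: list of ((x1, x2), label) pairs with label in {-1, +1}."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            score = w[0] * x1 + w[1] * x2 + b
            if y * score <= 0:          # misclassified: nudge the hyperplane
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

# Toy linearly separable "cats" (+1) vs "dogs" (-1) in a 2D feature space.
toy = [((2.0, 2.5), 1), ((1.5, 3.0), 1), ((3.0, 2.0), 1),
       ((-2.0, -1.5), -1), ((-1.0, -2.5), -1), ((-2.5, -2.0), -1)]
w, b = train_linear(toy)
```

On this toy data the learned hyperplane separates the two clusters cleanly; when the data is not linearly separable, updates like this never settle, which is exactly the limitation just described.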
[00:47:42] We'll get into the topics of how to model more complex patterns, and when we do so, we often face the challenges of overfitting and underfitting, which are topics we will cover in the early lectures of the class. To strike the right balance, we use techniques like regularization, to control model complexity, and optimization, to find the best-fitting parameters. These are the nuts and bolts of deep learning: training models that not only fit the data but also generalize to unseen, new data. And now comes the fun part: neural networks. We've been talking about them quite a lot.
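As a tiny preview of those networks: here is a hand-built two-layer model. This is an illustrative sketch, not course code, and the weights are hand-picked rather than trained; the point is that a nonlinearity like ReLU lets stacked layers compute something no single linear function can, in this case XOR.

```python
# Sketch: a two-layer network with hand-picked (untrained) weights.
# Without the ReLU nonlinearity, the two layers would collapse into one
# linear map; with it, the network computes XOR, which no linear
# classifier can represent.

def relu(x):
    return max(0.0, x)

def tiny_net(x1, x2):
    h1 = relu(x1 + x2)        # hidden unit 1
    h2 = relu(x1 + x2 - 1.0)  # hidden unit 2
    return h1 - 2.0 * h2      # output score

# tiny_net(0, 0) and tiny_net(1, 1) give 0; tiny_net(0, 1) and tiny_net(1, 0) give 1.
```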
[00:48:44] What neural networks do, unlike linear classifiers, is stack multiple layers of operations to model nonlinear functions, to solve the same problem of image classification and so on. These are the models powering everything from Google Photos to, as everybody is now familiar with, ChatGPT's vision models. In this course we will go deep into the details of how they work and how they are trained, and we will look into how to debug and improve them. After the deep learning basics, we will cover the topics of perceiving and understanding the visual world, which is a complex process that involves interpreting a vast array of visual information. To do so, we often first define tasks, which refer to the specific challenges or problems we aim to solve.
[00:50:01] Some examples are object detection, scene understanding, motion detection, and so on. To solve these tasks, we use different models: computational and theoretical frameworks we develop to mimic or explain how our visual system accomplishes these tasks. One example of these types of models is neural networks. By aligning models with tasks, we can create systems that can see and interpret the world around us. Speaking of tasks, let's go back to the topic of image classification: predicting a single label for an entire image. But we know that real-world computer vision is much richer than this, so let's walk through some of the tasks that go beyond classification. First, semantic segmentation, where we are not just labeling one object or the entire image as cat or tree or whatever.
[00:51:22] Here we are looking for a label for every single pixel in the image, so every pixel is grass, cat, tree, or sky; but we don't distinguish between individual objects. Next we have object detection, where we now want not only to say what is in the image but also to pinpoint the locations; that's why we create bounding boxes around the objects and associate them with specific labels. And finally we have instance segmentation, which we will also go into; it is the most granular of them all. It combines the ideas of detection and segmentation, and every object instance gets its own mask. These tasks require much deeper spatial understanding of images, and they push models to do more than just recognizing categories.
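To make the detection task concrete: predicted boxes are usually compared with ground truth using intersection over union (IoU). The sketch below is illustrative only; boxes are (x1, y1, x2, y2) corner coordinates, and the example values are made up.

```python
# Sketch: intersection over union (IoU), the standard overlap score for
# evaluating object-detection bounding boxes. Boxes are (x1, y1, x2, y2).

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area (0 if none)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

An IoU of 1 means a perfect match and 0 means no overlap; detection benchmarks typically count a prediction as correct above a threshold such as 0.5.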
[00:52:29] The complexity doesn't stop with static images; let's look at some temporal dimensions. There's the task of video classification, as Fei-Fei talked about, where we want to understand what's happening in the video: is someone running, jumping, or dancing? There is the topic of multimodal video understanding, which combines vision with sound and other modalities. For example, in this clip the person is playing a vibraphone; to really understand what's happening, we have to create a blend of visual features and audio features. And finally there is the topic of visualization and understanding, which we will be covering in this class, where we want to interpret what's being learned by the models and see an attention frame, or attention map, of what the model is
attending to in order to produce a correct classification, and so on.

[00:53:37] And then, beyond tasks, we have models. The very first topic I'll introduce that we'll be covering is convolutional neural networks, or CNNs. There are a number of operations whose details we will go over in the class: starting from an image, a number of convolution, sampling, and fully connected operations, and finally producing the output. Beyond convolutional neural networks, we will study recurrent neural networks for sequential data, and even newer architectures such as transformers and attention-based frameworks. Next, we will be covering some large-scale distributed training topics, which is new this quarter. I'm sure you've all heard about large language models, large vision models, and so on.
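Looking back at the convolution operation just mentioned: at its core, it is a small filter slid across the image, multiplying and summing at each position. Below is a deliberately naive pure-Python sketch (valid mode only: no padding, stride, or channels), for illustration; real CNN layers apply many learned filters at once.

```python
# Sketch: 2D convolution as used in CNNs, written out naively.
# image and kernel are lists of lists (rows of numbers).

def conv2d(image, kernel):
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):            # slide vertically
        row = []
        for j in range(iw - kw + 1):        # slide horizontally
            s = 0.0
            for di in range(kh):            # multiply-and-sum the window
                for dj in range(kw):
                    s += image[i + di][j + dj] * kernel[di][dj]
            row.append(s)
        out.append(row)
    return out
```

For example, a 2x2 all-ones kernel over a 3x3 image simply sums each 2x2 neighborhood, producing a 2x2 output.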
[00:54:44] And we will briefly discuss how these large models are actually trained. We know that data and datasets are expanding, and models are becoming larger and larger. In order to train such models, there are strategies, for example data parallelism and model parallelism, that we'll cover in this class. But beyond that, there are many challenges, such as synchronization between these models and workers and so on, as well as several other aspects that we'll cover in one of the lectures this quarter; we will also go over some of the trends for training these large models. After completing this topic, what we'll do next is look into generative and interactive visual intelligence, where we will first start with self-supervised learning.
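Before that, here is the gist of data parallelism in miniature. This is an illustrative sketch, not a real distributed setup: the two "workers" run sequentially, the model is a single weight fit to a toy y = 2x dataset, and the averaging step stands in for the all-reduce synchronization real systems perform across devices.

```python
# Sketch: data-parallel training, simulated sequentially.
# Each "worker" computes a gradient on its own shard of the batch; the
# gradients are then averaged (a stand-in for all-reduce) and one shared
# weight is updated. Toy model: y = w * x, squared-error loss.

def local_gradient(w, shard):
    # Mean over the shard of d/dw (w*x - y)^2 = 2*(w*x - y)*x.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    return sum(grads) / len(grads)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
shards = [data[:2], data[2:]]        # two workers, two examples each

w = 0.0
for _ in range(100):
    grads = [local_gradient(w, s) for s in shards]
    w -= 0.05 * all_reduce_mean(grads)   # synchronized update
# w converges to 2.0, the true slope.
```

At scale, making that gradient exchange efficient is exactly the kind of synchronization challenge mentioned above.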
[00:55:54] Self-supervised learning is a branch of machine learning in which models learn to understand and represent data by getting training signals from the data itself. We will cover this topic; it is one of the approaches that has enabled the training of large-scale models using vast amounts of data that do not require labels, unlabeled data, and it has played a key role in recent breakthroughs in computer vision in general. And we will talk a little bit about generative models. They go beyond recognition; they actually generate. This is an example where the content of a Stanford campus photo is reimagined in the style of Van Gogh's Starry Night. This is known as style transfer, a classic application of neural generative techniques. Generative models can now translate language into images given a prompt.
A model like DALL-E 2 generates an entirely novel image. This showcases how generative vision models blend understanding, creativity, and control in their generations. And you've probably heard recently about diffusion models in general; that's another thing we'll be covering this quarter. They basically learn to reverse a gradual noising process to generate images. Interestingly, in assignment three you will actually implement a generative model that generates emojis from text prompts, for example "a face with a cowboy hat", denoised from pure noise. Vision-language models are the next topic of interest we will be covering.
[00:58:13] They connect text and images in a shared representation space, and given a caption or an image, the model retrieves or generates its corresponding pair, as you can see. So there are a lot of advances in this area, and we'll be covering some of the key examples. Again, this is a key task for cross-modal retrieval or understanding, visual question answering, and so on. We'll get to that in the class too. Moving beyond 2D, models can now reconstruct and generate 3D representations from images. Here you can see some voxel-based reconstructions, shape completion, and even 3D object detection from single-view images. So 3D vision enables more spatially grounded understanding, which is crucial for robotics and AR/VR applications. And finally, vision empowers embodied agents that act in the physical world.
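A toy sketch of retrieval in a shared embedding space: given already-computed embedding vectors (the 3-D vectors below are made-up stand-ins for real model outputs), cosine similarity picks the best-matching pair:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate closest to the query in embedding space."""
    sims = [cosine(query_emb, c) for c in candidate_embs]
    return int(np.argmax(sims))

# Toy 3-D embeddings standing in for real image/text encoder outputs.
caption_emb = np.array([1.0, 0.0, 0.2])
image_embs = [np.array([0.0, 1.0, 0.0]),   # unrelated image
              np.array([0.9, 0.1, 0.3])]   # matching image
best = retrieve(caption_emb, image_embs)   # -> 1, the matching image
```

Real vision-language models do the hard part, learning encoders that place matching images and captions near each other; the retrieval step itself is this simple.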
[00:59:35] These models often must perceive, plan, and execute, whether it's cleaning up a messy room or generalizing from human demonstrations. So with all of these, we will be covering different topics around generative and interactive visual intelligence, and finally we will cover some human-centered applications and implications, as was very nicely explained. Computer vision, and AI in general, have been having a lot of impact in the past years, and it's very important to understand the human-centered aspects and applications. Some of these impacts are reflected by the awards that have gone to researchers in this space. This was first recognized by the 2018 Turing Award, which is the most prestigious technical award, given for major contributions of lasting importance to computing.
[01:00:50] Geoffrey Hinton, Yoshua Bengio, and Yann LeCun received the award for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing. Beyond that, last year, in 2024, Geoffrey Hinton was jointly awarded the Nobel Prize in Physics alongside John Hopfield for their foundational contributions to neural networks. And finally, I want to very briefly mention the learning objectives for this class: formalizing computer vision applications into tasks, as you can see in some of the details here; developing and training vision models, models that operate on images and visual data, images, videos, and so on; and gaining an understanding of where the field is and where it is headed. That's why we have some new topics covered specifically this year.
[01:02:00] So for the four topics that I mentioned earlier, we will be going over the basics in the very first few weeks. Bear with us, because these are important topics and you need to understand the details first, how to build the models from scratch, and then we'll get to the more interesting, exciting topics of the day in computer vision. And finally, we will have one big lecture on human-centered AI and computer vision. I want to just leave you with what we'll be covering next session: that's going to be image classification and linear classifiers, which will get us started with the world of CS231N. Thank you.

================================================================================
LECTURE 002
================================================================================
Stanford CS231N | Spring 2025 | Lecture 2: Image Classification with Linear Classifiers
Source: https://www.youtube.com/watch?v=pdqofxJeBN8
---
Transcript

[00:00:05] We will be talking today about image classification.
[00:00:14] Basically, continuing our discussion on the topic of image classification from the last lecture, we'll get a little bit into some topics that get us closer to neural networks and ultimately convolutional neural networks and so on. We'll start with linear classifiers. Moving to the next slide: this was the syllabus that we talked about in the previous lecture, where we did talk about three major categories of topics: deep learning basics, perceiving and understanding the visual world, and reconstructing and interacting with the visual world, as the three major topics, along with some subtopics that we will be covering in the class. At the end, we will have some discussions around the human-centered AI aspects. And today the goal is to cover the first three items: data-driven approaches.
[00:01:32] I will try to tell you what this means, and linear classification, as well as the k-nearest-neighbor algorithm. So, like the previous lecture, let's start with our core task of image classification. Again, it's a core task in computer vision, and we actually come back to this task quite often throughout the quarter, because it's a very good benchmark and we have some examples to show how the algorithms work. So this is one of the items that we come back to quite often. We want to define the image classification task today and then introduce two of the data-driven approaches for image classification: one of them the nearest neighbor, and the other one the linear classifier. There are some other approaches, which we have listed in our backup slides, and you're welcome to look at them after the class.
[00:02:50] But this is what we will be covering. So what is image classification? Given an image and a set of predefined labels, such as in this example, dog, cat, truck, plane, and so on, the job of the system is to assign one of those labels to this image. To us this is actually a very easy task, because our brains, our cognitive system, are wired to get a holistic understanding of this image and assign a label to it. But when it comes to coding this, and looking at how a computer can make sense of this image, that's a completely different story, and we want to see how machines can make sense of such data.
[00:03:54] So images are often defined by matrices of data, or more generally, tensors of data, and often each of the pixel values is between 0 and 255, which is an 8-bit data structure. And since this is a colored image, assuming that it has a resolution of 800 by 600, since it's an RGB image it has three channels: red, green, and blue. And therefore it's a tensor of 800 by 600 by 3, as you can see on the slide. So, as you can probably guess, this is the semantic gap between our perception of this image and how the machine perceives and sees the image, right? And in order to even understand how this could be very challenging, let's look at some challenges, some variations in this type of imaging data.
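A minimal NumPy illustration of the representation just described (note that array libraries typically store images height-first, so an 800-by-600 photo becomes a `(600, 800, 3)` array):

```python
import numpy as np

# An "800 by 600" RGB image: stored height-first as 600 x 800 x 3,
# with 8-bit channel values in [0, 255].
img = np.zeros((600, 800, 3), dtype=np.uint8)

shape = img.shape      # (600, 800, 3): height, width, RGB channels
n_values = img.size    # 600 * 800 * 3 = 1,440,000 raw numbers per image
```

Those 1.44 million raw numbers are all the machine sees; the semantic gap is the distance between that grid and the concept "cat".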
[00:05:15] So let's assume, as one example, that we move the camera. If the camera is moved, for example panning the camera around, even if the cat sits completely and perfectly still, all of those pixel values, every single pixel value of 800 by 600 by 3, will be changed. So all these pixels will have a new value. Again, for us humans, it's the same object; there's absolutely no difference. But from a computer's perspective, it's a completely new data point. So this is one of the challenges, but there are quite a few others as well. For example, illumination is another challenge.
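The camera-pan point can be made concrete: shift a toy image by a few pixels, as a stand-in for panning, and nearly every raw value changes even though the content is identical. A sketch with random pixel data:

```python
import numpy as np

rng = np.random.default_rng(0)
frame1 = rng.integers(0, 256, size=(600, 800, 3), dtype=np.uint8)  # toy "photo"
frame2 = np.roll(frame1, shift=(5, 5), axis=(0, 1))  # same scene, camera panned slightly

# Almost every raw pixel value now differs, though the scene content is unchanged.
fraction_changed = float(np.mean(frame1 != frame2))
```

Any classifier that compares raw pixel values directly will treat `frame2` as an almost entirely new data point, which is exactly the challenge described above.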
[00:06:15] So if you've taken courses in graphics, or maybe other vision courses, or digital image processing courses for engineering applications, you know that the RGB values of each pixel are a function of the surface material, its color, and the light source. And that's why the same cat, the same object, may look different in terms of numbers when it is pictured in different illumination conditions. With that in mind: whether the cat is in a dark room or under the sun, it's still one cat. But this is creating challenges for the machine. Can you maybe name some other challenges that may change the values of the pixels and create problems for the machine to recognize objects, other than the illumination and viewpoint changes that I mentioned?
[00:07:29] Background clutter, background objects? Yes, which is actually our next slide. Yes, background clutter is another challenge. Anything else? Zooming in and out? Yes, so the scale, basically, of the object in the image. What else? The resolution of the image? That could definitely be considered a challenge. But often with machine learning models, or any model that we want to use to recognize objects in images, since we normalize the size of the image, resolution may not be that important unless there are zooming effects on the objects. Occlusion is one of the major problems. Again, as humans it's very easy to say these are cats, even the last one, which is actually very challenging, the one on the right. You can only see a tail and a little bit of, probably, a paw on the right side.
[00:08:37] One could say, yes, that could be a tiger, or it could be, I don't know, a raccoon with a tiny tail. But because of the context, because we know this is inside a living room, on a couch, most probably it's a cat. So again, for us humans, it's not that hard. Beyond that, there are many other problems. Deformation: cats are very deformable, so they create challenges for algorithms to detect and recognize them. I mean, not today's algorithms, but generally for building step-by-step algorithms that can detect objects. So deformation is one of the other major challenges, and beyond that, intra-class variation is one more important challenge. We know that cats can come in different sizes, colors, and patterns; they even have different breeds, and all of those are still cats.
[00:09:59] But for the machines, it's not that easy to recognize the intra-class variations. One other interesting challenge is context, because if you only look at that part, that image on the right, or if an algorithm looks at this without considering the context, it's very easy to classify this as a tiger or some other animal. But because of the context, and because we know that there's the effect of shadows and so on, this could probably be classified correctly. But the thing is that the classifiers we have today can do a really good job at classifying images and identifying the objects in images.
[00:11:02] Thanks to efforts like ImageNet, and also all of the follow-up works that created larger-scale benchmarks for training larger-scale models. And in this class, what we want to do is to get to a place where we build models that can recognize objects and also other aspects within the image. For the rest of this class, we are going to be working towards building, step by step, the building blocks that are needed for building those large algorithms. And before doing so, we have to look at the most basic building block of classifying an image, and that is implementing a function like this.
[00:12:05] So if you've taken some of the computer science or engineering courses that build frameworks through algorithms, for example sorting: as a computer algorithm, it often comes with some if-then-else rules and some for loops and so on. So there's a clear flowchart of tasks, of if-then-else steps, that creates an algorithm for sorting. But when it comes to images and understanding the visual world, that is not happening; that is a challenge. There is no way to hardcode the steps for classifying images. Although there have been some efforts in this space; there are papers that have tried to come up with algorithms and steps to recognize objects. And one of those was based on edge detectors: finding the edges in the image as a first step.
And then after after creating all of these [00:13:23] then after after creating all of these patterns, look at [00:13:26] patterns, look at uh the important patterns. for example, [00:13:28] uh the important patterns. for example, corners. Extract some features that are [00:13:31] corners. Extract some features that are around the corners or count the number [00:13:34] around the corners or count the number of specific types of corners and based [00:13:37] of specific types of corners and based on those from those try to map that into [00:13:43] on those from those try to map that into the output class. So while this is been [00:13:47] the output class. So while this is been an interesting effort and and it had [00:13:50] an interesting effort and and it had some success on very limited um [00:13:54] some success on very limited um variability of type of images but this [00:13:57] variability of type of images but this is very hard to first it's very hard to [00:14:00] is very hard to first it's very hard to scale these types of algorithms. Even if [00:14:02] scale these types of algorithms. Even if it works it's very hard to scale because [00:14:07] it works it's very hard to scale because you have to create these rules and [00:14:09] you have to create these rules and everything for every single object that [00:14:11] everything for every single object that you want to recognize. and second [00:14:14] you want to recognize. and second finding the logic for each of those [00:14:17] finding the logic for each of those requires a lot of uh effort by itself as [00:14:20] requires a lot of uh effort by itself as well. So because of these challenges, I [00:14:23] well. 
[00:14:25] So because of these challenges, I think these types of algorithms, which are based on creating logic and procedures for detecting objects or classifying images, have not been quite successful, and machine learning comes in with this data-driven approach. So with this new paradigm of looking at this problem from a data-driven perspective, we define a three-step process, and the first step is to collect data sets of images and their labels. There are many different ways of doing this: if you want to recognize a specific type of object, or specific types of objects, we can look for data sets, or single data points, over the internet to create many samples of each of the examples. We used to be doing this 10, 20 years ago, using search engines, image search engines, over the internet to create these types of data sets.
[00:15:44] Now we have all of the data sets. And then the second step is using machine learning algorithms to train a classifier: basically, build a function, train, that takes the images in the training data and their associated labels and builds a model that can associate images with the labels. And then the last step would be evaluating the classifier on new images, which means implementing a function called predict that takes the model and some test images, and for those test images, which were not part of the training images, predicts the labels and returns those as the output. So, a very simple procedure, but instead of building a logic, we are building a data-driven approach. As I said, we want to talk about two popular methods and classifiers. One of them is the nearest neighbor classifier.
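The train/predict shape of the recipe can be sketched as two functions. The "learner" here is a deliberately trivial stand-in (always predict the most frequent training label), just to show the interface, not any method from the lecture:

```python
from collections import Counter

def train(images, labels):
    """Step 2: fit a model to (image, label) pairs.
    This toy 'model' just remembers the most frequent training label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return {"default_label": most_common}

def predict(model, test_images):
    """Step 3: return one predicted label per unseen test image."""
    return [model["default_label"] for _ in test_images]

model = train(["img0", "img1", "img2"], ["cat", "cat", "dog"])
preds = predict(model, ["img3", "img4"])   # -> ["cat", "cat"]
```

Every classifier in the course fits this same two-function shape; only what `train` stores and what `predict` computes changes.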
[00:17:01] This is the easiest form of classification. We specifically want to go over it because we can learn some of the concepts around building these classifiers, and it is easier to explain some of the details; then we will move on to the topic of linear classification. Okay, to build the nearest neighbor classifier, as I said, we need to implement the train and predict functions. The train function just needs to memorize all of the data and labels, so it basically does nothing other than keeping everything in memory. Then the predict function looks for the most similar training image: essentially, the classifier keeps a lookup table of all of the images and their labels.
[00:18:05] During prediction, or testing time, it tries to find the closest, most similar image, and outputs the label of that image. Let's look at an example. Assume that we have these five images in our training data (yes, you can see my cursor), and this is the query image, the input image for prediction. What we want to do is see which of these training images is the most similar to it, and for that we need a distance function. This distance function takes two images, each training image paired with the query image, and returns a value which quantifies the similarity between the two inputs. There are many different ways of doing that.
[00:19:13] One of the most popular is the L1 distance, which is defined as the sum over all absolute values of pixel differences between the two images I1 and I2. As an example, if this is a test image and we want to calculate its distance to an image in the training data, we do a pixel-wise subtraction, take the differences between the pixel values, and then sum them up, which defines this value as the distance between the two images. This is the most basic distance function, but it is actually very useful in many applications; we will be coming back to L1 and other variations of distances in the class quite often. With this very simple definition, we want to see how to get it implemented. As I said, the first step is to just memorize the training data.
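The pixel-wise subtraction just described can be written out directly. A minimal sketch on a made-up 2x2 patch; the numbers are illustrative, not the slide's example.

```python
import numpy as np

def l1_distance(i1, i2):
    # L1 distance: sum of absolute pixel-wise differences.
    # Cast to a signed type first so the subtraction cannot wrap
    # around when the inputs are uint8 images.
    return int(np.sum(np.abs(i1.astype(np.int64) - i2.astype(np.int64))))

test_patch  = np.array([[56, 32], [10, 25]], dtype=np.uint8)
train_patch = np.array([[10, 20], [24, 17]], dtype=np.uint8)
print(l1_distance(test_patch, train_patch))  # |56-10|+|32-20|+|10-24|+|25-17| = 80
```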
[00:20:26] So the train function just keeps the data in memory, and then, using Python libraries such as numpy, we can implement the predict function in just four lines: calculate the distances between each of the test samples and the training data, take the minimum for each test sample, and output the label of the training example at that minimum index. That is the implementation of the predict function. Yes, the pixel values, as I explained; in the simplest form, an image is a tensor of 800 by 600 by 3, with three channels, and these are RGB values for each of the pixel locations. I should repeat the questions for the online students too: the question was what the pixel values represent. The next question is why they are between 0 and 255.
[00:21:48] There are many different standards for storing images. The most popular one, used in almost all images that you see online and here, is RGB. RGB is a 24-bit format, sometimes 32 because there is another channel, alpha, which we won't get into. The 24-bit format means that for each of the three channels, red, green, and blue, which combine to create all color combinations, we have eight bits. That's the defined standard. There are some other formats too, but this is the most popular one. So with that, let me go back to the code and ask you a question. [Applause] I know that most of the students come with engineering backgrounds and a little bit of computer science as well.
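The memorize-then-look-up classifier described above, with its roughly four-line numpy predict loop, can be reconstructed as follows. This is a sketch in the spirit of the slide, not the exact code shown in lecture; it assumes images arrive flattened into the rows of a 2-D array, with 8-bit RGB values already converted to floats or ints.

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # X is N x D, one flattened image per row; y holds the N labels.
        # "Training" just memorizes the data: no computation at all.
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        # For each test row: L1 distances to every training row, the
        # index of the minimum, and the label stored at that index.
        y_pred = np.zeros(X.shape[0], dtype=self.ytr.dtype)
        for i in range(X.shape[0]):
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            y_pred[i] = self.ytr[np.argmin(distances)]
        return y_pred

nn = NearestNeighbor()
nn.train(np.array([[0.0, 0.0], [10.0, 10.0]]), np.array([0, 1]))
print(nn.predict(np.array([[1.0, 1.0], [9.0, 9.0]])))  # [0 1]
```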
[00:23:00] But we want to see, with say n samples, n examples in the training data, how fast training and prediction happen. I'm hoping that you're familiar with the big O notation that we often use to represent computational, and sometimes space, complexity. If you look at the algorithm, I'll go through the training function, and then I want you to help me with the answer for prediction. For the training step, training is O(1), because we are not actually doing anything; we are not even moving any data, just keeping a copy of the data in memory. No operations, meaning with a constant number of operations, we can complete the training step. What about the prediction step? For each single example in the test data, how many operations do we need?
[00:24:19] And yes, if we have n training examples, it means that we have to calculate the distance of every single test image to all of the images in the training data, so at least on the order of n operations. This is not really good, because training does nothing, but at test time, during prediction, we spend so much time just doing comparisons between each single data point and the training examples. It would be as if, every time you asked ChatGPT a question, it compared your query against all possible answers over the internet, which would take years, and then returned your response, right? It would not scale, even for very simple problems.
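That asymmetry can be made concrete by counting the work predict does. A hedged sketch with arbitrary sizes: training stores a reference in O(1), while each query touches every number in the training set.

```python
import numpy as np

n_train, n_test, d = 1000, 5, 32
rng = np.random.default_rng(0)
Xtr = rng.random((n_train, d))

# Train: O(1). We just keep a reference; no work proportional to n_train.
model = Xtr

# Predict: each query computes a distance to every training row, so each
# of the n_test queries touches all n_train * d numbers.
ops = 0
for x in rng.random((n_test, d)):
    distances = np.sum(np.abs(model - x), axis=1)  # n_train * d subtractions
    ops += model.size

print(ops == n_test * n_train * d)  # True: prediction cost grows with n_train
```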
[00:25:25] We used to use these types of approaches. What we often want is to build classifiers that are fast during prediction; it's okay if they take a lot of time during training, because that can be done offline. With that in mind: although there have been a lot of efforts to make nearest neighbor much faster, using GPUs and so on, those are beyond the scope of this class; if you're interested, you can take a look at them. But with that, I want to look at some visualizations of how this algorithm works in general. So given this space, we have five classes: red, blue, green, purple and, sorry, yellow. Each dot represents one training sample in that class. If you partition the space for every single point, you see that we can create these five partitions.
[00:26:38] Let's say five, or in this case six, different partitions, such that if you have a test sample in a specific region, the color of that region shows what the nearest neighbor for that sample will be. This is the one-nearest-neighbor algorithm partitioning the space. But do you see a problem in this example? The yellow point is right in the middle of all the greens, which means it is probably an outlier, probably noise, and this is the case in many of the problems we have to solve. The reason there is this big yellow region in the middle is just that single point, and it happens because you're only using one nearest neighbor.
[00:27:43] To make it a little more robust, we can increase the number of nearest neighbors that we take, which turns the nearest neighbor algorithm into k-nearest neighbors. We select more than one point, or sample, and take a majority vote to identify the label of any given test sample, test image. But the problem you can see here is that now we have some white regions. Those white regions are areas where we cannot make a complete decision, because they contain an equal number of samples from the three different classes among the neighbors, and there is no way to identify the label of an example in a white region.
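Majority voting over the k closest points can be sketched like this. An illustrative helper, not course code; note that Counter.most_common breaks exact ties arbitrarily, which is precisely where the white regions in the visualization come from.

```python
import numpy as np
from collections import Counter

def knn_predict(Xtr, ytr, x, k=3):
    # L1 distance from the query to every training point, then a
    # majority vote over the labels of the k nearest ones.
    distances = np.sum(np.abs(Xtr - x), axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(ytr[i] for i in nearest)
    return votes.most_common(1)[0][0]

Xtr = np.array([[0.0], [1.0], [2.0], [10.0]])
ytr = np.array(["green", "green", "yellow", "yellow"])
print(knn_predict(Xtr, ytr, np.array([0.5]), k=3))  # "green": 2 of the 3 neighbors
```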
[00:28:47] And for you, if you create these types of spaces for your own problems, those white areas are good regions to go and collect more data for. They are unclear spaces, so this is a good way of finding regions that are important for more data collection. Okay, so we can go larger on the value of k, and one of the factors that plays an important role is the value of k. But if you remember, we had another decision to make, which was the distance function. We talked about the L1 distance: again, the sum of all absolute values of pair-wise differences of the pixels. If I visualize the L1 distance, which in some contexts we call the Manhattan distance, the distance function is visualized in this way.
[00:30:00] If I look at this square in the space, all of the points on that square have the same distance from the origin, the center point. So this is a good way of visualizing how the L1 distance function works. Another popular distance function that we use is L2, which, instead of the absolute value, calculates the square of the differences and sums them up; because of the square, we also take a square root. Visualizing that, we get the circle, where every point on the circle has the same distance from the origin. This visualization actually helps us understand the differences between these distances, and these are the most basic and simplest distance functions that we can use.
[00:31:09] There are again many more, but the reason this visualization is helpful is the following: x and y in these two visualizations are basically the features. If we have two pixel values, two features, then we have this 2D space, and x and y are those features. If I rotate these features, meaning if I use other types of features, the L1 distance will take a different value, while nothing changes for L2. That is a big difference between L1 and L2. Sometimes, if our features are very specific and meaningful and we want to preserve their information, L1 is often better, because, as you can see, it has a shape that preserves and enforces distances based on the features. But if the features are more arbitrary, then the L2 distance makes more sense.
[00:32:17] So the distances of all the points on this shape from the origin are exactly the same if I use the L1 distance, right? And for the L2 distance, the points on this circle have the same distance from the center, the origin of the space. That is basically what these two images are showing: any point on this shape, when using the L1 distance, has the same distance from the origin, and any point on the circle, if you're using the L2 distance, has the same distance from the origin. Yeah, why is it better to use L1 if we want to preserve the features? To answer that question: if I rotate the feature axes, this distance function changes completely, right? While if I do the same here, nothing changes.
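This rotation point can be checked numerically. A small sketch: rotating both points by the same angle leaves the L2 distance unchanged but, in general, changes the L1 distance.

```python
import numpy as np

def l1(a, b):
    return np.sum(np.abs(a - b))

def l2(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

theta = np.pi / 4  # rotate the feature axes by 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.isclose(l2(R @ a, R @ b), l2(a, b)))  # True: L2 is rotation-invariant
print(np.isclose(l1(R @ a, R @ b), l1(a, b)))  # False: L1 depends on the axes
```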
[00:33:34] It's the exact same value of the distance, sorry. So in this case L1 is very sensitive to the feature values, while L2 is not. If you select another feature in the same space that creates a different shape, then your distance function changes as well. So if I draw the lines here, and again the question for the online students is why it changes if we rotate: if I select another feature that goes from this side, right, then the lines will look different. So if you rotate, this one changes, but for that shape it's agnostic, right? With these two distance functions that we talked about, if I re-visualize the space, you can see, with k equal to one, with one nearest neighbor, the space partitionings under L1 and L2.
[00:34:43] One of the interesting things you can see here is that with the L1 distance most of the boundaries are parallel to the two axes, the two features x1 and x2, very much sensitive to the features, while with L2 we have somewhat smoother boundary separation. There is a tool online, on the lab website, that you can play around with, with different distance functions and different values of k; you can create different setups, so do play around with it. But why did we talk about nearest neighbor to begin with? Yes, it's the easiest solution, the easiest data-driven approach, and great to start with. But one of the main reasons we discuss nearest neighbor is that it lets us look into the topic of hyperparameters.
[00:35:55] Hyperparameters are variables that you have to make a decision about in order to run your algorithm. In this case, the value k, the number of nearest neighbors, is a hyperparameter. Depending on how many nearest neighbors you take, your outputs will be different. Another choice you have here is the distance function. The choice of hyperparameters is often very much data-set dependent, and sometimes problem dependent, and we need a way to identify them, to optimize them, for each single problem. That is what is often referred to as hyperparameter tuning in machine learning and deep learning algorithms. And how do we set the hyperparameters? There are different approaches.
[00:37:03] One of them is to choose the hyperparameters that work best on the training data: you have a set of images or data in your training set, and you look for the hyperparameters that give the best training accuracy, or the minimum training loss. While this works for the training data, it's not a good idea at all, because, especially with nearest neighbor, K = 1 is always the best value, right? You're memorizing the training data, so K = 1 will always give you 100% training accuracy. So we know this is not a great idea. [00:37:50] The second approach is choosing the hyperparameters that work best on a held-out testing set. While this is a little better than the first one, there is also a big problem here. Can anybody say why, exactly?
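A quick sketch of why tuning on the training set fails for nearest neighbor: with K = 1, every training point's nearest neighbor is itself (assuming no duplicate points), so training accuracy is trivially 100%. The brute-force kNN and the toy data here are my own stand-ins, not course code:

```python
import numpy as np

def knn_predict(train_X, train_y, X, k=1):
    # Brute-force kNN with L2 distance; majority vote over the k nearest.
    preds = []
    for x in X:
        dists = np.sqrt(np.sum((train_X - x) ** 2, axis=1))
        nearest = train_y[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

rng = np.random.default_rng(0)
train_X = rng.normal(size=(50, 2))
train_y = rng.integers(0, 3, size=50)

# Evaluating on the training set itself: K = 1 "wins" by memorization,
# since each point's nearest neighbor (distance 0) is the point itself.
train_acc = np.mean(knn_predict(train_X, train_y, train_X, k=1) == train_y)
print(train_acc)  # 1.0
```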
[00:38:11] Yes, it's kind of cheating, because you are trying to find the best hyperparameters for the testing data, and you don't know how the model will work on any data points that are not in the testing set. That is exactly right: it's not a good idea because we don't know how the model will generalize, and for sure never do this; as we said, it's kind of cheating. [00:38:43] A better idea is to always take some part of the training data as a validation set: train your model on the remaining portion, which we call train, then try to optimize your hyperparameters on the validation set, and after you've found the best set of hyperparameters, use them to make the predictions on the testing set.
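The split just described can be sketched like this (the kNN helper, the data, and the candidate K values are hypothetical stand-ins for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels

# Hold out part of the *training* data as a validation set;
# the test set is never touched while tuning.
idx = rng.permutation(len(X))
train_idx, val_idx = idx[:150], idx[150:]

def knn_predict(train_X, train_y, X, k):
    preds = []
    for x in X:
        dists = np.sum((train_X - x) ** 2, axis=1)
        nearest = train_y[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

# Pick the K with the best validation accuracy.
accs = {}
for k in [1, 3, 5, 7]:
    preds = knn_predict(X[train_idx], y[train_idx], X[val_idx], k)
    accs[k] = np.mean(preds == y[val_idx])
best_k = max(accs, key=accs.get)
print(best_k, accs[best_k])
```

Only after `best_k` is fixed would you run the model once on the testing set.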
[00:39:17] So this is a much better approach, although it does have some challenges itself, because sometimes the validation set you've selected may not be a good representative of the entire landscape; your validation set is almost always much smaller. That's why an even better approach is to use cross-validation for setting hyperparameters. [00:39:52] Basically, you split your training data into a number of folds, a number of partitions, in this case five, and each fold plays the role of the validation set once. You run this iteratively, five times for five-fold cross-validation: you set a value of the hyperparameter, run it on all five splits, calculate the accuracy on each validation fold, and average the accuracies.
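A minimal five-fold cross-validation loop matching that description (function names, data, and candidate K values are my own illustrative choices):

```python
import numpy as np

def knn_predict(train_X, train_y, X, k):
    preds = []
    for x in X:
        dists = np.sum((train_X - x) ** 2, axis=1)
        nearest = train_y[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

def cross_val_accuracy(X, y, k, num_folds=5):
    # Split the training data into folds; each fold serves as the
    # validation set exactly once, and the accuracies are averaged.
    folds_X = np.array_split(X, num_folds)
    folds_y = np.array_split(y, num_folds)
    accs = []
    for i in range(num_folds):
        tr_X = np.concatenate([f for j, f in enumerate(folds_X) if j != i])
        tr_y = np.concatenate([f for j, f in enumerate(folds_y) if j != i])
        preds = knn_predict(tr_X, tr_y, folds_X[i], k)
        accs.append(np.mean(preds == folds_y[i]))
    return np.mean(accs)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
# Average validation accuracy for each candidate K, as in the lecture.
scores = {k: cross_val_accuracy(X, y, k) for k in [1, 3, 5]}
print(scores)
```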
[00:40:31] Then you repeat this for multiple hyperparameter settings to find the best one, and once you've found it, you apply it to the testing set. This is a little more reliable and generates much better results. In larger-scale deep learning it is less practiced, though, because repeating everything five times with huge datasets is very hard, so we often use intuition for setting hyperparameters, and a single validation set is sometimes the approach we go with. But outside computer vision and outside large-scale datasets, this is very much advised: research papers often require these kinds of cross-validation and statistical frameworks to make sure your results are reproducible on a testing set. [00:41:29] Anyway, so there are different approaches.
[00:41:31] Let's finalize and wrap up the topic of nearest neighbor and look at some examples and results. [00:41:45] Let me introduce you to the CIFAR-10 dataset. It's one of the datasets you're going to be using quite often in your assignments. It has 10 classes, with 50,000 training images and 10,000 testing images; some examples of the 10 classes are shown here. [00:42:04] For each of the testing images, if we run nearest neighbor and select the top 10 nearest neighbors, they are all visualized there. [00:42:23] As you can imagine and guess, one of the first questions to answer is: what should the value of K be? How many nearest neighbors should we take? A quick experiment with five-fold cross-validation, where each point is the accuracy of one fold for a given value of K, shows the different values here.
[00:42:53] As you can probably see, K = 7 generates the best results in terms of accuracy, which is close to 28%. That's actually not too bad, because this is a 10-class classification problem, and with 10 classes a random guess gets you about 10% accuracy. So this is much better than a random guess; it's working, it's doing something, but there's a lot of room to improve. [00:43:28] If we go back and look at the examples, we can actually see there are many mistakes, especially with the closest neighbor. For example, look at the fourth row: the test image is a frog, but the first retrieved example seems to be a cat; sorry, a dog. You can guess why this is happening: the distance is being applied on pixels, and pixel-wise the images look like each other. They have the same kinds of colors in most pixels, so they end up much closer.
[00:44:08] This example, and many others, shows that distances applied to raw pixel values are not the best choice; in practice we never use them. There are much better approaches that we'll discuss in future lectures. [00:44:28] Just to wrap up the topic, here is another example. If you look at this original image, those three images look very different in terms of color, or occlusion; and the third one from the left is the same image with every pixel shifted one position to the right. Although from a human perspective there is absolutely no difference, the distance between it and the original image is the same as for the other two examples you see here.
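The shift problem can be sketched on a synthetic "image" (random values, not the lecture's figure): shifting every pixel one position produces a large pixel-wise L2 distance even though a human would see the same picture.

```python
import numpy as np

rng = np.random.default_rng(3)
img = rng.integers(0, 256, size=(32, 32)).astype(float)

# Same image content, shifted one pixel to the right (wrapping around).
shifted = np.roll(img, shift=1, axis=1)

# Pixel-wise L2 distance treats this as a big change,
# even though the content is "the same" to a human eye.
dist = np.sqrt(np.sum((img - shifted) ** 2))
print(dist)
```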
[00:45:10] I'll stop for a couple of questions; this is the summary of what we've discussed. So the question is how we make a decision in case of a tie, right? In those cases you often randomly select one of the top classes. [00:45:22] As for collecting more data: if you're solving a problem in genetics, say, or in medical imaging, when you visualize your examples, your features, in this nearest-neighbor space and you see pockets of the space where you don't have any good samples, or where there is ambiguity, then you often try to go and find more samples that lie in that same area of the space. Okay.
[00:46:02] So, summarizing what we've talked about with K-nearest neighbor: it was mostly about understanding the easiest data-driven algorithm, and then talking a little bit about hyperparameter tuning and about how the distance metric and the value of K play a very important role. [00:46:26] Moving on to the next topic, which is linear classifiers. We have about twenty-five minutes, and I want to spend the remaining time of this lecture on this very important topic: it is the most important building block for almost all of deep learning. [00:46:57] First we want to see how this approach is different from nearest neighbor. It is a parametric approach, meaning that now we are learning: we are finding some parameters W, some weights, that map the input image to the output classes, the output numbers.
[00:47:22] In this case, when we create this function f that maps input to output, those outputs are essentially membership scores of the image for each of the 10 output class labels. With this setup, a linear classifier uses the parameters W to map each input x to an output value y. [00:48:01] How this is done is very simple. This image is basically an array of, say, 32 × 32 × 3, so 3,072 numbers, and this defines our x, which is a 3072 × 1 vector. We know we have 10 output classes, so we need 10 different scores, and the output will be a 10 × 1 vector. This means we have to find a weight matrix W of size 10 × 3072 that maps x to the output scores.
[00:48:50] Just to complete this linear function, we often use a bias term as well. It's an input-independent value that actually has a lot of different uses; I can talk about it when I do some geometric visualizations, but it creates a shift for the different class scores and helps with much better separation of each class. [00:49:21] As I said, these linear functions are building blocks for neural networks: each of these linear classifiers, these linear functions, when put together one after the other, creates these large neural networks. There are a lot of other things that need to be added, but this is one of the most important components. If we look at some of the popular neural networks, we can see that linear functions are everywhere in the architectures.
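With the CIFAR-10 shapes above, the score function f(x, W) = Wx + b is a single matrix-vector product. The values here are random, just to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.random(3072)             # flattened 32x32x3 image, shape (3072,)
W = rng.normal(size=(10, 3072))  # one row per class
b = rng.normal(size=10)          # input-independent bias, one per class

scores = W @ x + b               # shape (10,): one score per class
print(scores.shape)  # (10,)
predicted_class = int(np.argmax(scores))
```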
[00:50:06] To better understand what this mapping, this function, is doing, let's go back to our CIFAR-10 example with its training and testing samples, and make it even a little simpler. Instead of looking at large 32 × 32 images, let's look at 2 × 2 images: an input image that has four pixels. The input image is flattened into a vector, and as you can see here, we have to find W and the values of b so that the input image is mapped to some scores as the output. [00:50:51] This is how the linear function looks from an algebraic viewpoint. For the output scores, here we are considering three classes: cat, dog, and ship. As you can see, this function maps the image, the vector representing the image, to those scores. That is the algebraic viewpoint of linear classification.
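The 2 × 2 toy setup in numbers (the specific pixel values, weights, and biases below are illustrative, not necessarily the slide's):

```python
import numpy as np

# A 2x2 "image" flattened into a 4-vector of pixel values.
x = np.array([56.0, 231.0, 24.0, 2.0])

# One row of W per class: cat, dog, ship. Values are made up.
W = np.array([
    [0.2, -0.5, 0.1, 2.0],   # cat template
    [1.5, 1.3, 2.1, 0.0],    # dog template
    [0.0, 0.25, 0.2, -0.3],  # ship template
])
b = np.array([1.1, 3.2, -1.2])  # per-class bias

scores = W @ x + b  # one score per class
for label, s in zip(["cat", "dog", "ship"], scores):
    print(label, s)
```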
[00:51:24] Now let's look at a visual perspective of this linear classifier. As we talked about, for each of the classes we have a row of the matrix W, right? And that row is a kind of template for that specific class. If I separate it like this, the image is multiplied by W, plus b, and each row of W is the template for one of the three classes of cat, dog, and ship. [00:52:08] After training, after building the model on the CIFAR-10 dataset, if I take this visual viewpoint of the linear classifier and look at the templates that are learned for each of the 10 classes, you can see these templates.
[00:52:30] It's very interesting that in some of these cases, for example for car, you do see something template-ish of the front of a car, even though this is all done with just one linear classifier. So that is the visual viewpoint of the linear classifier. [00:52:51] There is also a geometric viewpoint. What the linear classifier does, if we are in a 2D space, is find the lines that separate each class from the others. As you can see here, red, blue, and green define different classes. In a higher-dimensional space, instead of lines we get hyperplanes, as you can see in this example on the left.
[00:53:29] You can also see the use of the bias term here, because if we didn't have the bias, all of these lines would have to pass through the origin, the center of the space, which doesn't really make sense; with the bias we can create more reliable decision boundaries. [00:53:55] So a linear function, a linear classifier, is very useful for many applications, as we talked about, and it's a building block of more complex neural networks, but it does have its own challenges, because there are many arrangements of data it cannot classify. For example, if class one occupies the first and third quadrants and class two occupies the second and fourth, there is no way to linearly separate them.
[00:54:30] Another example is a separation between class one and class two where points whose distance from the origin is between one and two are class one, and everything else is class two. Similarly, if there are three modes, three areas in the space, that belong to one class, and the second class is everything else. In all of these cases it's actually very hard to do the separation. [00:55:01] So we've talked about linear classifiers and how they can map input images into labels in the output. What remains now is how to choose the values W that, for each of these images, map the image to a score for each single class as the output.
[00:55:27] In order to do that, we need to define a loss function, sometimes referred to as an objective function, that quantifies how bad the classifier, how bad the model, is doing: the level of unhappiness with respect to the scores on the training data. [00:55:49] After defining that, we need a way to efficiently change the values of W to minimize that unhappiness, that is, to minimize the loss function. This is the optimization process, the topic of the next lecture. [00:56:15] Again, for simplicity, let's look at an even easier example: a linear function, as you can see here, and the three classes of cat, car, and frog. We need a loss function that tells us how good our current classifier is.
[00:56:40] To do that, we need to parameterize the problem: x_i and y_i define the input images and the corresponding labels. Then we need a loss function, a kind of distance function, that looks at how bad the predicted scores f(x_i, W) are compared to the ground-truth values y_i that are already given. We often normalize by the number of samples as well, so the total loss is L = (1/N) Σ_i L_i(f(x_i, W), y_i). This defines the loss function, the objective function. [00:57:29] So how can we do the optimization and really find the W's? There are different ways of defining this per-example loss L_i, and right now I want to talk about the softmax classifier as an example. [00:57:56] For that cat, if you remember, the scores that were given were 3.2, 5.1, and -1.7.
These are the scores that are the output of the function we discussed, f(x_i, W). [00:58:16] And these scores are unbounded, and the values are often not very controllable, because this is just a linear function, right? In order to turn these into proper scoring functions, the best possible way is to turn them into probabilities, which define the probability of the class being class k for each input image x_i. Right? [00:58:50] And in order to do that, this is the function that we use, the softmax function. We first exponentiate the values of the scores to create these numbers. When we use exp on these numbers, the outputs will always be positive, right? And we need to make sure that the probabilities are always positive.
[00:59:18] And after creating these numbers, what we can do is just normalize them. So: exponentiate, and then normalize by the sum over all of the classes. This creates a very good set of values that defines a probability function, a distribution; they sum to one. [00:59:46] And if I want to interpret this, it's very simple to say that this set of parameters W thinks that this image is a cat with a probability of 13%, or 0.13, right? And obviously it is making a mistake in this example, because this W is not a good setting; we should optimize it and change it. So these probabilities are the counterparts of the unnormalized log probabilities, which are often referred to as logits.
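The two steps just described, exponentiate then normalize, can be sketched in a few lines (assuming NumPy; the max-subtraction is a standard numerical-stability trick, not part of the lecture's definition, and it doesn't change the result):

```python
import numpy as np

def softmax(scores):
    """Turn unbounded class scores (logits) into probabilities that sum to one."""
    shifted = scores - np.max(scores)   # stability trick: exp of large scores would overflow
    exps = np.exp(shifted)              # exponentiate: every output is positive
    return exps / np.sum(exps)          # normalize by the sum over the classes

# The cat example: scores (cat, car, frog) = (3.2, 5.1, -1.7)
probs = softmax(np.array([3.2, 5.1, -1.7]))  # cat probability comes out around 0.13
```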
[01:00:30] So if you've taken other machine learning courses, or I'm sure in other fields, you've used logistic regression. This is the exact same framework as logistic regression, and since we have multiple classes here, it's multinomial logistic regression. [01:00:54] How do we define the function L? I told you that there are different ways of defining the function L. We want to define a loss function, so what's the objective here? We want to maximize the probability of the sample belonging to the correct class, right? So we want to maximize that value, the 0.13, and right now we have other, larger values in that set. [01:01:30] So if we want to maximize this, this is a maximization problem, right? But all of the objectives that we define, we try to build as minimization objective functions.
The first step is just to negate the value, right? We negate it, so the maximization problem turns into a minimization problem. And then we also take the log of the value, just to make the numbers a little bit more manageable. So the negative log of that value defines the objective function, the loss function, for solving this problem. Very simple. [01:02:09] That's the objective, or the loss function, for softmax, for this logistic regression function. And if you've taken other classes, as I said, like CS229, it's often referred to as maximum likelihood estimation as well; it's the same algorithm. [01:02:34] So with that in mind, I want to say that, as we discussed, it's the negative of the log of that probability of the correct class which defines the objective function, the loss function. And that's basically that simple, but there are other types of
interpreting this framework as well. [01:03:02] So one way of redefining this loss function is to say that we have some estimated probabilities, and we also have a probability function that defines the correct probabilities. What we want to do is match these two probability functions, right? And in order to do that, we want to minimize the KL divergence, the Kullback-Leibler divergence. This is an information-theoretic perspective on this loss function. [01:03:42] And again, those are exactly the same: this KL divergence, in this setting, simplifies into the same negative log function that we defined. And even going further, this is exactly the cross-entropy function, because cross-entropy is defined as the entropy of P, the entropy of the correct probabilities, plus the same KL divergence.
[01:04:19] Again, this simplifies into the same negative log function, and that's because when we use a one-hot encoding setting for the classes, the entropy is zero. So that's one of the reasons that we call this function the cross-entropy, or binary cross-entropy, function in all of deep learning. If you've used any of the neural network frameworks, you've probably heard about BCE, binary cross-entropy, or you will be hearing about it a lot. [01:04:46] So this is the same framework. We started very simple, but we got to the similarities and differences between each of those. So the loss function was defined as the negative log of this probability, and the probability was defined by the softmax, which we talked about. And then optimizing for this, which is the topic of the next session, will give us the right W's.
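The chain of equivalences above, negative log probability, KL divergence against a one-hot target, and cross-entropy, can be checked numerically. A sketch (the rounded probabilities below are assumed from the cat example, not taken from the slides):

```python
import numpy as np

def softmax_loss(scores, correct_class):
    """L_i = -log(softmax(scores)[correct_class]): negative log probability of the correct class."""
    shifted = scores - np.max(scores)                 # subtract the max for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[correct_class])

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k, treating 0 * log 0 as 0."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """KL(p || q) = sum_k p_k log(p_k / q_k)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Cat example: scores (cat, car, frog), correct class is cat (index 0)
loss = softmax_loss(np.array([3.2, 5.1, -1.7]), 0)    # roughly -log(0.13)

# With a one-hot ground truth p, H(p) = 0, so H(p, q) = KL(p || q) = -log q[correct]
p = np.array([1.0, 0.0, 0.0])
q = np.array([0.130, 0.869, 0.001])                   # softmax probabilities, rounded
```

With one-hot labels the entropy term H(p) vanishes, which is why minimizing cross-entropy, minimizing KL divergence, and minimizing the negative log likelihood all coincide here.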
[01:05:22] But before I end, I want to ask a couple of questions with this definition that you see here. What is the minimum and maximum value that you can see for the loss function L_i? [01:05:33] Yes, the minimum is zero. And the log of zero turns into minus infinity, but we have a negation there, so the maximum would be infinity. That is correct. [01:05:51] And let me actually look at a second question. Yes, this one. [01:06:00] So when we initialize all of the scores, so basically the W's, in the beginning, it's almost random. So the probabilities of each of the classes become mostly equal. What is the softmax L_i, assuming we have C classes, and especially if C is 10?
[01:06:34] So because the probabilities are equal, it means that all of the probabilities are around 1/C, right? And then the loss will be log of C. If we have 10 classes, then the log, or ln, of 10 is about 2.3, which is the value we would expect to see.

================================================================================
LECTURE 003
================================================================================
Stanford CS231N | Spring 2025 | Lecture 3: Regularization and Optimization
Source: https://www.youtube.com/watch?v=dyNGd06MWn4
---
Transcript

[00:00:05] Today's lecture topic will be about regularization and optimization, which are two very important concepts more broadly in deep learning and machine learning, but especially important for computer vision. And we're going to start with a recap from last week and discuss some of the topics that we discussed last time. [00:00:24] So we really honed in on this idea of image classification as a core task in computer vision. And what this task is
is: given an image as input, you try to map this image to a label inside of a set of labels. So here we have five different labels: cat, dog, bird, deer, and truck. And the goal is to assign the correct label to the input image. You're creating some model, or some function, that takes an image as input and outputs the specific label here. [00:00:55] And we also talked about a lot of the challenges for classification. So one of the main challenges is shown in the top left here, and it's this idea of the semantic gap between what we as humans perceive in the image, which is the cat, and what it's actually represented as in the computer, which is this grid of pixel values, where you have this multi-dimensional array, or tensor, and you have discrete values for each of the pixels. This is
very different from how we're perceiving the image and deciding that this is a cat. So being able to map from this complex numeric representation into one that we humans understand is the core challenge here. [00:01:37] But there are also challenges surrounding the images themselves. So if you look at something like the illumination of the scene: here you'll have different pixel intensities based on where the lighting is in the scene, and you could have certain parts of your object in the shade and harder to see. [00:01:55] Cats by nature are very deformable. So talking about deformable objects: they can move around and twist and bend in different ways, so they won't always have the same shape, and this can prove challenging if you're trying to design an algorithm to detect objects. There's also the challenge of occlusion.
[00:02:09] So you could have a cat that's hiding underneath the couch cushions here, but we as humans can clearly tell this is a cat because of the tail, how it's sort of sticking out at the end here, and, knowing the way that cats behave, we can infer that this is a cat. [00:02:22] You'll also have things like background clutter, where the object could blend into the background, so we need to account for this somehow as well. [00:02:30] And finally, there's this idea of intra-class variation, where different objects in the same category can look very different from each other, but we still need to group them all into the same category. So these are a lot of the challenges of recognition, and why it isn't such a simple problem where you can just write if-else rules to account for everything and use simple logic to classify.
[00:02:51] So if logic's sort of thrown out the window, and you can't just create these logic rules, how do you actually create a classifier? Here's where we talked about data-driven approaches. [00:03:00] And we talked about basically the simplest machine learning model, which is this k-nearest neighbors model. And the idea is that, for a given data point, you look at the existing data points in your training set that are very close in distance to your new data point coming in. [00:03:20] For the one-nearest-neighbor case, this just results in: you find the closest data point, and you assign it that class label. And you can also look at multiple nearest neighbors, where you're assigning the most common class label among those nearest neighbors. So we talked about these two different approaches.
We talked about how you ideally don't want to split your data set into just train and test; you can do train, validation, and test, so that you can use this validation set to actually help you choose your hyperparameters. [00:03:51] So the main hyperparameter for k-nearest neighbors is this k, one or five in these examples. And what we showed is an example where you're plotting your accuracy on the validation set over the different k values, and you would choose the one that has the highest accuracy. [00:04:08] So this is how you'd use the validation set, and then you would reserve the test set for: okay, how does your model do on completely new data it's never seen before? That would be the purpose of the test set. This is all just recap. [00:04:19] There was a bit of confusion about distance metrics.
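A minimal k-nearest-neighbors predictor along these lines, assuming L2 distance and NumPy arrays; this is an illustrative sketch, not the course assignment's implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=1):
    """Predict the most common label among the k training points closest to x_new."""
    dists = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))  # L2 distance to every training point
    nearest = np.argsort(dists)[:k]                          # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                         # majority vote (k=1: the single neighbor)
```

The k here is exactly the hyperparameter you would tune on the validation set.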
We put a post on Ed that explains this in more detail. But we've talked about two different distance metrics, the two most commonly used ones in machine learning, which are the Manhattan distance, or L1 distance, and the L2 distance, or Euclidean distance. [00:04:36] L2 distance is, if you imagine, just the straight-line distance, sort of how we think of distance in everyday usage of the word, geometrically. And then Manhattan distance is this idea where you can only traverse left and right and up and down in this diagram; you can't move diagonally. [00:04:55] So looking at just one quick example here: the reason why all these points on the line are the same distance from the origin is because you can't move diagonally. So you have to move, in this case, up 0.5 and to the right 0.5.
[00:05:08] So the total distance is one, whereas here you're just going in a straight line, but it's one also, the same distance. In the L2 distance, all the points equidistant from the origin form a circle, because you can just go in the direct line here. So this is maybe a brief explanation. [00:05:27] The final thing we honed in on last time was this idea of a linear classifier. So the basic idea, in the basic setting that we did, is we have an image which is, say, width 32 and height 32, and there are three pixel values for each of the spatial locations in our image, representing the red, green, and blue intensities forming the color. [00:05:53] And the idea is we take this array of numbers for our image and we flatten it out into a vector of 3,072 numbers (32 × 32 × 3). And then we're multiplying this vector by our weight matrix W.
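The two metrics, applied to the up-0.5, right-0.5 example from the slide (a sketch assuming NumPy):

```python
import numpy as np

def l1_distance(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences; no diagonal moves."""
    return np.sum(np.abs(a - b))

def l2_distance(a, b):
    """Euclidean (L2) distance: the everyday straight-line distance."""
    return np.sqrt(np.sum((a - b) ** 2))

# The example from the slide: a point 0.5 up and 0.5 right of the origin
origin = np.array([0.0, 0.0])
point = np.array([0.5, 0.5])
# L1: 0.5 + 0.5 = 1; L2: sqrt(0.5^2 + 0.5^2), which is shorter
```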
[00:06:08] And the basic idea is, if we have a weight matrix W that has a height here of 10, and the width is 3,072, we're multiplying each of these rows by our input sample x, and this will give us 10 resulting class scores. [00:06:28] Oftentimes we'll add a bias term as well, which would just be one bias term for each class, so this would be a size-10 vector here. [00:06:36] And we also talked about three different ways you can view or think about these linear models. One is the algebraic viewpoint, which I described here, where each row is represented sort of independently, representing the class, and you multiply it by the input vector x, you get your score, and you add the bias to get your final score. You do each row sort of independently in this sense.
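The algebraic viewpoint can be sketched as follows; the random W, b, and x are placeholders with the CIFAR-10-style shapes from the lecture (10 classes, 3,072 = 32 × 32 × 3 inputs):

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.standard_normal((10, 3072)) * 0.01  # weight matrix: one row per class
b = np.zeros(10)                            # bias: one term per class
x = rng.standard_normal(3072)               # a flattened 32x32x3 input image

scores = W @ x + b                          # f(x, W) = Wx + b: a vector of 10 class scores
# Each score is one row of W dotted with x, plus that class's bias, computed independently
```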
[00:07:02] You can also view these learned class weights as templates, where if we then reshape the vector back into the original shape of the image, we can plot the intensities here and understand what the template per class is, which is what this visualization represents. [00:07:24] And then the final way you can think about it is a sort of geometric viewpoint, where each of these rows in our weight matrix is represented by one of these lines here in our input space, and specifically the line is where we set this equation to zero, which is the decision boundary. So this forms the point where, above the line, you could have a positive score, and below the line you would have a negative score for the class. So these are sort of the different viewpoints for how you can view these linear models. They're all doing the
[00:08:02] They're all doing the same thing. And one nice thing about the geometric viewpoint is that if you visualize your data, say you want to classify blue versus red here, it's very easy to tell that you can't draw a line that perfectly separates the data. So it's a nice way to gain intuition about what is possible for a linear model to do. Okay, I think that's the high-level recap of what we discussed last time. I'll actually be going into a bit more detail on the new content for this lecture now, but I just wanted to pause briefly: if anyone has any questions about what we discussed last time, or at the beginning of this lecture, feel free to ask. Yeah.
[00:08:46] So the question, for those online, is: for this visual viewpoint, is this the same as running k-nearest neighbors, where this would maybe be one of the neighbors you're comparing against; are they mathematically equivalent? No, they're not the same, because these templates are formed from this line, so it's not one specific data point. If we look at this diagram, there's the line pointing in the direction of the class, so the template would be representing something more like this point here. Yeah. So the question is, how did we get this 3,072 number?
[00:09:27] So the idea is that if the height of our image is 32 pixels and the width is 32 pixels, and each location in the image is represented by three values, the red, green, and blue pixel intensities, then we get 32 × 32 × 3 total values to represent the entire image, and that's how we get this 3,072 number. So here's a very specific example of a linear model. When we multiply our input x by our weight matrix W, we get the resulting scores for these different classes. You can see that for cat it's not doing so well, because car has a higher score, and we want the highest score for the correct class. The second example does pretty well, because it gets it right. But then in the frog example it gets it completely wrong: frog is by far the lowest score of the three.
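As a sketch, here is the 32 × 32 × 3 flattening just described, plus the "highest score wins" prediction rule (the image and weights are random placeholder data):

```python
import numpy as np

rng = np.random.default_rng(1)

image = rng.integers(0, 256, size=(32, 32, 3))  # height x width x RGB
x = image.reshape(-1).astype(float)             # flatten to one long vector
print(x.shape)  # (3072,)

# With some (placeholder) weights, the prediction is the highest-scoring class.
W = rng.standard_normal((10, 3072)) * 0.001
scores = W @ x
predicted_class = int(np.argmax(scores))
```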
[00:10:19] So intuitively we can tell that these scores are not very good. But how do we mathematically formalize this intuition, and how do we determine how good a given model is? This is the idea of a loss function, which tells you how good, or more specifically how bad, a classifier is. So given a dataset of examples, where we index by the letter i, x_i is each of the training examples and y_i is each of the training labels. We can compute the loss over our entire dataset by calculating the loss for each training example, sending it through our model, f(x_i, W), comparing the prediction to the ground-truth label y_i, and then just taking the average over the whole dataset. So that's how we do this.
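The averaging just described can be sketched as follows; `per_example_loss` stands in for any per-example loss L_i (for instance the softmax loss), and the function and argument names here are made up for illustration:

```python
import numpy as np

def total_loss(X, y, W, per_example_loss):
    """Average the per-example loss over the whole dataset.

    X: (N, D) training examples, y: (N,) labels, W: (C, D) weight matrix.
    per_example_loss(scores, y_i) computes L_i from the class scores.
    """
    losses = [per_example_loss(W @ X[i], y[i]) for i in range(X.shape[0])]
    return float(np.mean(losses))
```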
[00:11:11] We talked about in the last lecture the softmax loss, or cross-entropy loss, which is the most commonly used loss for classification. I won't discuss that again in as much detail here, but basically it's a very high loss when you predict a low probability for the correct class, and a very low loss when you predict the correct class with very high probability. Everything I just explained is contained within what we call the data loss. This is a loss that tells you how well the model predictions match our training data. Obviously we want this to be very low, and if it's very low, it means our model is fitting our training data well. But there's a second component, which I'll discuss today, which is the regularization term of the loss function.
[00:12:03] What this does is, it's intended to prevent the model from doing too well on the training data. So it actually does worse on the training data, but the goal is to make it do better on new test data, or unseen data. Worse on training, but better on a test set: that's the point of regularization. We'll go over a lot of the intuition for how to think about it in the next slides, but the high-level goal is to do worse on the training data but then better on the test data, or just unseen data. Yeah. So we're computing the loss on each of the i training examples. Yeah, the loss of the i-th example uses x_i and y_i. Does that make sense? I mean, you could not have an i here, but this is just denoting the i-th loss term. Yeah.
[00:12:50] You normally don't have a different loss for each i, if that's what you're asking. We describe L_i as the loss for the i-th training example; that's just the notation we're using here. But yeah, it could be. So for regularization, people usually have this intuition when thinking about it. This is a toy example, and the idea is we want to fit some function to these points, where our input is x and our output is y, and say you have two different types of models, f_1 and f_2, and you're trying to decide which of these is better. f_1 goes through all of our data points, so the training loss, the data loss, will be very low, because it's fitting them basically perfectly. Whereas f_2 doesn't go through every point perfectly, but intuitively it feels like f_2 is probably a better model when we're now testing on new data we've never seen before.
[00:13:44] So regularization captures this intuition that you don't want to overfit your data so hard; you might actually be better off with a model that fits the data less well but is either simpler or has some other properties that make it a better choice. And so if we ask how these models are going to do on new data that's within the same distribution, you'll find that f_2 does a much better job of modeling: it's doing better on the unseen data.
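This toy picture is easy to reproduce with numpy polynomial fitting; the specific data and degrees below are invented for illustration. A degree-9 polynomial plays the role of an f_1 that passes through every training point, and a straight line plays the role of f_2:

```python
import numpy as np

# Points from a simple linear trend, with one noisy training point.
x_train = np.arange(10, dtype=float)
y_train = 2 * x_train + 1
y_train[4] += 5.0  # the "noise" that f_1 will chase

f1 = np.polyfit(x_train, y_train, deg=9)  # hits every training point
f2 = np.polyfit(x_train, y_train, deg=1)  # simple line, misses the noisy point

# Held-out points from the same underlying trend.
x_test = x_train[:-1] + 0.5
y_test = 2 * x_test + 1

def mse(coeffs):
    return float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
# f1 has near-zero training loss but a far larger test error than f2.
```

The interpolating polynomial oscillates between the training points to hit the noisy one, which is exactly the overfitting the slide illustrates.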
[00:14:15] I think there's also an intuition demonstrated very well in this previous example, where we're preferring simpler models. It's like Occam's razor, which is this idea in philosophy and scientific discovery that if you have multiple competing hypotheses, you should go with the simplest one first, and only if you know for sure that it's wrong should you start trying out more complicated ones. That's maybe also some intuition you can have for why regularization can be useful. Okay. And then one final thing about this equation that I didn't touch on yet is this lambda parameter here. This is the regularization strength, which is another hyperparameter, so we might use training and validation sets to set the optimal lambda as well.
[00:14:57] But the basic idea is, we can set this to a floating-point value between zero and infinity, where zero means there is basically no regularization, and as you go up toward infinity you get progressively stronger regularization. So it's very much a tunable knob you have for determining how much you want to prevent the model from fitting to your training data. And I'll go through some simple examples of regularization now. So here we have L2 regularization: you take your weight matrix, square each of the terms in it, and sum them all together. That gives you the value that you then multiply by lambda and add to your total loss. That's L2 regularization. L1 regularization is very similar, but instead of squaring, you take the absolute value.
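A direct numpy translation of those two definitions (a sketch; `lam` here stands for the lambda knob just described):

```python
import numpy as np

def l2_penalty(W):
    # Square every entry of the weight matrix and sum them all.
    return float(np.sum(W ** 2))

def l1_penalty(W):
    # Same idea, but with absolute values instead of squares.
    return float(np.sum(np.abs(W)))

# The full objective is then data_loss + lam * penalty, for example:
# loss = data_loss + lam * l2_penalty(W)
```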
[00:15:51] So in practice there are some differences between how these two regularizers behave when you're training models. One thing that happens with L2 regularization, because you're squaring each of the values: when you have a really small value, say 0.001, squaring it makes it even smaller. So L2 regularization allows for these really small values close to zero, because once you square them they become even smaller, and your penalty on them is very, very low. Whereas with L1 you're not squaring, so the penalty is just whatever the baseline magnitude was; it doesn't get smaller before you compute the regularization term.
[00:16:30] So in practice, what this leads to is that with L1 regularization you get a lot more values in your weight matrix that are actually zero, or very close to zero, whereas with L2 the weights are generally more spread out, with values that are small but nonzero, because the penalty becomes so small. So the question is: it seems pretty clear why L2 prefers spread-out weights that are all small, but why does L1 prefer sparse vectors? I think the way to think of it is that if a value can be zero and your performance is roughly the same, then L1 pushes you toward zeroing that value, whereas with L2 the value might just become very small but nonzero, because of the squaring. And the next question is, can we talk about what pushing toward a zero value means? We're going to talk more about how we use this loss term, but the basic idea is we're trying to minimize it: we're trying to minimize the loss, or minimize the error, of our model. And if we have a term here that's producing positive values without affecting the model's performance on the data loss, we will remove those through the optimization procedure. It's a trade-off: you're trying to optimize the joint sum of the regularization term and the data loss term. So if your data loss isn't changing much but you're able to go lower on the regularization term, you'll get a more optimal model; it will be preferred, based on trying to minimize the overall objective. I think we'll also touch later in the course on much more complex forms of regularization. They're all doing this same basic idea of doing worse on the training data to do better on the test data.
[00:18:14] But some of them will even change the layers of your model, so they actually get pretty complicated. This is an ongoing research area, how to regularize models; there are new papers each year. So there's lots of material here, and we'll only cover a small subset in this course. So, to summarize: why do we regularize models? The first reason is that it allows us to express some sort of preference over weights. If for some reason, in our problem, we think the solution should be spread out, or should contain a lot of sparsity, where a lot of the values in the weight matrix are zero, we might prefer one kind of regularization, L2 versus L1, over another. It also can, depending on how we're regularizing, make the model simpler, so that it works better on test data.
[00:18:59] So it could simplify the model if, say, we're heavily regularizing the really high-degree polynomial terms in our model, as in the example I showed earlier. And something we won't touch on in too much detail is that L2 regularization especially can actually improve the optimization process, because the squared term is like a parabola: if you plot y = x^2, it's a parabola, and these are convex, so you get a lot of nice optimization properties, like having a global minimum. We won't cover that in this course, it's beyond the scope, but know that for certain types of optimization, the regularization actually helps train the model faster too. Okay. I have a question for you all, and what we'll do is, you'll hold up one if it's W1 and two, with your hand, if it's W2.
So uh which of these two um weights w1 w2 would l the [00:19:52] these two um weights w1 w2 would l the l2 regularizer prefer? So we have our [00:19:55] l2 regularizer prefer? So we have our input x. It's when you multiply it, you [00:19:58] input x. It's when you multiply it, you do the dotproduct with the weights, you [00:20:00] do the dotproduct with the weights, you get the same score. So you get a score [00:20:01] get the same score. So you get a score of one either way. And here's where the [00:20:03] of one either way. And here's where the data loss would be the same. And we're [00:20:05] data loss would be the same. And we're trying to determine which of the uh [00:20:08] trying to determine which of the uh weights would our regularizer prefer. So [00:20:10] weights would our regularizer prefer. So go one if you think it's W1 and go two [00:20:13] go one if you think it's W1 and go two if you think it's W2. [00:20:15] if you think it's W2. All right, lots of twos. Yeah, it's W2 [00:20:17] All right, lots of twos. Yeah, it's W2 because as you said, it's more spread [00:20:18] because as you said, it's more spread out. You're going to be squaring each of [00:20:20] out. You're going to be squaring each of these turns. So, it's 1/4. You square [00:20:21] these turns. So, it's 1/4. You square it, becomes 1/16th. You sum it all [00:20:23] it, becomes 1/16th. You sum it all together, it's 1/4 is the total [00:20:26] together, it's 1/4 is the total regularization term here. And then here, [00:20:28] regularization term here. And then here, it's, you know, you square it, so it's [00:20:29] it's, you know, you square it, so it's one. So, it's four times lower in terms [00:20:32] one. So, it's four times lower in terms of the regularization loss. [00:20:34] of the regularization loss. Um, and as we said, the intuition is you [00:20:36] Um, and as we said, the intuition is you like more spread out weights. Um, and [00:20:38] like more spread out weights. 
[00:20:40] And then here's another question: which one would L1 prefer now? So, one if it's weight one and two if it's weight two. Okay, we got a lot of ones. This one's actually a bit of a trick question. With L1 regularization you sum the absolute values of the terms, so they'll both sum to one. In practice, you probably would see this one, because, as we said, L1 prefers sparsity, but from a loss standpoint these two weights would actually be equivalent under L1, because one is just the sum of 0.25 four times, and the other is just one. They both sum to one, and so the actual regularization term is the same for these. Yeah. Okay, so what's an example where L1 would be preferred: if this were, say, 0.9, for example? Okay.
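Both quiz answers are easy to verify numerically. Assuming the values the arithmetic in the discussion implies (x all ones, W1 with a single 1, W2 with four 0.25 entries):

```python
import numpy as np

x  = np.ones(4)
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

# Same dot product, so the data loss is identical either way.
print(w1 @ x, w2 @ x)  # 1.0 1.0

l2 = lambda w: float(np.sum(w ** 2))
l1 = lambda w: float(np.sum(np.abs(w)))

print(l2(w1), l2(w2))  # 1.0 0.25 -> L2 prefers the spread-out w2
print(l1(w1), l1(w2))  # 1.0 1.0  -> L1 is indifferent between them
```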
[00:21:34] So just to recap: we have a dataset of (x, y) pairs, and we have some way to calculate scores for each of the classes, which in our case is just a linear model; you're doing a matrix multiply. The loss for each of the i training examples under the softmax loss, which we discussed last time, is: you exponentiate each of your scores, and then you divide by the total sum of the exponentiated scores. So you exponentiate to make them all positive, and then you normalize by the sum to get a probability distribution. The final values all sum to one, and you have a score for each class. Then you take the minus log of the probability of the correct label, which is given here.
[00:22:15] And the full loss is: you run this over each of your training examples, calculate L_i for each of those, and then you add your regularization term here, which depends on the weights of your model. Why do we use softmax in general? Softmax is great because, as a function, it converts any set of floating-point numbers into a probability distribution that sums to one, and the value of each score translates to the relative probability of that value. So if you have one really high positive number and everything else is a very low negative number, the softmax will be nearly one for that value and almost zero for the others. It's nice because it converts any list of floating-point numbers into a list of probabilities based on the values in the list. That's the main utility of softmax.
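A minimal sketch of the property just described; the max-subtraction is a standard numerical-stability trick (it doesn't change the result, but keeps `exp` from overflowing) and is an addition here, not something from the lecture:

```python
import numpy as np

def softmax(scores):
    # Shift by the max so exp() never overflows; softmax is shift-invariant.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()          # normalize so the outputs sum to one

p = softmax(np.array([3.2, 5.1, -1.7]))
print(p, p.sum())               # a probability distribution summing to 1
# One very large score dominates: the output is nearly one-hot.
print(softmax(np.array([100.0, -100.0, -100.0])))
```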
[00:23:06] So the question is: you can view the regularization we talked about, L1 and L2, as a way of regularizing based on the magnitude of the weights, which is true, so how does that translate to simpler models? I think for L1 the explanation is actually pretty simple, because if we prefer weight vectors with a lot of zeros in them, we get a linear model with fewer coefficients. So that one is relatively straightforward. But in general, regularization is not always going to give you a simpler model; it depends on how it's used. For example, in the diagram we showed at the very beginning, you could imagine L2 or L1 regularization that penalizes the higher-degree polynomial terms of your function more heavily.
[00:23:54] So in that sense it's pretty clear how you could design regularization to prefer a simpler model. But it doesn't always need to be that way. Really what it is is this idea of doing worse on the training data in order to do better on the test data, and that's not always going to give you a simpler model. In fact, there are many types of regularization, like dropout, that actually make your model more complex but give you better performance on the test data. [00:24:23] Cool. So now that we've talked about how to calculate how good a given W is, based on the training data and this regularization term, the question is: how do we actually find the best W? That's what optimization is, which is the second half of today's lecture.
[00:24:45] I think when people describe optimization, they usually use this idea of a loss landscape, which you can think of like a normal landscape on planet Earth, where the vertical, or z-axis, direction is the loss. That's the value you're trying to minimize, and in this example you have two parameters in your model, which are the x and y directions of where you are in the landscape. The idea is that you're basically a person walking around this landscape, trying to find the lowest point in the entire landscape. I think one of the reasons this very commonly used analogy falls apart a little is that, as humans, we can just look into the distance and see the lowest point of the valley.
[00:25:26] But I think the analogy is actually pretty accurate if you think of the person as being blindfolded. They don't have access to any visual information; they can only feel the ground where they are right now and sense the slope at the point where they're standing. Viewed through a math lens, this analogy becomes extremely accurate for how we try to find the best model: we have this complex landscape of different loss values depending on the parameters of our model, and the parameters translate to the location of the person in the landscape. [00:25:55] So how can you find the best point? We could go with a really simple idea, maybe a really bad idea, but it could work.
[00:26:06] Here it's basically a for loop where we try a thousand different values of W at random and just choose the best one. Obviously not very mathematically rigorous, but you will do better than a random baseline, and if you had nothing else to go on, maybe this isn't so bad. You would get about 15.5% accuracy on the CIFAR-10 data set, the one I showed earlier with the frog and the car and so on, with its 10 different categories. But it doesn't perform very well; this data set is basically solved through modern deep learning, where the state of the art gets 99.7% accuracy. So clearly random search isn't terrible, but I wouldn't say it's particularly good either. [00:26:48] Strategy number two, which is sort of what I explained a bit earlier, is this idea of following the slope.
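The random-search for loop described above can be sketched as follows. The lecture's version would evaluate the softmax loss of a linear classifier on CIFAR-10; here a toy quadratic loss and the weight shape are stand-ins so the snippet runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the real data loss; in the lecture this would be the
# softmax loss of a linear classifier W over the training set.
def loss_fn(W):
    return float(((W - 0.5) ** 2).sum())

best_loss, best_W = float("inf"), None
for _ in range(1000):                        # try 1000 random weight matrices
    W = rng.standard_normal((10, 5)) * 0.001  # small random guess
    loss = loss_fn(W)
    if loss < best_loss:                      # keep the best one seen so far
        best_loss, best_W = loss, W

print(best_loss)
```

This is exactly "guess and check": no use of slope information at all, which is why strategy two improves on it so dramatically.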
[00:26:56] For this, you can imagine you're blindfolded on the loss landscape, feeling the ground underneath you and thinking, okay, which way is the slope of the earth pointing me? And you walk in that direction at all times. This basic idea is the fundamental way we train all the models in this course, and the way basically all deep learning models are trained: you're feeling the current location in the loss landscape and walking down the hill. That's the intuitive way to explain it; we'll now go over more of the math behind it, but this is what you should be visualizing in your head. [00:27:33] So how do you actually follow the slope?
[00:27:35] In one dimension, I'm sure you're all familiar with the idea of a derivative, which in calculus we can think of through the limit-h definition: we add a very small number h to our current location, calculate the value of the function at that new location, subtract the value at the current location, and divide by the step size. Taking the limit as h approaches zero gives us the derivative of the function at that point. That's for 1D, but in multiple dimensions you use the gradient, where you calculate essentially this limit definition for each of the values separately. You get a different derivative for each value, so you get a vector instead, and this gives you the direction along each dimension.
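In symbols, the limit definition and its vector analogue just described are (writing L for the loss and W for the d weights):

```latex
\frac{df(x)}{dx} \;=\; \lim_{h \to 0} \frac{f(x+h) - f(x)}{h},
\qquad
\nabla_W L \;=\; \left[\frac{\partial L}{\partial W_1},\; \frac{\partial L}{\partial W_2},\; \dots,\; \frac{\partial L}{\partial W_d}\right]
```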
[00:28:27] You can calculate the slope in any direction by taking the dot product of the gradient with that direction, and specifically the direction of steepest descent, down the hill, is the negative gradient. The gradient points up the hill; the negative gradient points down the hill. So that's the direction we should travel if we're trying to get to the bottom of the loss landscape. [00:28:50] So what are some ways you can calculate the derivative? A really simple one is to actually use the limit-h definition with a very small h. You add, say, 0.00001, and the last few digits of the loss change slightly. You compute the difference, divide by the step size, and get an approximation of the derivative.
[00:29:11] You could do this for each of the values in your W; you just repeat the procedure over and over. But it has a few problems. It's very slow, because you need to loop through every value. It's also approximate: you're not calculating the actual derivative, and especially with floating-point arithmetic you can get pretty significant errors. So this is not really preferred, but the basic intuition is that we could calculate the derivative this way. [00:29:40] But really, we have the loss as a function of W. We know how to calculate the scores to get our loss, which is given by the function for our model, and we can then compute the total loss with the regularization terms as well. This entire loss is a function of the W's, the x_i's, and the y_i's.
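The procedure just described (nudge one entry of W by a tiny h, recompute the loss, divide by h, repeat for every entry) can be sketched like this; the toy quadratic loss is an assumption here so the gradient can be checked against a known answer:

```python
import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    """Approximate dL/dW one coordinate at a time: slow and approximate."""
    grad = np.zeros_like(W)
    base = loss_fn(W)                    # loss at the current W
    it = np.nditer(W, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h                 # nudge one weight by a tiny h
        grad[idx] = (loss_fn(W) - base) / h
        W[idx] = old                     # restore before the next coordinate
    return grad

# Toy loss with a known gradient: L(W) = sum(W^2), so dL/dW = 2W.
W = np.array([[1.0, -2.0], [0.5, 3.0]])
g = numerical_gradient(lambda w: float((w ** 2).sum()), W)
print(g)   # each entry is close to 2*W
```

Note the two drawbacks from the lecture are visible here: one full loss evaluation per weight (slow), and a finite h instead of a limit (approximate).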
[00:30:07] So you have your W matrix, you have your x_i's and y_i's, and then you have this formula with maybe some logs and exponents, but fundamentally this is a function of W, x, and y, and we specifically want to calculate the gradient, written with the Greek letter nabla, of our loss with respect to the weights. So we imagine our x_i's and y_i's are held constant, and we calculate the derivative just with respect to the weights. [00:30:35] To do this we can just use calculus: the chain rule and the other methods we've learned for calculating derivatives of somewhat complex equations. You need some logs, exponents, and chain rules here to solve it.
[00:30:52] This will be an exercise in the homework, so I won't go through it step by step now, but it's relatively straightforward, and conceptually it should make sense to you how to do it: you assume the x's and y's are constant and solve for the derivative as you change W. So now we have a way to calculate dW, the gradient of the loss with respect to W, given our data, the current W, and whatever our loss function is, which tells us how to compute the error. [00:31:20] So here's a summary. You could use the numerical gradient, but it's approximate and slow; the nice thing is that it's very easy to write. You just add a really small h, take the difference, and divide by h.
[00:31:34] The analytic gradient is nice because it's exact and fast, but if you're writing new code to calculate a gradient from scratch, you could have an error in it. So if you're doing this, people normally run a gradient check: they compute the numerical version with a really small h and make sure it lands in the same neighborhood as the analytic gradient. That's a good way to make sure you don't have any bugs in your code. There will be gradient checks in your homework assignments to make sure your implementations are correct as well.
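A gradient check as just described, comparing a hand-written analytic gradient against the numerical one on a toy loss. The relative-error threshold of 1e-4 is a common rule of thumb, an assumption here rather than something fixed by the lecture:

```python
import numpy as np

# Toy loss L(W) = sum(W^3); its analytic gradient is 3*W^2.
def loss_fn(W):
    return float((W ** 3).sum())

def analytic_grad(W):
    return 3 * W ** 2

def numeric_grad(loss_fn, W, h=1e-5):
    # Finite-difference approximation, one coordinate at a time.
    g = np.zeros_like(W)
    for i in range(W.size):
        Wp = W.copy()
        Wp.flat[i] += h
        g.flat[i] = (loss_fn(Wp) - loss_fn(W)) / h
    return g

W = np.random.default_rng(0).standard_normal((3, 4))
ga, gn = analytic_grad(W), numeric_grad(loss_fn, W)
rel_err = np.abs(ga - gn).max() / (np.abs(ga).max() + np.abs(gn).max())
print(rel_err)          # tiny, i.e. "in the same neighborhood"
assert rel_err < 1e-4   # the gradient check passes
```

If the analytic code had a bug (say, `2 * W ** 2`), the relative error would be large and the assertion would fire, which is exactly the point of the check.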
[00:32:02] Yeah. So the question is: we often say we want a loss function that's differentiable, because then we can calculate the gradients; but if we somehow had a better loss function whose gradient we couldn't calculate analytically, could we use this numerical h method instead? I think in general it's hard to construct a better loss function that would be non-differentiable. You possibly could, though, and if there really is a loss function that is best for your case but non-differentiable, you could go with this approach and it may work.
[00:32:41] I think it would struggle if, for example, your loss is truly non-differentiable across all points and is basically a cluster of disconnected points; then moving in the direction of steepest descent wouldn't necessarily get you to your best solution, if the points aren't well connected and forming this sort of geography. So it could work, but I would think that if your loss is non-differentiable across most of the domain, you probably wouldn't be able to use these approaches to find the bottom point. [00:33:14] Yeah. So the TL;DR of the explanation is: if your function is convex, it works very well with this sort of gradient descent or steepest descent approach.
[00:33:24] But if you have a non-differentiable, non-convex function, this approach probably won't work as well, because you won't be stepping in the right direction. It's not necessarily error-prone if your code is perfectly good, but maybe you have a mistake in your code and it's hard to tell right away. The limit-h definition, on the other hand, is very easy to code up: you just set h to a very small value, run your function, and add a very small amount, so it's less error-prone to implement. Okay, not more error-prone, if it's working correctly. [00:33:58] Okay, so now I'll talk about the fundamental algorithm for optimization, called gradient descent. The basic intuition is what we already explained: we calculate the slope at each point on our loss landscape and we take a step in the direction
downwards, towards the bottom of the loss landscape. [00:34:13] So what we do is calculate the gradients of our weights given the loss function, the data, and our current weight values. This tells us how much we should change each of the weights to go down the slope. Then we need a step size: how far down the hill we step in that direction. You go down the hill, so there's the minus sign, and you move by the step size times the gradient. That's basically what gradient descent is: you calculate the gradient at each step and move a fixed amount in the direction of the negative gradient, down the hill. [00:34:51] So here's a concrete example.
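The update rule just described, a step of fixed size in the negative gradient direction, in a minimal sketch. The quadratic bowl loss, the step size of 0.1, and the 100 iterations are illustrative choices, not the lecture's:

```python
import numpy as np

# Quadratic bowl: L(w) = sum((w - 3)^2), minimized at w = 3.
def grad(w):
    return 2 * (w - 3.0)       # analytic gradient of the bowl

w = np.zeros(5)                 # start somewhere on the landscape
step_size = 0.1                 # a.k.a. the learning rate
for _ in range(100):
    w -= step_size * grad(w)   # the minus sign: walk DOWN the hill

print(w)   # close to [3, 3, 3, 3, 3]
```

Note the effective step shrinks automatically as the bowl flattens, since the gradient itself gets smaller near the bottom, which is the behavior discussed on the next slide.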
[00:34:54] Instead of drawing a 3D loss landscape, people often visualize it like this, looking down at the landscape from above, where purple represents the highest points and red represents the bottom, the valley. We can imagine we have our original W; we calculate the loss, and we know the direction of the slope, the negative gradient direction. This arrow might represent the fixed step size we talked about before: we take a fixed-size step in that direction. [00:35:25] Yes. So it is a fixed step size, but as the gradient becomes smaller we're still multiplying it by that fixed step size, so the effective step actually does become smaller, because the gradient shrinks near the end, where the landscape is flatter.
[00:35:41] So this is what it looks like when we always head in the direction of steepest descent. The question is: when we're stepping down, how do we know when to stop? Well, in this formula you just keep looping forever, so you never stop, which is probably not the best. Normally you either run for a predetermined number of iterations, or you look at whether the loss is still changing significantly: you can set a tolerance for how much you expect the loss to keep decreasing by, and if it's only decreasing by 1e-5 or 1e-9, maybe you stop there because it's good enough.
So those are the two ways you can determine when to stop: a fixed number of iterations, or a stopping criterion for how much we're no longer really improving. [00:36:28] Okay. So now I'll talk about the most popular variant of gradient descent, which is called stochastic gradient descent. [00:36:37] When we talked about gradient descent before, we talked about calculating the loss of our weights by summing the loss Li for each i over our entire training set of n examples. But this is potentially a lot of computation if we have a very large data set. So what stochastic gradient descent is, is basically now, instead of looking at the entire data set, we're looking at a subset each time, which we call a minibatch, or a batch of data. And so here, if we look at the code, we're sampling 256 data points from our data set.
So the batch size is 256. We evaluate the gradient on this 256-example subset of our data set, and then we do the same thing as before. The reason why it's called stochastic gradient descent is because we're sampling a random subset of our data set at each step of the algorithm. So this is stochastic gradient descent: you're basically running it on a random subset each time. [00:37:37] And in practice, people won't just sample completely at random. They'll make sure to get through all the examples in their data set and then sort of loop around again. And that's called one epoch of training, where you loop through all your data samples once, in a random order. [00:37:54] Okay. There are some problems with gradient descent, or stochastic gradient descent.
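One epoch of minibatch SGD along these lines might look like the sketch below. This is a hedged toy version, not the slide's code: the data set, the `evaluate_gradient` helper, and the squared-error per-example loss are all assumptions made so the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data set: the per-example loss ||w - x_i||^2 is minimized at the
# data mean, standing in for the real Li from the slides.
data = rng.normal(size=(1000, 4))        # n = 1000 examples, 4 features
w = np.zeros(4)

def evaluate_gradient(batch, w):
    # gradient of the minibatch-averaged loss (1/B) * sum ||w - x_i||^2
    return 2.0 * (w - batch.mean(axis=0))

learning_rate, batch_size = 0.1, 256

# One "epoch": shuffle once, then walk through every example exactly
# once in minibatches, rather than sampling independently each step.
order = rng.permutation(len(data))
for start in range(0, len(data), batch_size):
    batch = data[order[start:start + batch_size]]
    w -= learning_rate * evaluate_gradient(batch, w)   # SGD update
```

Each update only sees one minibatch, so the steps are noisy, but over the epoch every example contributes exactly once.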
So this visualization is sort of the same type as the colored one I showed before, where we're looking down at the loss landscape. But these curves are called level sets: each is a set of points where the loss is the same on all of them. So this is another, very popular, way of visualizing the loss, looking top-down, but without the colors. [00:38:20] And so you could imagine that you have this phenomenon where it's like a really narrow valley, where it's really steep along the sides, and you're trying to traverse the center of the valley, and gradient descent actually does run into issues here. Does anyone have any ideas for what could go wrong? [00:38:39] Yeah. So one of the things you could do is overshoot, where you're sort of moving up and down along this direction.
And if it's steep enough and your step size is large enough, you might actually oscillate out of the valley. So you can imagine, if your step size is very large and this is really steep, you're actually moving out and out each time, because you always have this fixed step size. So if it's steep enough, you could just bounce out of the valley. That actually does happen if your learning rate is too large. So that's one thing that can happen. [00:39:10] And then also, even if your learning rate, or your step size, is not too large, you can have this phenomenon where you're sort of jittering, because the gradient is much larger in the steep direction.
So you're sort of jittering, but you're not making very much meaningful progress towards the actual center, because you're spending all this time oscillating back and forth, up and down. So this is a pretty big issue with just default SGD. [00:39:36] And then, mathematically, just as an aside: the loss function here is considered to have a high condition number, which is the ratio of the largest to smallest singular value of the Hessian matrix, which is the second derivative. So you can imagine, the second derivative along this up-and-down direction is very high, but side to side it's very low, because it's very flat. So that's what causes this phenomenon. [00:40:00] All right. So one other issue we might have with SGD is: what happens if the loss function has a local minimum or a saddle point?
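As a toy numerical illustration of this aside (my own example, not from the slides): for a quadratic loss the Hessian is a constant matrix, so the condition number is easy to read off, and a condition number of 100 is already enough to make fixed-step gradient descent overshoot along the steep coordinate while barely moving along the flat one.

```python
import numpy as np

# Toy ill-conditioned quadratic: L(w) = 0.5 * (100 * w0**2 + 1 * w1**2),
# very steep in one direction and very flat in the other.
H = np.diag([100.0, 1.0])                # Hessian (matrix of second derivatives)
sv = np.linalg.svd(H, compute_uv=False)  # singular values of the Hessian
cond = sv.max() / sv.min()               # condition number: 100 / 1 = 100

# Gradient descent on this loss jitters: the steep coordinate w0
# overshoots back and forth while the flat coordinate w1 barely moves.
w = np.array([1.0, 1.0])
step_size = 0.018
for _ in range(5):
    w = w - step_size * (H @ w)          # gradient of the quadratic is H w
```

After five steps the steep coordinate has flipped sign (it keeps overshooting the minimum) while the flat coordinate has hardly left its starting point, which is exactly the jittering behavior described above.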
So, for example, here, for just the very end of this curve, it's completely flat. So if we were to imagine we're moving down the hill here, we would just get stuck, because it's flat, and we wouldn't be able to progress any further, because when we take the gradient here, it's zero. So this is actually a pretty big issue: it'll get stuck in a local minimum, because once we reach here, we don't really have any direction to go, the gradient is zero or very small, and we'll just sort of oscillate back and forth here. And here it could actually get stuck on this bottom example, because the gradient is zero here, even though, if it went a little bit further, it could go down significantly more. [00:40:55] Yeah. So the question is, maybe we can change the way we're doing the steps.
Maybe we could use the Hessian to determine the direction we go. We actually do have a brief slide talking about the sort of Hessian-style approach at the very end. That's not very commonly used in deep learning. But the short answer is yes, there are actually going to be several ways in which you can account for this, that we're going to go into in like five minutes. So it's a good question. Yeah, we'll get to that. [00:41:22] Okay. So I think one of the other things that you might not know is that, empirically, saddle points are actually much more common as you move to higher-dimensional models. So as your weight matrix gets larger and larger, you're more likely to find these saddle points. And there's this paper describing the frequency of them. If you don't know what a saddle point is, it's called a saddle point because it's shaped like a saddle, like on a horse.
And at the center of this saddle, the gradient is actually zero in all directions. So it's like the bottom of this curvature in one direction and the top of it in the other, and so in both the x and the y directions the gradient is zero. So you could get stuck here, despite being very close to going significantly down the loss landscape on either side. So this is also a pretty common issue with SGD, these saddle points, and as we move to higher-dimensional spaces, which is equivalent to models with more parameters, this is more and more common. This is a big issue. [00:42:16] And then a final issue with SGD is that we are sampling a subset of our data each time, right? So we're not looking at the whole thing: this represents the entire loss across all the data, but we're looking at just a subset each time.
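The classic saddle f(x, y) = x**2 - y**2 makes this concrete. This toy sketch (mine, not the lecture's) shows gradient descent parked exactly at the saddle point, and how a tiny perturbation off the ridge, the kind of noise SGD provides for free, lets the iterate escape.

```python
import numpy as np

# Classic saddle: f(x, y) = x**2 - y**2. At the origin the gradient is
# zero in every direction, yet moving along y would decrease f.
def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

# Start exactly on the saddle point: gradient descent never moves.
p = np.array([0.0, 0.0])
for _ in range(100):
    p = p - 0.1 * grad(p)
stuck = p.copy()                          # still (0, 0)

# A tiny nudge off the ridge and the iterate escapes: the y coordinate
# grows multiplicatively away from the saddle on every step.
p = np.array([0.0, 1e-6])
for _ in range(100):
    p = p - 0.1 * grad(p)
```

This also matches the discussion below about why, in practice, SGD's sampling noise makes getting permanently stuck on a saddle unlikely.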
So we'll actually have somewhat noisy update steps, because we're not looking at the entire data set. So we'll sort of be stepping towards this local minimum that we're trying to reach here, but each step doesn't go directly in that direction. So there's some noise in how we're progressing, because we're subsampling the data set. [00:42:56] Okay, cool. So to summarize, these are the main issues, and there's a pretty neat trick you can do, where you basically just add momentum. And you can really think of this the same way as a ball that's rolling down a hill, where it gains momentum. It's actually very similar to how it's modeled in terms of the physical properties. So it's a good way to gain intuition about it, at the very least.
So you can imagine that it could help with these local minima, because if you're rolling down with enough velocity, you'll be able to come out of them. If you have the saddle points, or just the flat point here, the model has been rolling down the entire hill, so it won't get stuck here anymore. It will continue. [00:43:38] Also, if you have this poor conditioning value, you will still have maybe some oscillation, but the nice thing is that it will sort of accumulate speed in this direction, to the right, because it will have multiple steps that keep going that way. So it'll go faster and faster towards the center here. So it also helps with this problem. [00:43:57] Finally, it can also help average out some of the noise in the gradients, because they all sort of have a direction in common, which is towards this minimum here.
And so, as you're computing the momentum, it sort of builds on itself, and it will converge faster, because the noise is accounted for by looking at the direction they all share in common, which is included in the momentum. So let me show you how to actually do it. But this is sort of the general intuition for how momentum works. [00:44:31] So we have SGD here. We have our minibatch x. We're computing the gradient, which is dx. We have the learning rate, or the step size, which we multiply by, and then we take the negative, because we need to go down the hill. This gives us our new x. [00:44:45] And this is SGD with momentum. We're now updating by this velocity term. So instead of updating by the gradient at the specific point, we're updating by the velocity. And the velocity at a given time step is given by the previous velocity plus the current slope.
So this is sort of how you calculate it. And you have this rho value, which is the momentum, that is, how much momentum you want to have. And if it's very high, then your new velocity is more dependent on the previous time step's velocity. And this is therefore sort of a running average of the last gradients, and the momentum term here gives you how much to weight the past versus the present. So now we're updating by this, and we still have this alpha, which is the step size. So it's actually a very simple change, right? You're just now computing the velocity, which is a function of the current velocity plus our gradient. [00:45:38] So I think I'll pause for questions here. This is the explanation of momentum, and maybe I could also recap briefly how it resolves all these issues we saw.
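The velocity update just described fits in a few lines of numpy. This is a sketch in the slide's spirit, not its exact code: the rho = 0.9 and learning-rate defaults and the toy quadratic gradient are my illustrative choices.

```python
import numpy as np

def sgd_momentum_step(x, dx, v, learning_rate=0.01, rho=0.9):
    """One SGD+momentum update as described above: the new velocity is
    rho times the old velocity plus the current gradient, and we step
    by the velocity instead of by the raw gradient."""
    v = rho * v + dx
    x = x - learning_rate * v
    return x, v

# Toy quadratic with minimum at 0, so the gradient at x is just 2x.
x, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    x, v = sgd_momentum_step(x, 2.0 * x, v)
```

With rho = 0 this reduces to plain SGD; larger rho weights the accumulated past more heavily, which is exactly the past-versus-present trade-off described above.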
So, you know, now that you're adding momentum over the past gradient steps, you could see how it would keep continuing along this direction, and depending on your rho, if your momentum is very high, it would keep going and be able to account for a very large hump here with the local minimum. Also, it's very good at these saddle points, because it will just continue along the direction in which it was going previously for a significant amount of time. And poor conditioning: if we're cumulatively going to the right with each step, the momentum will also be consistent there and build up. And then, if we're oscillating significantly here, it will move less in that direction, because the values will sort of cancel out, between the current direction and the velocity.
Um they'll be pointing the opposite direction so it will get [00:46:36] opposite direction so it will get minimized. The question is what happens [00:46:39] minimized. The question is what happens if you're rolling like right along the [00:46:41] if you're rolling like right along the saddle? I mean I think in practice it's [00:46:42] saddle? I mean I think in practice it's very unlikely but in that case yeah you [00:46:44] very unlikely but in that case yeah you would be you would just get stuck uh in [00:46:47] would be you would just get stuck uh in the saddle. Yeah I think that's like you [00:46:49] the saddle. Yeah I think that's like you know your initial conditions like [00:46:50] know your initial conditions like wherever you start is very unfortunate. [00:46:53] wherever you start is very unfortunate. So uh yeah sometimes I guess that could [00:46:55] So uh yeah sometimes I guess that could happen but it's very unlikely. Yeah. And [00:46:57] happen but it's very unlikely. Yeah. And it's also why in practice people won't [00:46:58] it's also why in practice people won't run like a single model um training run. [00:47:02] run like a single model um training run. Often they'll run multiple ones with [00:47:03] Often they'll run multiple ones with different random seeds just in case [00:47:05] different random seeds just in case something like that could happen. [00:47:06] something like that could happen. Another thing is if you're doing [00:47:08] Another thing is if you're doing stochastic uh gradient descent, you're [00:47:10] stochastic uh gradient descent, you're much more likely to have at least a [00:47:11] much more likely to have at least a little bit of noise to get you out of [00:47:12] little bit of noise to get you out of like directly in that saddle uh back and [00:47:15] like directly in that saddle uh back and forth. So I think it's basically it [00:47:17] forth. 
So I think it basically never would happen, because of the randomness, but hypothetically, I think that could occur. Yeah. [00:47:23] So the question is, why is the saddle just an issue with SGD and not optimization in general? It would also be an issue with the entire data set. It might even be more common with the entire data set. So it's an issue that SGD faces, but other optimization algorithms that just rely on gradient descent, with no sort of bells and whistles attached, would face the same thing. Yeah. [00:47:45] Yeah. So the question is, does adding the momentum make it more difficult to converge, because we'll overshoot and then, you know, have to come back? I think the short answer is, yeah, it might not help with converging, but on average it will help you find a better minimum point to converge to.
So it will converge maybe more slowly, but you won't get stuck in a local minimum, like you would just converge here if there was no momentum, versus overshooting. So I think a lot of this stuff is empirically shown, where it happens to be that, with this specific class of neural networks, momentum does help training, but this is the intuition for why we prefer it. [00:48:23] To be honest, people will use whatever works best, and there are cases where people have found that stochastic gradient descent without momentum would outperform for a particular model. So here's the intuition about why it could perform better, but in practice, people will just try a bunch of different ones and see what works best. And I'm going over the most common ones that people try now. Yeah. But yeah, you're right.
It could hurt convergence, potentially. [00:48:53] Okay. All right, I'll continue then. So, yeah, I think we went through this. And one other thing I wanted to point out is that there are different ways you can formulate this. So these equations are identical, but sometimes, depending on the implementation, you'll see it written in different ways. They're doing the same thing. Maybe in the interest of time I'll skip over why they're identical, but you could go over the slide and prove to yourself that these are essentially the same formulations. [00:49:23] Okay. I think the next thing I'll talk about is a different optimizer. So we talked about momentum, and now we'll talk about something called RMSProp.
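Before moving on to RMSProp: the claim that the different momentum formulations are identical can be checked numerically. The two writings assumed below (since the slide itself isn't reproduced here) are the common pair, v = rho*v + dx with x -= lr*v, versus v = rho*v - lr*dx with x += v; for a constant learning rate they trace the same trajectory.

```python
import numpy as np

def grad(x):
    return 2.0 * (x - 1.0)   # toy gradient, minimum at x = 1

lr, rho, steps = 0.05, 0.9, 50

# Formulation 1: accumulate raw gradients, scale by lr when stepping.
x1, v1 = 4.0, 0.0
xs1 = []
for _ in range(steps):
    v1 = rho * v1 + grad(x1)
    x1 = x1 - lr * v1
    xs1.append(x1)

# Formulation 2: fold lr into the velocity itself.
x2, v2 = 4.0, 0.0
xs2 = []
for _ in range(steps):
    v2 = rho * v2 - lr * grad(x2)
    x2 = x2 + v2
    xs2.append(x2)
```

The equivalence follows from the substitution u = -lr * v: with lr constant, the two velocity recurrences are the same recurrence up to that rescaling, so the iterates coincide.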
[00:49:35] So RMSProp is a bit of an older method now, from 2012, that came out of Geoffrey Hinton's group, and the basic idea is, instead of just having this running velocity that momentum captures, to add elementwise scaling of the gradient. How do we do this? We have this gradient-squared term, and the decay rate here is very much like the momentum term we explained before, but now it's applied to the squared gradient. So we have a running average where we take the previous accumulated term times the decay rate, and then add one minus the decay rate times, literally, the gradient squared. So this is a running average of our squared gradients.

[00:50:21] So bigger values will get much bigger and smaller values will get much smaller, and if there are consistently large gradients in certain coordinates, those will get very large as we continue this running average. And in the update step we actually divide by the square root of it. Someone asked earlier, what if we change the direction in which we're stepping? This is exactly the type of thing you can do, and that's what dividing by this squared-gradient term does. For the values of w in which the derivative is very large, we divide by a larger value, so we don't step as far in that direction. In the flatter regions we step farther, because we're dividing by a smaller term.

[00:51:08] So this is the basic intuition behind it, and it very much addresses the earlier question about whether we can change the way we're stepping; that's exactly what this is doing. You still have a learning rate, but you're dividing it by this square root of the accumulated squared gradients, which gives you larger steps in the flatter areas of your loss landscape and shorter steps in the very steep areas. Can anyone explain? I just gave a brief summary, but what happens in this specific line of the code? What happens with our gradient step direction? How does it change? We're dividing by this value, which depends on the current gradient and also the past gradients. Say one of these values is very large. These are vector operations: [00:51:54] we have a set of derivatives here, and we're dividing elementwise by another set of squared-gradient values. When the denominator is very large, the step effectively becomes smaller in that direction, because we're dividing by a large value. And when it's a very small value, the step becomes much larger, because the gradient-squared term is small; it's in the denominator, so we're increasing the effective step size. Oh yeah, so it's specifically for this type of example here, where you have maybe a very narrow valley and you want to be moving more in the flatter direction.

[00:52:33] Yeah, the question is, what does a small gradient mean in this context, and how does this help us move less along the steep directions and more along the flat directions? Yeah.
[00:52:47] So I think this is actually maybe a great visual, because it compares the three different approaches. We have momentum, which you can see sort of overshoots, as there was a question about earlier, but then it kind of comes back. You have SGD, which is slower because it's just always moving in a fixed direction. And then you have RMSProp, which we just mentioned. The way RMSProp works here is that because the gradient in the direction I'm moving my mouse is higher, the gradient-squared term is larger, so we move less in that direction. You can see it quickly starts turning here towards the center, where the landscape is flatter, but it's traversing more in that direction. [00:53:29] So we're actually changing the direction we're going, by going less in the steep direction and more in the flat direction.

So those are the three, and then there's one more we'll discuss, which is by far the most popular optimizer used in modern deep learning. It's basically a combination of SGD with momentum and RMSProp. So here is almost what the Adam optimizer is, which is the most popular optimizer in deep learning, and you have all the prerequisite knowledge now to understand it. This first term here in red is basically the momentum we described before: beta 1 is like the momentum term, we have the velocity here, and we're taking a running average. [00:54:15] The second moment here is like the gradient-squared term from RMSProp, and we're doing the same thing, where we multiply the learning rate by the velocity instead of by the raw gradient, but we still take the square root of the second moment. The names first moment and second moment are a relation to physics and mechanics. But it's basically just a combination of the two things we explained earlier: you're accelerating movement along the flat directions, dampening it along the steep ones, and you're also adding this notion of momentum and velocity, so you gradually build up speed if you're continuously moving in the same direction.

[00:54:56] Now, as it's written right now, this will actually run into issues at the very first time step. It might be a little unclear to you why, so I'll wait for someone to have a guess. One thing to note is that these betas, beta 1 and beta 2, are usually initialized very close to one, like 0.9 and 0.999, and that these two moment values are initialized to zero. So during your first time step, if you just use this formulation of Adam, you can get unwanted behavior. It has to do with the second moment calculation; that's the main issue here. When you calculate the second moment and then use it on the next line, you run into a problem. Yeah, the denominator is basically zero. That's the exact issue: it starts at zero, so this term is zero; you have a very large beta, so this value is very small. [00:55:50] And if your gradient is not very large on your first step, this whole term can be very close to zero. Now we're dividing by something very close to zero, and it creates a very large initial step even though our gradient was small. That's probably not something we want. So the final thing Adam adds is these bias-correction terms, which account specifically for this issue and depend on the time step of training. This is also something you'll go into in the homework. [00:56:18] I just want to give you the basic intuition behind Adam and why the naive implementation wouldn't work, which is this really large initial step. You'll implement this in the homework and see how the time step is used, but the basic idea is that the correction accounts for that very large initial step, and as your time step increases, these bias-correction terms are needed less and less.

[00:56:38] Okay, cool. These are some good defaults that people normally use. If you're training a model with Adam, you could go with these; maybe it'll work, maybe it won't, but it's a good starting point, and in the remaining slides we'll talk about how you know whether your learning rate and these other values are right. I'll speed up a little in the interest of time, but you can see all these different optimizers converging.
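Putting the pieces together, here is a sketch of the full Adam update with the bias correction just described. The names are mine; the defaults shown (`lr=1e-3`, `beta1=0.9`, `beta2=0.999`) are the commonly used starting values mentioned in lecture:

```python
import numpy as np

def adam_step(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t is the 1-based time step."""
    m = beta1 * m + (1 - beta1) * dw       # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dw * dw  # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)           # bias correction: compensates for
    v_hat = v / (1 - beta2 ** t)           # the zero initialization at small t
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Without the `m_hat`/`v_hat` correction, `v` starts near zero on step one, so the division would produce the huge initial step discussed above; with it, the first step stays on the order of the learning rate.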
[00:57:07] They all have different properties, and you can see how Adam is this combination of RMSProp and SGD with momentum, where it has characteristics of both, which is very neat to see visually; it aligns with our intuition.

[00:57:19] One final topic related to Adam is how regularization interacts with the optimizer. For example, if we have L2 regularization, how does this affect how the optimizer works? I think the answer is that it's actually not immediately obvious, and you can do it in different ways. In default Adam, the L2 term is included when computing the gradient. [00:57:44] So we looked at the gradient, and there was the data loss portion and then the regularization loss; Adam uses both of those when it computes the gradient. But AdamW basically looks only at the data loss when doing all of these moment calculations and steps, and just adds the regularization term at the end. So all I'm trying to describe is that there is flexibility in how you incorporate regularization into your optimizers. Weight decay generally refers to adding the L2 regularization at the end, without including it in the actual optimizer calculations for the velocities, momenta, and so on.

[00:58:21] So this is the main difference, and under a lot of settings AdamW works slightly better; I think the Llama series from Meta all use AdamW, I assume because it does slightly better for them. So, we have one function to optimize; why are you splitting it into two? Yeah, so if you mix it into one function, that's what Adam does, and AdamW is specifically separating it into two. Why you might want to do that: you may not want your velocities, your momenta, to be a function of the weights; you want them to be a function of the loss. If you're trying to traverse your loss landscape more independently of your actual weight values, that's why you might want to separate it. You still might want a regularization term, but you don't want it to interfere with the moment calculation.
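The separation being described can be made concrete with a sketch of an AdamW-style step (variable names are mine): the weight-decay term never touches the moment estimates, and is applied directly to the weights at the end.

```python
import numpy as np

def adamw_step(w, dw_data, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW-style update; dw_data is the gradient of the data loss only."""
    m = beta1 * m + (1 - beta1) * dw_data        # moments see only the data loss
    v = beta2 * v + (1 - beta2) * dw_data ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive step
    w = w - lr * wd * w                          # decoupled weight decay
    return w, m, v
```

Plain Adam with L2 regularization would instead fold the `wd * w` term into the gradient before the moment updates, which is exactly the coupling AdamW removes.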
[00:59:07] So this is the specific reason why they do it. Ultimately it's empirical: you try both and see which one works better, but this is why you would do it that way.

[00:59:15] Okay, cool. So we'll talk about learning rates. There are different ways learning rates can be chosen. Sometimes you'll pick a very high learning rate, and what will happen is your loss gets very large as you oscillate out of the loss landscape, as we described earlier. If you have a very low learning rate, your issue is that you just converge very slowly. If you have a high learning rate but you're not oscillating out, you might still not be able to converge, because you're bumping around the local minimum without actually getting any lower, since your learning rate is too high.
And ideally, a good learning rate would have this property where it causes your loss to decrease quickly over time, but you keep seeing continued improvements as you continue to train the model.

[00:59:59] In reality, depending on the situation, a lot of these could be good learning rates, and it also depends on the step in training, which is the final thing we'll discuss in lecture today. You can actually change your learning rate as you train your model; you don't need a fixed learning rate or step size, and pretty much all the best modern deep learning models vary the learning rate during training in some way. One really simple way to do it is, after a fixed number of iterations, you just take one-tenth of the learning rate and continue training. [01:00:38] This can resolve the issue where your learning rate is too high for you to converge any further: you reduce it, and you're able to get lower into the loss landscape. This is really commonly used when training ResNets, a very popular type of convolutional neural network which we'll discuss later in the course.

[01:00:55] Another thing you could do is cosine learning rate decay, which is also extremely popular. Here you have basically half of a cosine wave, where you start at your maximum learning rate and go down to zero at the end, following this half-cosine shape. Here's the formula for calculating it; I won't go into the details, but the basic idea is that there are a ton of different ways to do it. [01:01:23] When your training uses a cosine learning rate scheduler, you'll often see a loss curve shaped like this, where you get pretty good continued gains in the middle part of training. The basic point is that the actual shape of your loss during training will depend heavily on which scheduler you use; it looks very different, for example, from this one, where you can literally see where we take one-tenth of the learning rate during training.

[01:01:45] Another option is linear learning rate decay, which just follows a straight line; you could also do inverse square root, and so on. [01:01:52] There's basically an unlimited number of ways you could vary your learning rate during training, and depending on the type of model you're training, you just choose the one that works best; here are some you could try that could perform well in your setting. Also, a really popular strategy is to have a linear warm-up: instead of starting at your maximum learning rate, you spend a fixed number of iterations linearly warming up to your maximum value, and then you follow whatever schedule you had afterwards. So, for example, linear warm-up and then inverse square root, or linear warm-up and then cosine, is a very popular setup for training models.
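The warm-up-then-cosine combination just mentioned can be sketched as a small schedule function. This is a sketch only: the exact formula varies between implementations, and the warm-up length and names here are my choices:

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, warmup_steps=100):
    """Learning rate at a given step: linear warm-up, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))  # half cosine
```

Step decay, by contrast, would just return something like `base_lr * 0.1 ** (step // decay_every)`, producing the staircase-shaped curve mentioned for ResNet training.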
[01:02:35] One final thing: there is an empirical rule of thumb called the linear scaling rule (sometimes the linear scaling law), which says that if you increase your batch size, the number of training examples per update, by a factor of n, you should also scale your learning rate by n. So as you increase your batch size, you should increase your learning rate directly proportionally. The math behind this is a bit involved, and it's more of an empirical rule of thumb; people have tried to give mathematical arguments for why it should hold, based on the variance of the gradients in your batch, the number of gradients you calculate per batch, and so on.
[01:03:24] But really, it has just been shown empirically to hold for a large number of problems, so it's a good rule of thumb: if you have a winning recipe but you want to increase the batch size, then also increase your learning rate by the same factor.
[01:03:35] Cool. The final thing I'll touch on very briefly is the idea of second-order optimization, which uses the Hessian that someone asked a question about earlier. We won't cover this in depth, and it's not something we spend much time on in the course, but you should know it exists. The basic idea is that right now we're using the gradient to form a linear approximation of where the downward direction is as we traverse the loss landscape; we just look at that direction and take a step along it.
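The rule of thumb just described can be written in one line; the base batch size and base learning rate below are illustrative assumptions, not values from the lecture:

```python
def scaled_lr(new_batch, base_batch=256, base_lr=1e-3):
    """Linear scaling rule: if the batch size grows by a factor n,
    multiply the learning rate by the same factor n.

    base_batch=256 and base_lr=1e-3 are illustrative defaults only.
    """
    return base_lr * (new_batch / base_batch)
```

So a recipe tuned at batch size 256 with learning rate 1e-3 would use 2e-3 at batch size 512.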
[01:04:12] We added fancy things on top, like momentum and the RMSProp term that decelerates along the steep directions, but that's the basic idea: we use the gradient at each time step.
[01:04:21] The idea with the Hessian is that instead of using only the gradient, you fit a quadratic, a second-degree polynomial, to your function, based on the derivatives and the Hessian at that point, and you then find the minimum of that quadratic. In certain optimization problems this works extremely well, but generally we don't use it in deep learning, because it requires two things.
[01:04:53] One, you have to do a Taylor series expansion: right now we're only doing the first-order part, taking the derivative, but you would need to be able to calculate the second mixed derivatives, which is already difficult. On top of that, the matrix of mixed derivatives of every parameter in your model with respect to every other parameter gets very large for these million- or billion-parameter neural networks. So in practice we don't use it, because the matrices become far too large and you run out of memory, specifically GPU memory, if you try to run it. But if you're training a smaller model, or you're okay with spending much more time to get better steps towards the minimum, then you may want to look into this.
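As a sketch of the second-order idea on a toy problem of my own (not code from the lecture): fit a quadratic using the Hessian H and jump to its minimum with the Newton step w ← w − H⁻¹∇L. For an actual quadratic loss, a single step lands exactly on the minimum.

```python
import numpy as np

# Toy ill-conditioned quadratic bowl with minimum at (3, -1); my own example.
def loss(w):
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2

def grad(w):
    return np.array([2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)])

def hessian(w):
    # Constant for a quadratic; for a p-parameter model this is p x p,
    # which is what makes the approach infeasible for large networks.
    return np.array([[2.0, 0.0], [0.0, 20.0]])

w = np.zeros(2)
w = w - np.linalg.solve(hessian(w), grad(w))  # one Newton step
```

Gradient descent on this bowl would zig-zag along the steep direction; the Newton step rescales each direction by its curvature and converges in one step here.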
[01:05:42] It depends on the problem, but for smaller models this actually works quite well. For the large neural networks we're training, we basically never do it, due to the memory restrictions and all the compute time you'd spend calculating the Hessian; you would rather just see more data during training.
[01:05:58] All right. Some concluding thoughts that may be useful. Adam, or AdamW, is a really good default choice for training your first model; if you're working on a new problem in a domain, I would recommend it, and it can work okay even with a constant learning rate. Usually people will try AdamW with a constant learning rate, or with a linear warm-up and then a cosine decay; those are a really popular combination. Also, SGD with momentum can sometimes outperform Adam.
[01:06:31] But the tricky thing is that you generally have to tune the values more: you have to try many more learning rates, because you don't have the RMSProp-style term to account for the steep directions, and you might also have to try different schedules. Adam, in practice, is sort of best by test: people have tried it in a bunch of different domains and it works very well; it's very adaptive to the loss landscape.
[01:06:57] If you're doing a full-batch update, where at each step you can fit basically your entire training set into your batch, you might want to look beyond first-order optimization into second order and beyond: in that case your dataset, or maybe your model, is not very large, and you could potentially benefit from these nonlinear update steps and from more sophisticated strategies for finding the minimum.
[01:07:25] So I think we're essentially done with the lecture; I'll give a few slides looking forward. How do we optimize more complex functions than the linear models we covered in this lecture? Next lecture we'll look specifically at neural networks, which is a very exciting topic.
[01:07:46] A two-layer neural network, the one we'll discuss in class, basically has two of these weight matrices, one for each layer, with something called a nonlinearity stuck between them. In this case, not the most common but the simplest one is the ReLU function, which you'll learn more about; the basic idea is that we now have two weight matrices, with this additional function applied between the two matrix multiplications. This is nice because, as I said, it's nonlinear. If we try to build a linear classifier on data like this, we run into the issue that the blue points and the red points are not linearly separable.
[01:08:24] But maybe there are some transformations we can do, possibly through many layers of a model, that eventually transform the data into a form in which it is separable by a line, which would then be the final layer of the model.

================================================================================ LECTURE 004 ================================================================================ Stanford CS231N | Spring 2025 | Lecture 4: Neural Networks and Backpropagation Source: https://www.youtube.com/watch?v=25zD5qJHYsk --- Transcript

[00:00:05] As you can see on this slide, today we're going to talk about neural networks and backpropagation, which is actually the process that, in my early years studying this, I often referred to as the magical process that lets neural networks learn from their own mistakes, pretty much like humans, but in a more organized fashion and using a little bit more math. So let's dive into the topic.
[00:00:47] I'm sure this is going to be exciting, and it lays a foundation for the rest of the quarter: every single algorithm we'll discuss in the future, without even mentioning it, uses a form of backpropagation, which is why understanding this lecture and its topics is very important.
[00:01:09] Okay, in keeping with tradition, let's cover what we've talked about so far. I'm sure you remember what we discussed last time: we saw how to form the objective functions, or loss functions as we call them here, and then we talked about regularization. To do that, we formulated everything through the (x, y) pairs and a scoring function, in this case a linear scoring function, as you can see, ultimately defining this loss
function. [00:02:06] The graph you see on the right is what we drew showing the entire process of learning. There have been some questions, in the last lecture and even before, about why we only use the softmax function. I want to reiterate that it's not the only loss function we have: it's one of the most widely used in deep learning, especially for the task of classification, but there are many other options for different tasks, and even for classification itself. If you've looked at the slides shared on the website, I included this hinge loss, or what used to be called the SVM loss, in the reading assignments for lecture two.
[00:03:11] In those slides we had examples and everything around the topic of hinge loss. It is also a widely used loss function, especially from the early years of neural networks. To give you a high-level understanding: this is a loss function that, unlike softmax, does not turn the scores into probabilities, so turning scores into probabilities is not the only option, right? We can use other formulations. This function encourages the score of the correct item, denoted s_yi, to be higher than the scores of all other items s_j; you can see the condition here producing a value of zero
[00:04:18] if the condition is true. Otherwise, as I said, it encourages the score of the correct item to exceed the scores of all other items by at least a margin: the number one you see there is that margin, and if the condition is violated, the loss increases proportionally with how far the margin is violated. This is the visualization of the function. So it promotes correct scores by penalizing cases where irrelevant items are scored too highly. Again, refer to the reading assignment in lecture two for examples and a better understanding.
[00:05:17] Next, we talked about general optimization: how to find the best parameters W for the neural network.
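The hinge loss just described can be sketched for a single example; the score values used in the comment are illustrative, not from the lecture:

```python
import numpy as np

def multiclass_hinge_loss(scores, y, margin=1.0):
    """Multiclass SVM (hinge) loss for one example.

    scores: 1-D array of class scores s_j; y: index of the correct class.
    Each incorrect class contributes max(0, s_j - s_y + margin), so the loss
    is zero only when the correct score beats every other score by the margin.
    """
    correct = scores[y]
    margins = np.maximum(0.0, scores - correct + margin)
    margins[y] = 0.0  # the correct class contributes nothing
    return margins.sum()

# Example: scores [3.2, 5.1, -1.7] with correct class 0 gives
# max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1) = 2.9 + 0 = 2.9
```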
[00:05:32] In doing so, we talked a little about the loss landscape being like a large valley, as shown in this image. Every point in that valley is a different set of weight parameters, and we want to find the set of parameters W that minimizes the loss over that landscape. We talked about how the key is being able to take the gradient of the loss function L with respect to W and use it for optimization in a step-by-step manner, which gave us the gradient descent algorithm. So the weights are updated, although it's very hard for me to see from this distance what I'm pointing to, but I can guess. So anyway,
[00:06:52] in order to walk down the loss landscape towards the minimum, a step size is defined, and we take one step, scaled by that step size, in the negative direction of the gradient. That was the gradient descent algorithm. To compute the gradient, we talked about two different approaches, numerical gradients and analytical gradients, each with pros and cons, and we discussed that in practice we derive analytical gradients; often, when the math and the implementation are hard, we check our implementations against numerical gradients.
[00:07:37] One of the other challenges we talked about was evaluating the loss function and its gradient on the entire dataset: if you have a large dataset, it's very expensive to run the loss and its derivative over all of it.
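The gradient check mentioned above, comparing an analytical gradient against a numerical one, can be sketched like this on a toy loss of my own (not code from the lecture):

```python
import numpy as np

def loss(w):
    return np.sum(w ** 2)          # toy loss; its analytic gradient is 2w

def analytic_grad(w):
    return 2.0 * w

def numerical_grad(f, w, eps=1e-5):
    """Centered finite differences: (f(w + eps) - f(w - eps)) / (2 eps),
    one coordinate at a time."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2.0 * eps)
    return g

w = np.array([1.0, -2.0, 0.5])
# The two gradients should agree to several decimal places.
```

This is also why the numerical approach is impractical as the main method: it needs two loss evaluations per parameter, but it makes a cheap sanity check for a hand-derived gradient.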
[00:08:01] That's why we talked about the idea of mini-batches: using a number of examples sampled from the dataset, often 32, 64, 128, or 256, and that subsampled data is used to estimate the gradients and then take steps towards the minimum. Beyond stochastic gradient descent, we talked about some refinements: SGD with momentum, RMSProp, and the Adam optimizer. There were a lot of details there, and I would refer you to the third lecture if you have any specific questions about those.
[00:09:06] Another thing we talked about was the importance of the learning rate and of scheduling the learning rate.
[00:09:19] In some optimizers we often start with a larger learning rate and then apply some type of decay, reducing its value by a factor. This is normally needed in many optimizers, but in some of the more recent ones, Adam and its variants, we often do not need to decrease it manually or explicitly, because that is in a sense encoded into the optimizer itself.
[00:09:53] With that, I want us to get to the topic of neural networks and see how we can actually build them and solve more exciting and harder problems. So far we've talked about this linear function, W multiplied by x, and that is the most basic neural network that could be defined: it's just one layer. We will be talking about layers.
[00:10:41] What I want you to pay attention to here are these dimensions D and C: D is the dimensionality of the input data X, the number of features, and C is the number of classes, basically the number of output nodes or neurons, however many outputs we need.
[00:11:05] To create a neural network with a second layer, we define a new set of weights, referred to as W2 here, and apply them to the output of the previous layer, W1 multiplied by X. Again, pay attention to the dimensionalities: we have C outputs and D input features, but we also define H, the number of hidden-layer nodes or neurons. That's one point. The second point is this max function, which we'll come back to and explain what it is and what it means.
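The two-layer form just described, f(x) = W2 · max(0, W1 · x), can be sketched like this; the sizes of D, H, and C are illustrative values I chose, and the bias terms are omitted as in the slide's simplified notation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 4, 8, 3          # illustrative: input features, hidden units, classes

W1 = 0.01 * rng.standard_normal((H, D))   # first layer weights: (H, D)
W2 = 0.01 * rng.standard_normal((C, H))   # second layer weights: (C, H)

x = rng.standard_normal(D)                # one input example
h = np.maximum(0.0, W1 @ x)               # max(0, .) is the ReLU nonlinearity
scores = W2 @ h                           # one score per class, shape (C,)
```

Without the `np.maximum` in between, the two matrix products would collapse into a single matrix W2 W1, i.e. back into a one-layer linear model, which is why the nonlinearity is essential.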
[00:12:02] What the max operation is doing here is creating a nonlinearity between the linear transformations done by W1 and W2, and this is actually a very, very important part of the process. I will talk a little bit about the nonlinearity, but also look at this last part before I forget: it's true that in practice we are only writing W and x here; as we talked about in the first and second lectures, we also incorporate a bias to have a complete framework. So in practice we also have a bias, but we don't write it here for the sake of simplicity.
[00:12:52] Anyway, the max operation is creating the nonlinearity, and it's actually very important, because when we talked about linear classifiers in the last few lectures, we mentioned that there are so many problems where we can't separate the
samples with just one single line, right? This was one of the examples: in order to be able to solve this problem with linear functions, we need some sort of nonlinear transformation from the original space to a new space, and in the new space you see that the samples are separable using a line. In this case it's a nonlinear transformation between the input space and the second space, mapping x and y to their polar coordinates r and theta. But again, this is just one example; there are so many others too.
[00:14:01] So with this example, let's go back, oops, let's go back to our definition of the two-layer neural network.
[00:14:14] As you've probably seen in the literature outside this class, these types of networks, which only depend on weights, inputs, layers, and so on.
There are no operations other than multiplication. Such networks are often referred to as fully connected networks, or multilayer perceptrons, MLPs. So that's one thing, and we can actually stack more and more layers to create larger networks. In this case, again, pay attention to the dimensionalities of the hidden layers that we have in the middle, and how the dimensionalities match one after the other.
[00:15:08] So, back to this visual representation of what the neural network is doing. We talked about this when we had the linear representations: often what happens is that the network, through its weights, is learning some sort of templates.
[00:15:32] If you remember, last week we were talking about these templates that are being learned. So again, I'm saying
templates, but they are really just representatives of the images, learned from the data, depending on what data the model was trained on. So the templates we discussed last week were generated by those 10 outputs, by applying the W's on top of the input neurons.
[00:16:09] So with that, now that we have multiple layers, we can actually create more templates. Now we have a layer in the middle that can create, let's say, 100 templates, as opposed to just 10 for a linear classifier, although we still have those 10 as well. And again, from a very high-level point of view, I'm telling you what this means: when we have these 100 neurons in the middle, we are giving the network the power to create templates not for entire objects, but maybe
for parts of the object. For example, the classes that you see here: we had bird, cat, deer, dog, frog, horse, and they all have eyes, right? So one of those 100 templates could be a part of an object that is shared between all of the classes. So that is the high-level picture: these can form templates, and when we come back to the topics of visualization and what neural networks learn, we'll uncover more details about what I'm talking about right now.
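Going back to the change-of-coordinates example from a few minutes ago: the polar mapping can be sketched concretely. A small illustrative check (the ring radii are made up, not from the slide):

```python
import numpy as np

def to_polar(x, y):
    """Map Cartesian coordinates (x, y) to polar (r, theta)."""
    return np.hypot(x, y), np.arctan2(y, x)

# Two concentric rings are not separable by a line in (x, y), but after the
# polar mapping the classes differ only in r, so a line r = const separates them.
angles = np.linspace(0, 2 * np.pi, 50, endpoint=False)
r_inner, _ = to_polar(1.0 * np.cos(angles), 1.0 * np.sin(angles))  # class 0, radius 1
r_outer, _ = to_polar(3.0 * np.cos(angles), 3.0 * np.sin(angles))  # class 1, radius 3
assert r_inner.max() < 2.0 < r_outer.min()
```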
[00:17:27] So, back to the max function. We talked about the max function and the nonlinearity it creates here, and in neural network terminology we call that an activation function. It's actually playing a very important, pivotal role in building a neural network. Let's answer this question that we have on the slide: what happens if we try to build a neural network without one of these activation functions, say the max function? This would be our function if I removed the max: it would be W2 times W1 times x. What would happen here? Yes, exactly. As you can guess, and as you correctly mentioned, the multiplication of W2 by W1 could easily be replaced with another matrix, W3, and then your function becomes just a linear function. Everything could be lumped together. So we need
some sort of nonlinearity in the middle to give us the power to solve nonlinear problems.
[00:18:46] The function that we just talked about is ReLU, the rectified linear unit. It's a very popular activation function used in neural networks, and there are many other variants that have been tested in many architectures, even in the more modern ones. One of the problems that ReLU has is that it sometimes creates dead neurons, because it makes everything equal to zero if it's not positive, right? So, in order to avoid dead neurons, Leaky ReLU, with this type of modeling, or ELU, the exponential linear unit, are other options.
[00:19:39] ELU is a little bit better because it is closer to zero-centered. And then there are some newer variations: GELU, the Gaussian Error Linear Unit (I've heard both pronunciations, "jell-u" and "gell-u"), could be used; it is used more often in newer architectures, in transformers. And we also have SiLU, or Swish, the sigmoid linear unit; that one is also used in some of the modern CNN architectures. Google was using it in some variations of their models, and also in EfficientNet.
[00:20:34] Other than these, there are functions like sigmoid and tanh that are also often used as activation functions, although they do have a few problems, because they squash values into a narrow range, and that sometimes results in vanishing gradients.
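For reference, the activations named here can be written out directly. These are scalar versions of the standard textbook definitions, not code from the lecture; GELU is given in its exact erf form.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):          # small slope for x < 0 avoids dead neurons
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):                  # smooth, closer to zero-centered for x < 0
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def gelu(x):                            # Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):                            # a.k.a. Swish: x * sigmoid(x)
    return x * sigmoid(x)
```

Note how ReLU zeroes out every negative input (the dead-neuron issue), while the leaky and exponential variants keep a small signal there.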
[00:21:01] So we often do not use sigmoid or tanh in the middle of neural networks. They are more often used in the later layers, where we want to, for example, binarize the outputs and things like that. So, as I said, ReLU is often a good default choice; it's used in many architectures, and there are many variations of the same function, as we talked about.
[00:21:33] I want to summarize what we've talked about and then answer some questions. We did talk about adding layers and so on, but I want to highlight that activation functions are functions operating within the layers, and you also have the W's, which define the weights mapping between the previous layer and the next layer. Again, these are fully connected neural networks with very simple implementations. All we need is to be able to define an activation function.
And in this example, if you look at it, we have the sigmoid function defined as the activation function. Very easily, using that activation, the first and second layers of hidden values, the hidden neurons, are calculated by applying W1 to x, adding the bias, and then applying the activation function; the same goes for h2. And the output is very simply the dot product between w3 and the last layer of hidden values, creating the output layer.
[00:23:03] I'll stop here to answer some questions, if there are any, and then I would love to continue. That is a great question, and the question is: for a new problem, how would we choose which of these activation functions to use? The short answer to your question is yes, it's empirical in most cases.
[00:23:27] But we often start with ReLU, or we go with the standard activation functions used for those specific architectures. As I mentioned, there are activation functions that are commonly used in CNNs or in transformers and other architectures, so we often go with the ones that have been tested before. But yes, it's mostly empirical. If you're designing a new network for a new problem, then that's one of the choices you have to make, very much like other hyperparameters.
[00:24:05] So the question here is: what is the attribute that is common to all of these activation functions, and what does it really do? I will give you some examples, and I'll go into some of the details of what these activation functions are doing.
Basically, the main and most important common characteristic here is creating nonlinearity; we're not using a linear function as the activation. So creating some sort of nonlinearity is what makes these functions so important. And why do we have so many variations? I told you a little bit about the problems with vanishing gradients, and a little bit about the differentiability of the functions: they should be differentiable, because we are using them in a neural network. And sometimes having a properly zero-centered, smooth function makes networks converge much faster. So there are many different factors, and these are the main ones I talked about, which play an important role in defining or designing these functions.
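The vanishing-gradient point can be made concrete with sigmoid's derivative, which peaks at 0.25 and dies off for large inputs. A small illustrative check, not from the lecture slides:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The maximal slope, at z = 0, is only 0.25; far from 0 the gradient nearly
# vanishes, so stacking many sigmoid layers shrinks gradients multiplicatively.
peak, tail = sigmoid_grad(0.0), sigmoid_grad(10.0)
```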
[00:25:30] I'll talk a little bit more about that when I go into the details of the functions, too. In all of the layers we often use the same activation function, but as I said, sometimes in the later layers, or the output layer, we use something like a sigmoid or a tanh function. But commonly, yes. And the question was whether we use the same function across the entire network, for all of the neurons.
[00:26:09] Okay. Continuing with what we were talking about, which is the implementation of these models, of a neural network.
[00:26:29] So, there is a very simple way: building a two-layer neural network in Python is less than 20 lines of code. Very simple. We define our network, and as I talked about the dimensionalities, N is the number of samples, D_in is the dimensionality of the input, D_out is the dimensionality of the output, and H is the number of neurons in the hidden layer. And this part is just creating X and Y and randomly initializing the W's.
Then we have the forward pass, which means applying the W's to the inputs, layer by layer, ultimately creating the output, the prediction y_pred, and finally calculating the loss function and outputting that loss value.
[00:27:33] After the forward pass, we need an optimization process: a way to calculate the analytical gradients, and then to use those gradients to run gradient descent to optimize W1 and W2, basically taking one step toward the optimal values of the network.
[00:28:00] But this part, calculating the analytical gradient, is the most important part here, and we haven't really gone into it yet. Almost the rest of this lecture is about making this work and scale in different settings.
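A training loop in that spirit, with the forward pass, manually derived analytical gradients, and a gradient-descent step, can be sketched in NumPy like this. The sizes, iteration count, and learning rate are illustrative; the slide's actual code may differ.

```python
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10    # samples, input dim, hidden neurons, output dim
rng = np.random.default_rng(0)
x, y = rng.standard_normal((N, D_in)), rng.standard_normal((N, D_out))
w1, w2 = rng.standard_normal((D_in, H)), rng.standard_normal((H, D_out))

lr, losses = 1e-6, []
for t in range(500):
    # Forward pass: apply the weights layer by layer to get the prediction.
    h = x @ w1
    h_relu = np.maximum(h, 0)
    y_pred = h_relu @ w2
    losses.append(float(np.square(y_pred - y).sum()))

    # Backward pass: analytical gradients of the squared-error loss.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T @ grad_y_pred
    grad_h = (grad_y_pred @ w2.T) * (h > 0)   # ReLU gate zeroes gradients where h < 0
    grad_w1 = x.T @ grad_h

    # One gradient-descent step toward better w1 and w2.
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2
```

Deriving grad_w1 and grad_w2 by hand like this is exactly what becomes unmanageable at scale, which is the problem backpropagation addresses in the rest of the lecture.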
[00:28:26] So, after training such a neural network, depending on how many nodes we use in the hidden layer, you see that we can get different patterns of separation between the two classes, and more neurons often means more capacity to learn more complex functions and better separation of the points.
[00:29:00] If you take a look at this, the pattern I'm showing here is very similar to the one I showed in the second lecture, where we were talking about k-nearest neighbors. When we had k equal to one, the one-nearest-neighbor framework, it was very much like using more neurons. So the same type of argument applies here: if we give the network a lot of capacity, then we will have some overfitting problems; we won't be
able to generalize to unseen data. But there are many different solutions for this as well, and as a rule of thumb, what I want to highlight for you here is: do not use the size of the neural network as a regularizer. We don't often use that as the hyperparameter to fine-tune, although we do experiment with different values of the network size and related hyperparameters. What we often do is go with a somewhat bigger network than we need, and then we use regularization, and specifically the regularization hyperparameter, to explore the different setups. So what we often tune is the regularization and the regularization hyperparameter, not necessarily the network size itself.
[00:30:45] Okay. This is the concept of neural networks in a nutshell. But we
But we have heard about neural networks and how [00:30:58] have heard about neural networks and how they could be related to uh the [00:31:02] they could be related to uh the biological there are some biological [00:31:04] biological there are some biological inspirations. Uh so I'll I'll talk a [00:31:05] inspirations. Uh so I'll I'll talk a little bit about it but there's a [00:31:06] little bit about it but there's a question basically your question is uh [00:31:10] question basically your question is uh why is the model [00:31:12] why is the model more underfeeding when we increase the [00:31:15] more underfeeding when we increase the value of lambda here? Yes. So um just to [00:31:20] value of lambda here? Yes. So um just to quickly answer that question, the value [00:31:22] quickly answer that question, the value of lambda is controlling how much [00:31:25] of lambda is controlling how much contribution the regularizer should have [00:31:28] contribution the regularizer should have in the overall loss, right? And the [00:31:31] in the overall loss, right? And the larger contribution that you have um on [00:31:34] larger contribution that you have um on the regularizer and remember that [00:31:36] the regularizer and remember that regularizer was defined on W's. So it's [00:31:40] regularizer was defined on W's. So it's constraining the W's. It's giving less [00:31:42] constraining the W's. It's giving less freedom to the values on W's, right? So [00:31:46] freedom to the values on W's, right? 
[00:31:50] So less freedom means somewhat more generic decision boundaries, not necessarily giving you those detailed parts of the boundaries, right? So if you constrain the model too much, even with a regularizer, you're also going to get decision boundaries like that. Yes, does the right regularizer always prevent overfitting? Again, you are creating a compromise, a balance, between the data loss, predicting the right output, and the regularizer. The first part of the loss is about predicting the right output. The second part only plays with the values of the weights; it doesn't care about the outputs anymore. If you overweight it, you're not going to get very good classifiers, right? So a balanced regularizer is always good, but nothing is good if you use too much of it. Right.
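The balance being described, a data term that cares about the outputs plus a lambda-weighted term that only constrains the weights, can be sketched in a few lines (a minimal illustration with a made-up linear model and toy data, not the course's assignment code):

```python
import numpy as np

def total_loss(W, X, y, lam):
    """Data loss (mean squared error of a linear model) plus
    an L2 regularizer weighted by lambda."""
    scores = X @ W
    data_loss = np.mean((scores - y) ** 2)   # cares about the outputs
    reg_loss = lam * np.sum(W ** 2)          # only constrains the W's
    return data_loss, reg_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W = rng.normal(size=3)
y = X @ np.array([1.0, -2.0, 0.5])           # toy targets

for lam in [0.0, 0.1, 10.0]:
    d, r = total_loss(W, X, y, lam)
    print(f"lambda={lam:5.1f}  data_loss={d:.3f}  reg_loss={r:.3f}")
```

As lambda grows, the second term dominates the total; pushed too far, it drags every weight toward zero and the model underfits, which is exactly the compromise described above.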
[00:32:53] Could you go over again why we would want to change the regularization rather than the size? So, there are many different reasons. One of them is the size of the network: if you're going to build networks, sometimes you have to run them for a few days to get some results, right? So what we often do is start increasing the number of parameters in the network until we see some level of overfitting. That's when we know that the network is actually capturing the patterns in the data and is now able to memorize the data, and that's when we try to minimize the overfitting by regularizing the network. So regularization plays an important role there.
[00:33:54] So if we go too high on the number of parameters, on the complexity of the network, then that's going to cause a problem, and we don't often do that. For a new problem, we often start with smaller networks, increase the size, and then correct it with the regularizer for the given problem. How do we know how many neurons we need to solve the problem? That's based on empirical research work and on looking at other, similar setups; there is no one prescription for all. You have to look at other counterparts, other types of networks that were trained on similar data, start from that range, and then often run a number of experiments yourself to increase or decrease the complexity of the network. So it's pretty much always bound to exploration.
[00:35:00] So your question is: is there any theoretical, foundational work on which activation functions to use and how many layers to use? There are many research papers analyzing these, and also some methods for optimizing all of these meta- or hyperparameters of the networks. We're not going to get into them in detail, because a big part of it is very much dependent on the dataset and the problem you're solving. So the best answer to your question is: yes, there are some works out there, but each of them makes assumptions that may not necessarily hold for your application or problem. So, moving on: there are some biological inspirations. Again, these inspirations are very loose.
[00:36:06] If there is a neuroscientist sitting here or watching online: do not take all of the examples that I'm giving you as the ground truth. But generally, here is what happens in neurons; this is a visualization of a neuron. It has a cell body that aggregates the impulses carried through the dendrites to the cell body itself, and then through the axon those impulses are carried away to other neurons. This is very similar to what we are doing in our neural networks. We often have a function that captures the signals, all of the impulses, the activations, from the previous layers; in the cell body that function operates on the inputs, outputs the activations, and passes them on to the next layer, the next neuron.
[00:37:28] And that's basically why we need some sort of activation function here: to create the impulses, to increase or decrease the values. With that said, there are many differences between biological neurons, which can be way more complex than the neural networks we build, but generally there are common concepts. The neural networks that we build are usually organized into regular patterns, and those patterns exist because we want better computational efficiency when we implement the neural networks. There has been research on building more complex, irregular neural networks and trying to optimize them, but in terms of results they are almost comparable with the regular neural networks that we usually build and will be talking about in this class.
[00:38:36] I can't warn you enough about being careful with your brain analogies and how they can be interpreted. There are so many differences, and I'll just stop here; I'd be happy to discuss it if anybody is interested in the neuroscience aspect of things as well. So, plugging everything in: we had a scoring function. This scoring function turns the inputs, through some W's, some weight vectors or weight matrices, into scores. What we often use as the loss function for the network takes those scores through either the hinge loss or the softmax, or other variations. And in addition to that, we defined regularizers, which ultimately give us the total loss: the data loss plus the regularizer.
[00:39:51] And we talked about the fact that, in order to be able to optimize W1 and W2, what we need is to be able to take the partial derivatives of L with respect to W1 and W2: ∂L/∂W1 and ∂L/∂W2. There are many details that we have to be aware of. First, building these functions and then taking the derivatives and writing them down is often tedious. There are lots of matrix calculations, and you need a lot of work on paper before you can actually implement a neural network. The other problem is: what if you want to change the loss slightly after we have done all of the calculations on paper? In that case we have to redo the entire thing. And finally, this becomes intractable, and sometimes infeasible, if the loss function is complex.
[00:41:10] With complex functions this is going to be even harder. But there is a better idea, something that is often used in our implementations, and I'm going to go through a few examples today just to make sure everybody is on the same page and understands these topics. And that is computational graphs and the idea of backpropagation. A computational graph puts together all of the operations in the neural network step by step: we start from the inputs and all of the parameters that are needed, and get the loss as the final output, the final layer. So in this case we had a loss function, which could be a softmax function or a hinge loss function, whatever it is, and it is added to the regularizer, the function R(W), where R has W as its input.
[00:42:21] These two added together create the loss, and before the loss can be calculated we also need to combine X and W to create the scores; this is a multiplication node. This is actually very useful, because most of the neural networks that you build also have graphical representations, and all of these complex functions can be shown within the same framework. We can then use this to build their computational graph, starting from the input image or input data; there are a bunch of weights throughout the network, and finally there is the loss function. And again, this is useful because there are some complex neural networks, like the Neural Turing Machine, which is used for temporal and sequential data, so there's a lot of unrolling of this machine.
[00:43:30] And if we had to do all of that work manually, by hand, it would be intractable and not feasible. So that's why, once we build this computational graph, the solution is backpropagation. I want to start with a very simple example. We start with a function f(x, y, z) = (x + y) * z. If I draw the computational graph for this function, you see we have an addition operation between x and y, and then a multiplication between that sum and z. So, given an input setup of x = -2, y = 5, and z = -4, we can now make all of the calculations and do the forward pass, stepping forward through the network. The first step is adding x and y, which gives us 3.
[00:44:59] In order to understand the steps one by one, I'm giving this intermediate result a name: q = x + y. If I want to calculate the partial derivatives of q with respect to both x and y, that's very simple, because we have the formula relating q to x and y: ∂q/∂x = 1 and ∂q/∂y = 1 as well. So this is a simple setup; we know it exists, so let's keep it in the back of our minds. Then the second operation is f = q * z. Again, since we have this function, it's very easy to write down the partial derivatives: ∂f/∂q = z and ∂f/∂z = q. So it's a kind of swap between z and q. I'm hoping that everybody knows all of this from calculus.
[00:46:15] If you don't, you should definitely check it out and remind yourself, because these facts are very important, in general and for the rest of the quarter. What we want, what we need in this setup to complete this example of backpropagation, are the partial derivatives of f with respect to x, y, and z. The way backpropagation implements this is to start at the end of the network and go backward, back-propagating all of the gradients; this is basically a recursive process. So, what is the derivative of f with respect to f? It's the thing with respect to itself, right? The derivative of the loss function with respect to itself is always 1. If I want to backprop, the first, most immediate variable is z.
[00:47:38] You can see here that we have z, and if I calculate the derivative of f with respect to z, we already have it, right? ∂f/∂z = q. So whatever the value of q is becomes this gradient as well. Next we have q, the next variable that is directly connected to f. This is also easy to compute, because we have the derivative of f with respect to q; we already calculated that it's equal to z, and whatever z is, that's the value of the derivative here: -4. Next we have y, which sits directly before q. Although we need the derivative of f with respect to y, y and f are not directly connected, and that's where we use the chain rule, where we split the calculation of the derivative through the variable in the middle.
[00:49:03] So ∂f/∂y = (∂f/∂q)(∂q/∂y), right? This is how the chain rule is written in this case. And now I want to introduce two important new terms: the local gradient and the upstream gradient. The upstream gradient is the gradient that comes from the end of the network to the current node that we are in, and the local gradient is the gradient of the node's output with respect to its input; it is local to the node. Computing these is actually not too hard, because we already have the value of ∂f/∂q, and we also already have the value of ∂q/∂y. So it's 1 multiplied by z, and the value becomes -4.
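The whole toy example, the forward pass plus the backward chain-rule pass just worked through, fits in a few lines (a minimal sketch; the variable names are mine):

```python
# f(x, y, z) = (x + y) * z with the lecture's inputs.
x, y, z = -2.0, 5.0, -4.0

# Forward pass.
q = x + y                # addition node:        q = 3
f = q * z                # multiplication node:  f = -12

# Backward pass: multiply each local gradient by the upstream gradient.
df_df = 1.0              # derivative of the output with respect to itself
df_dz = df_df * q        # f = q*z  ->  local df/dz = q  ->  3.0
df_dq = df_df * z        # f = q*z  ->  local df/dq = z  -> -4.0
df_dy = df_dq * 1.0      # chain rule: upstream df/dq times local dq/dy = 1
df_dx = df_dq * 1.0      # chain rule: upstream df/dq times local dq/dx = 1

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```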
[00:50:10] Same story for the other variable, x: the local-times-upstream product can again be written down with the chain rule, and it also results in -4, because in both cases the local gradient with respect to x or y was already 1, so both of them get the same value. With this computational setup and the computational graph, it becomes very easy to modularize what we want to do for every single node in the neural network: the node has x and y as inputs, or whatever else, and z as the output. What we need first are the local gradients, which we can always compute, since we have the node's function f as a function of x and y.
[00:51:17] The gradient of the output with respect to each of the inputs is easy to calculate for every single node, and what we need in order to back-propagate is the upstream gradient, right? The backpropagation process gives us the power to obtain this upstream gradient step by step. So when we are at this node, we also have the upstream gradient already calculated from the later nodes, and that's what we need. What we do next is simply multiply the upstream gradient by the local gradient, creating what we now call the downstream gradients. The downstream gradients become the upstream gradients for the previous layers, right? So that's how we calculate it for x; same story when it comes to y.
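That modular recipe, cache the inputs on the way forward, then return local gradient times upstream gradient on the way back, can be sketched as a single gate (a hypothetical `MultiplyGate` class for illustration, not code from the lecture):

```python
class MultiplyGate:
    """One node z = x * y in a computational graph."""
    def forward(self, x, y):
        self.x, self.y = x, y      # cache inputs for the backward pass
        return x * y

    def backward(self, dz):
        # downstream gradient = local gradient * upstream gradient (dz)
        dx = self.y * dz           # local dz/dx = y
        dy = self.x * dz           # local dz/dy = x
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)      # the q * z node from the example
dq, dz = gate.backward(1.0)        # upstream gradient from the loss is 1
print(out, dq, dz)                 # -12.0 -4.0 3.0
```

The returned values would in turn be handed to the previous nodes as their upstream gradients.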
[00:52:19] So this whole process gives us the power to do all of these calculations completely locally, going backwards step by step and handing the results to the previous nodes so they can continue the process. Again, this is one of the most fundamental operations in all of neural networks and in many optimization processes involving multiple layers of information. If I understand the question correctly, you're asking how we can understand intuitively what the gradients are doing, right? So let's take one step back and see why we are here to begin with. What we needed was to calculate the gradient of the loss function with respect to W1 and W2, and the W's in general, so that we can take a step in the negative direction, the opposite direction of the gradients, to find the optimal value, right, the optimal loss.
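The descent step those gradients feed into is a single update (a sketch; the learning rate and the toy gradient values are assumptions):

```python
import numpy as np

# One step of gradient descent: move W against the gradient of the loss.
lr = 0.1                              # assumed learning rate
W = np.array([0.5, -1.0, 2.0])        # toy weights
dL_dW = np.array([0.2, -0.4, 1.0])    # gradient produced by backpropagation

W = W - lr * dL_dW                    # step in the negative gradient direction
print(W)
```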
[00:53:23] So in order to do that, we need the gradient of the loss with respect to everything. What we are doing is moving the gradient of L with respect to all variables back to every single value of the network, without sitting down and writing out the function for the entire network. If the network has 100 layers, we're not going to write out the function for all 100 layers separately. This is how we back propagate step by step to get the values we need to optimize every single weight in the network. [00:54:05] Okay, another example. This is a slightly more complex function of the weights and x: f = 1 / (1 + e^-(w0*x0 + w1*x1 + w2)), a sigmoid of a linear combination of x and w.
[00:54:26] So there are a bunch of multiplications, additions, a negation, the exp function, and ultimately a 1/x applied to whatever we calculated. With all of those, let's look at this example where we have specific values for w0, x0, w1, x1, and w2. With these given values we can do the forward pass and calculate every single value in the process. Just to remind you of the details we know from calculus: for the exp function e^x, the derivative with respect to x is e^x; for multiplication by a constant, the derivative is the constant itself; 1/x has a derivative of -1/x^2; and for addition of a constant, the derivative is always equal to one.
[00:55:40] So, as I said at the very beginning, at the end of the network the derivative of L with respect to L is always equal to one. That's where we start, using the rule for the derivative of the function 1/x. The upstream gradient, as I said, is always one at the end. The local gradient is -1/x^2, where x is whatever the input is. This calculation results in -0.53. So -0.53 is the downstream gradient, which becomes the upstream gradient for the next node. At the next node the function is just a constant addition, where we know the local gradient equals one, so one multiplied by the upstream gradient, and the same value goes back. [00:56:39] The next step is the exp function.
[00:56:45] For that, again, we already have the upstream value, and the local gradient is e^x, where x, the input to this step, is -1. Calculating this gives us -0.20, and that goes back to the next step. [00:57:08] Here we again have a multiplication with a constant number, which makes the local gradient equal to that constant value, defining the new downstream gradient. Going back, we now have an addition function that is getting two inputs with different values. Again, if we want the upstream gradient, it's 0.20; we already have it. The local gradients will be equal to one, because it's just an addition between two values.
[00:58:02] The derivative of x + y with respect to both x and y is always one, so both inputs receive the same gradient. Then we have multiplication operations. For multiplication, again, we have the upstream gradient values, and for the local gradient: if we have, say, a multiplied by x, the derivative with respect to x is always a, the other variable. So here, for the first one it's -1, which is the value of x, and for the second one it's 2, which is the value of w: always the other variable, whatever value it has. With that we can calculate everything, and then also calculate the gradients with respect to w1 and x1. Again, we made all of these calculations so we can identify how much W should change in order to step towards the optimal point in the network. [00:59:11] So this was another example.
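The whole worked circuit can be written out in a few lines. The lecture only states x0 = -1 and w0 = 2 explicitly; the remaining values below (w1 = -3, x1 = -2, w2 = -3) are assumptions, chosen so the intermediate gradients match the -0.53 and 0.20 quoted above.

```python
import math

# inputs: only x0 = -1 and w0 = 2 are stated in the lecture; the rest are assumed
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# forward pass: f = 1 / (1 + e^-(w0*x0 + w1*x1 + w2))
s = w0 * x0 + w1 * x1 + w2             # 1.0
e = math.exp(-s)
f = 1.0 / (1.0 + e)                    # ~0.73

# backward pass, node by node
d_f = 1.0                              # dL/dL is always 1
d_inv = (-1.0 / (1.0 + e) ** 2) * d_f  # 1/x node: local -1/x^2, gives ~ -0.53
d_add = 1.0 * d_inv                    # +1 node: local gradient 1
d_exp = e * d_add                      # exp node: local e^x, gives ~ -0.20
d_neg = -1.0 * d_exp                   # *(-1) node: gives ~ 0.20

# the add gate distributes, the multiply gates swap
dw0 = x0 * d_neg                       # ~ -0.20
dx0 = w0 * d_neg                       # ~  0.39
dw1 = x1 * d_neg
dx1 = w1 * d_neg
dw2 = 1.0 * d_neg                      # ~  0.20
```

Note that d_neg equals f * (1 - f), which is exactly the sigmoid shortcut discussed next.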
[00:59:14] There are many different ways to draw a computational graph; the one I explained was not the only option. We can actually lump all of these functions together and define a sigmoid, because this is basically the sigmoid of a linear function: the linear function could be one node, and then all of the remaining operations could be defined as a sigmoid. And sigmoid is interesting and very useful, because its local gradient depends on the sigmoid itself: the derivative of sigmoid with respect to the variable x, if we do the calculations and simplify, is (1 - sigmoid(x)) multiplied by sigmoid(x). So it's actually a very useful function.
[01:00:08] To calculate the downstream gradient here: the upstream gradient was 1.00, and if I calculate the local gradient, this sigmoid expression with x replaced by the input (which was 1.00), and multiply it by one, I get 0.20, which is exactly the same value we got before by doing it step by step. [01:00:42] I want to summarize and say that there are a few gradient patterns for common nodes that we can actually kind of memorize.
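The sigmoid shortcut mentioned above is easy to check numerically; a quick sketch comparing the analytic local gradient sigmoid(x) * (1 - sigmoid(x)) against a centered finite difference:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 1.0                  # the input value from the example above
s = sigmoid(x)           # ~0.73
local = s * (1.0 - s)    # analytic local gradient: ~0.20

# centered finite-difference approximation of the derivative
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)
```

The two values agree to well within the finite-difference error.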
[01:01:01] There is the add gate: the add gate is always a gradient distributor, because of the properties of addition that I explained; the gradient going to each input remains the same as the upstream gradient. For the multiplication gate, it's a swap: again, I told you the gradient of xy with respect to x is y, and with respect to y it is x, so the values are swapped. Then there is the copy gate: in the copy gate, the backward operation is just an addition of the gradients coming back to the node. And ultimately there's the max gate, which is something we use quite often, very similar to the ReLU function. Because the max gate takes the max between its inputs, you just relay, or route, the gradient towards whichever input had the max value.
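These four patterns can be written down directly. A small sketch (the variable names and values are mine):

```python
upstream = 2.0

# add gate: distributor; each input receives the upstream gradient unchanged
d_add_x, d_add_y = upstream, upstream

# multiply gate: swap; for z = x * y, each input gets the *other* input's value
x, y = 3.0, -4.0
d_mul_x, d_mul_y = y * upstream, x * upstream   # -8.0 and 6.0

# copy gate: backward is the sum of the gradients from each branch
d_branch1, d_branch2 = 0.5, 1.5
d_copy = d_branch1 + d_branch2                  # 2.0

# max gate: router; the winning input takes the whole gradient
a, b = 5.0, 2.0
d_max_a = upstream if a >= b else 0.0           # 2.0
d_max_b = upstream if b > a else 0.0            # 0.0
```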
[01:02:16] So with that, it's very simple to implement a neural network: in the forward pass we compute all of the steps, and then in the backward pass we start computing the gradients step by step. As I explained, the gradient of the loss function with respect to itself is always one, and then we start from the end of the network and go up. You can see here that we are going up: this is the sigmoid function, calculating the gradients; then going up, that was the add gate; we had another add gate; and then we had two multiply gates, which gives us a very simple implementation. [01:03:04] With this type of formulation, what we can do is modularize every function in the neural network and create forward and backward APIs for every single function we need.
[01:03:29] So in this case, this is a multiplication gate. Because for multiplication we need to access the inputs in the backward pass, we often save them (memorize them), then calculate the forward pass values, and then in the backward pass calculate the gradients. This means we can write our functions with the forward and backward passes all in one place. And this is how PyTorch operators look right now. If you look at the sigmoid layer, for example, there is the forward pass, although it's not implemented in this very function: it's actually implemented somewhere else, in the C++ code. And then the backward pass of sigmoid also calculates the same function that we just talked about.
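The modular forward/backward pattern can be sketched as a small Python class. This mirrors the idea described in the lecture, not PyTorch's actual internals; the class name is mine.

```python
class MultiplyGate:
    """A gate exposing a forward/backward API, caching inputs for the backward pass."""

    def forward(self, x, y):
        # save the inputs: the backward pass needs them for the local gradients
        self.x, self.y = x, y
        return x * y

    def backward(self, upstream):
        # downstream = local gradient * upstream gradient (the swap rule)
        dx = self.y * upstream
        dy = self.x * upstream
        return dx, dy

gate = MultiplyGate()
z = gate.forward(2.0, -1.0)    # forward pass: -2.0
dx, dy = gate.backward(0.2)    # backward pass: (-0.2, 0.4)
```

Chaining many such gates, each one only ever sees its own inputs and an upstream gradient, which is exactly the local property backpropagation relies on.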
[01:04:22] So far, everything we've said (and I've actually covered most of the examples I wanted to cover) used scalar values; all of the examples were just scalars. But we know that all of these operations can actually be implemented in vector or matrix form. Expanding on that piece: we've talked about the scalar-to-scalar setting, where for inputs x and y both scalars, the derivative is also a scalar, which tells us how much the value of y changes if we change x by a small amount. If it's now vectorized, if x is a vector of n elements and y is a scalar (a vector-to-scalar derivative), then the derivative will also be a vector,
and every single element in that vector tells us: if we change that element of x by a small amount, how much does y change? The entire y, because y is just one single value. [01:05:55] Then there are also vector-to-vector settings, where x and y are both vectors, of arbitrary sizes n and m. In those cases the derivatives form a matrix, or what we call a Jacobian. For each element of x, if it changes by a small amount, this derivative tells us how much each element of y will change. Again, look at the subscripts here: they can be different for every single element in this Jacobian, and each has a clear meaning.
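As a concrete sketch of the vector-to-vector case, here is a small hypothetical function from R^3 to R^2, its 2-by-3 Jacobian J[i, j] = dy_i / dx_j, and a finite-difference check. The function itself is my example, not one from the slides.

```python
import numpy as np

def f(x):
    # hypothetical example: f maps R^3 -> R^2
    return np.array([x[0] + 2.0 * x[1], x[1] * x[2]])

x = np.array([1.0, 2.0, 3.0])

# analytic Jacobian, shape (output dim, input dim) = (2, 3)
J = np.array([[1.0, 2.0, 0.0],
              [0.0, x[2], x[1]]])

# numerical check: perturb each input element in turn
eps = 1e-6
J_num = np.zeros((2, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    J_num[:, j] = (f(x + e) - f(x - e)) / (2.0 * eps)
```

Column j of the Jacobian is exactly "how all outputs move when input j is nudged", which is what the loop computes numerically.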
[01:06:48] And here is how we can see and visualize it: if you want to backprop with vectors, say x, y, and z are vectors of sizes Dx, Dy, and Dz. Again, the loss L is always a scalar, because that's the one value we want to minimize. But calculating the upstream gradient results in a vector, dL/dz, of the same size as its variable z, and the same story happens with the downstream gradients. Actually, before going to the downstream gradients, let me tell you a little bit about the local gradients, where we have the gradient of z with respect to x and y. This is the part where, as I said, there will be Jacobians, because now the gradients turn into matrices.
[01:08:04] So we have two Jacobian matrices here, whose sizes are the size of the input multiplied by the size of the output. This results in downstream gradients that are the multiplication of the upstream gradient and the local gradient, and there we get the same size as the input x itself. So we will have a vector again here, because the input was a vector; the gradient has the same size. I just mentioned that the gradient of a variable with respect to the loss always has the same dimensionality as the original variable itself, as also shown in this slide. [01:09:02] Backprop with vectors: that was just one example. Let's say we have a function which is the max of zero and x; that's the ReLU function. This is an elementwise function that takes a max between zero and the input: if the input is non-negative it passes through; otherwise it is replaced with a zero.
[01:09:28] Assume you get some upstream gradients; now we need to build a Jacobian matrix here. In this case, because this is an elementwise operation, it doesn't have any dependence on any of the other inputs; the only dependence is on the value itself. So this is a very sparse matrix: it only has values on the main diagonal, and those values are either zero or one, depending on whether the value was passed through or a zero was put in its place. Multiplying it by the upstream gradient gives us the downstream gradient, and this is how the calculations are done.
[01:10:26] As I said, the Jacobian here is sparse, because in this case the operation is elementwise. So in the backward pass, instead of calculating that huge sparse Jacobian matrix, what we do is just use a rule-based calculation of the gradient for this max function. We don't actually store or compute that matrix, because we know how the function operates. [01:11:01] This can also be extended to matrices, and even tensors, if the inputs are not vectors but higher-dimensional data.
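That shortcut is easy to verify: for ReLU, the explicit diagonal Jacobian and the simple elementwise mask give identical downstream gradients. A small numpy sketch with values of my choosing:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -0.5])       # input to ReLU
upstream = np.array([4.0, 4.0, 4.0, 4.0])  # upstream gradient dL/dy

# explicit (sparse) Jacobian: 0/1 entries on the main diagonal only
J = np.diag((x > 0).astype(float))
dx_jacobian = J @ upstream

# rule-based shortcut: never build the Jacobian at all
dx_fast = np.where(x > 0, upstream, 0.0)
```

Both give [4, 0, 4, 0]: the gradient passes through where the input was positive and is zeroed elsewhere, with no large matrix ever materialized.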
[01:11:16] In those cases, again, the gradients with respect to the variables are of the same size as the specific variable, and calculating the upstream and downstream gradients is done the same way as we discussed and showed earlier for vectors. When it comes to the local gradients, however, it's going to be a huge matrix, a huge Jacobian, because we have a matrix as the output and a matrix as the input, and the local gradient will have the size of the input multiplied by the size of the output. So it's going to be a huge matrix by itself. Let me give you an example: here x and w are the inputs to a node, a gate performing matrix multiplication, and it generates this y as the output.
[01:12:30] Calculating the derivative of L with respect to y gives us these Jacobian matrices. Say we have a mini-batch size of 64 and the dimensionality of those matrices is 4096; then that huge Jacobian matrix would be over 256 GB for just one single matrix multiplication. [01:13:00] So in order to simplify this, what we often do is look at the values and how they impact each other: for example, which parts of y will be affected if one element of x changes. The element x[n,d] affects just one row of the output. This basically helps us see that, for calculating each of these nodes, we don't need to create the huge Jacobian.
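The 256 GB figure follows directly from the Jacobian's shape. With batch size N = 64 and feature dimensions D = M = 4096 (as quoted above), dY/dX has one row per output element and one column per input element:

```python
# Jacobian of Y = X @ W, with X of shape (N, D), W of shape (D, M)
N, D, M = 64, 4096, 4096

entries = (N * M) * (N * D)   # (rows = output elements) x (cols = input elements)
bytes_fp32 = entries * 4      # 4 bytes per float32
gib = bytes_fp32 / 2**30      # 256 GiB for a single matrix multiply
```

Hence nobody ever materializes this matrix; the backward pass is written directly in terms of smaller matrix products instead.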
[01:13:46] We can actually write the backward pass functions specifically for matrix multiplication in a more efficient way. I'm almost done, so let's answer this question: how much does x[n,d] affect the value of y[n,m]? This is y[n,m], which is getting impacted by x[n,d]. How much does it get impacted? In other words, what should I place as its gradient with respect to the specific value x[n,d]? Just to remind you, this is a multiplication operation. [01:14:37] In multiply gates it should be a swap, right?
[01:14:45] So the answer to this question is some value in W. Remember that we had this multiplication gate, which was a swap multiplier, so there is a swap happening here. How much an element of X affects one of the elements in Y is given by an element of W: the one in row d (determined by the X index) and column m (determined by the Y index). So it's swapping the values, the same swap as before, but now we have to look at the giant matrices and find which specific element it should be. And based on that, we can actually replace the entire thing with matrix operations. The gradient of L with respect to X will be defined by this simple matrix operation, and the gradient of L with respect to W will be defined by this very simple multiplication. Again there is a swap: for X we include the entire W.
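The two matrix-form gradients just described can be sketched in a few lines of numpy, with a finite-difference check on one element (the sizes here are tiny and illustrative):

```python
import numpy as np

# Backward pass for Y = X @ W without forming the giant Jacobian:
#   dL/dX = dL/dY @ W.T   (for X we include the entire W: the "swap")
#   dL/dW = X.T @ dL/dY   (for W we include the entire X)
rng = np.random.default_rng(0)
N, D, M = 4, 5, 3
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, M))
dY = rng.standard_normal((N, M))   # upstream gradient dL/dY

dX = dY @ W.T                      # shape (N, D), same as X
dW = X.T @ dY                      # shape (D, M), same as W

# Numerical check of one element, using L = sum(Y * dY) so that the
# upstream gradient dL/dY is exactly the dY chosen above.
eps = 1e-6
Xp = X.copy()
Xp[0, 0] += eps
num = (np.sum((Xp @ W) * dY) - np.sum((X @ W) * dY)) / eps
assert np.isclose(num, dX[0, 0], atol=1e-4)
```

Note that each gradient has the same shape as the tensor it corresponds to, which is a useful sanity check when deriving these expressions.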
[01:16:04] For W we include the entire X and do the multiplications. These formulas make it easy to implement larger and harder operations in the backward passes. All right, we're done. Just to summarize, we talked today about fully connected neural networks. We went through all the steps needed for backpropagation, the forward passes and backward passes, and next session we will be getting into the topic of convolutional neural networks. Thank you.

================================================================================ LECTURE 005 ================================================================================ Stanford CS231N | Spring 2025 | Lecture 5: Image Classification with CNNs Source: https://www.youtube.com/watch?v=f3g1zGdxptI --- Transcript

[00:00:05] Today we're going to be talking about image classification with CNNs. So you might be wondering who am I? I'm a new face; you haven't seen me before in this class. I'm Justin. I'm the fourth mystery instructor in this class.
[00:00:17] I think my picture's been on the website, but it's my first time here today. A little about me: I did my PhD here at Stanford from 2012 to 2018, working with Fei-Fei on deep learning and computer vision, pretty much all tasks in computer vision around that time. During my time here at Stanford, I was lucky enough to initiate CS231N with Andrej and Fei-Fei and others, and to teach it quite a few times: 2015, '16, '17, '18, and '19 at Stanford. After that I spent time at Facebook AI Research doing all kinds of deep learning and computer vision work there. Then I was a faculty member at the University of Michigan, where I taught basically the same class a couple more times. So I've taught this class a few times, but it's been a while since I've been here. Most recently I've been doing a startup called World Labs with Fei-Fei.
[00:01:07] And that's just a little bit about me. Now, about where we are in this class. We're at an interesting point right now, where the class is divided up into a couple of different segments, and we've basically finished the first segment, which is about deep learning basics. This is really cool, because the material you've seen in just four lectures covers all the fundamentals of deep learning: you now know the whole pipeline of basic pieces that go into building a deep learning system. So I thought it would be useful, at this inflection point, to step back and recap some of the major themes we've seen in the first bit of the course. The first is the idea of image classification with linear classifiers.
[00:01:50] This was meant as a toy problem to give you a sense of the kind of problem you might solve with deep learning. Usually the first step in solving a deep learning problem is to define your problem so that you take some grid of numbers, some tensors, as input, produce some tensors as output, and formalize the task as this input-output mapping between tensors. We did that in the image classification setting by saying that we want to classify images into a bunch of human-understandable categories. The inputs are grids of pixel values arranged in three-dimensional tensors. The outputs are scores, one per category, giving the affinity or the degree to which the image belongs to each category.
[00:02:27] You define a set of categories in advance, and the network is supposed to predict high scores for categories the image is likely to be, and low scores for the categories it is likely not to be. Then we can set up a problem where we use a weight matrix, multiply it against the image pixels, and predict these scores. We saw that there are a couple of different viewpoints, a couple of different ways we can interpret these linear classifiers. This basically sets up a functional form saying that we can predict scores for images, if only we have a weight matrix W. So then the question is: how do we select a good weight matrix W? And for that we go to loss functions. Right?
[00:03:03] Loss functions are these things that tell us, given a particular value of the weight matrix and a particular dataset, how well this weight matrix solves the problem on that dataset. In particular, we saw some examples of loss functions that are commonly used for classification problems, including the softmax loss and the SVM loss as well. Okay, so now we've gotten a little further along: we've set up the problem of image classification, we have a model for solving that problem using linear classifiers, and we have a way to tell whether our solutions are good using a loss function. But now we actually need to search for a good solution in that space, and that's where optimization comes in. Right?
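The softmax loss mentioned in the recap can be sketched in a few lines of numpy (a minimal, numerically stabilized version; the function name and toy scores are illustrative):

```python
import numpy as np

def softmax_loss(scores, labels):
    """Mean cross-entropy loss. scores: (N, C), labels: (N,) int class ids."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

scores = np.array([[2.0, 1.0, 0.1]])
# Loss is small when the correct class has the highest score,
# and large when the label points at a low-scoring class.
print(softmax_loss(scores, np.array([0])))
print(softmax_loss(scores, np.array([2])))
```

A weight matrix that makes this number small across the whole dataset is exactly what the optimization step below searches for.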
[00:03:43] So now you think of defining this optimization landscape, where the horizontal plane holds all the different possible settings of your weight matrix, and the loss function is the height of that surface. A high loss is bad, because losing things is bad, so you want low loss. The purpose of optimization is to somehow traverse this space, slide down this manifold, and find a point at the bottom with very low loss. Each point in this space corresponds to a weight matrix, so by sliding down that space we find a good weight matrix that solves our problem and gives us a good solution to our task. In particular, we saw a couple of different optimization algorithms that are commonly used in deep learning pipelines.
[00:04:24] Stochastic gradient descent, usually with momentum, RMSProp, and Adam. One interesting topical note: right now, one of the biggest deep learning research conferences is ICLR, the International Conference on Learning Representations, and just yesterday ICLR 2025 gave its Test of Time award to the Adam paper, because the paper that introduced the Adam optimizer was published at ICLR ten years ago, in 2015. A lot of academic conferences tend to give Test of Time awards to some of the most impactful papers from ten years prior. So just yesterday, the Adam optimizer that you saw in this class received this very prestigious Test of Time award at ICLR 2025.
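The Adam update just mentioned can be sketched as follows, using the commonly quoted default hyperparameters; the function name and the toy objective (minimizing w squared) are illustrative:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (RMSProp-like)
    m_hat = m / (1 - beta1**t)               # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2, whose gradient is 2w.
w, m, v = np.array([1.0]), 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.1)
print(w)  # moves toward the minimum at 0
```

The same loop structure works for SGD with momentum or RMSProp; Adam is essentially the combination of the two moment estimates above.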
[00:05:06] I thought that was pretty cool, and a nice way to connect what you've been learning to what's happening right now in the machine learning community. Okay. So at this point we've got our linear classifiers, we've got our loss functions, and we can optimize them. Now we're almost good to go, but we ran into a problem: the linear classifiers that we started with are actually not very powerful. We saw two different ways of looking at this deficiency in linear classifiers. One was the visual viewpoint, where you can interpret a linear classifier by thinking of the learned weight matrix as an image: you learn one image template for each of the categories you're trying to classify against.
[00:05:46] If you think about it that way, we realize that each row of that weight matrix is one template, so the linear classifier needs to summarize all of its knowledge about each category into just one template. That's just not a very powerful classifier. This shows up in the visualized templates of a learned linear classifier, where you can see that for categories like car, the template looks like a red blob. But cars don't have to be red, right? What if your car is blue or purple or green or something else? There's just no good way for a linear classifier to capture the notion that objects in a category might have many different appearances.
[00:06:27] Or, from the geometric viewpoint, if we imagine each point of our dataset as a point in high-dimensional space, then a linear classifier is basically carving up that space with hyperplanes. That's really good if all your categories actually do lie in linearly separable regions of the space, but there's no reason to expect that to be true in general. So these are two big deficiencies that we ran into when applying linear classifiers to image classification problems. That led us to define the notion of neural networks, where we generalize our linear classifiers to no longer have just one weight matrix, but instead stack two weight matrices on top of each other, with a nonlinearity in between them.
[00:07:06] And now this gives us a much more powerful mechanism for predicting scores from our inputs. The problem is still the same: we have our input pixels going through some computation and spitting out scores; we just selected a different functional form for the score function. The algebra is pretty simple: you go from f = Wx, add an extra W2, and put a little nonlinearity in between. So the algebra doesn't change very much, but in doing so your classifiers get much, much more powerful than they were before. But now things get a little bit complicated again, because how does this play into optimization, right? We know that if we have a loss function and a model, then we want to find values of those weight matrices that cause the loss to go down.
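The two-layer form just described, f(x) = W2 max(0, W1 x), can be sketched in numpy; the dimensions here are illustrative (a flattened 32x32x3 image, 100 hidden units, 10 classes):

```python
import numpy as np

# Two-layer network scores versus the linear classifier f(x) = W @ x.
rng = np.random.default_rng(0)
D, H, C = 3072, 100, 10                  # input dim, hidden dim, num classes
x = rng.standard_normal(D)               # a flattened image
W1 = rng.standard_normal((H, D)) * 0.01  # small random initialization
W2 = rng.standard_normal((C, H)) * 0.01

h = np.maximum(0, W1 @ x)                # hidden layer with ReLU nonlinearity
scores = W2 @ h                          # one score per class
print(scores.shape)                      # (10,)
```

Without the max(0, .) in between, the two matrices would collapse into a single matrix W2 @ W1 and the model would be no more powerful than the linear classifier; the nonlinearity is what adds the expressive power.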
[00:07:51] And to do that, we need to be able to compute gradients of the loss with respect to all the parameters of our model. That's the notion of a computational graph. Computational graphs are basically a data structure for organizing the computation of a neural network, where each node in the graph is a little functional primitive, like a matrix multiply or a ReLU or something else like that. Data flows forward in this graph from left to right, from our inputs and weights on the left, through all the intermediate nodes, to spit out the loss function on the right. And then once we compute the loss, we traverse the graph backwards, from right to left, to compute gradients of that loss with respect to all the nodes of the graph inside the network.
[00:08:33] And this is really cool, because it basically means that we can write down arbitrarily complicated neural networks, arbitrarily complicated expressions for computing our outputs from our inputs, and we have a nearly automated algorithm for computing whatever gradients we want through arbitrarily complex neural networks. The way we do that is the magic of backpropagation. And backpropagation is really cool.
[00:08:56] I think it's one of the algorithms that makes deep learning work, because it takes the global problem of computing gradients of the loss through the computational graph and converts it into a local problem. Each node doesn't need to know anything about the larger context: what graph am I living in, what problem am I trying to solve. We just need to define these little nodes inside our computational graph that, on the forward pass, know how to compute outputs from their inputs, and on the backward pass can receive gradients coming from upstream. A node doesn't have to care where those gradients come from or what caused them; it just needs to compute downstream gradients with respect to its inputs, given its upstream gradients.
[00:09:40] Again, this is so powerful because it gives us a mechanism where we can define a bunch of different types of nodes that all follow this local API of computing outputs and computing local gradients. As long as we follow that API for all of the nodes, we can stitch them together into big, complicated computational graphs that can do basically arbitrary computation, and the gradients just come for free when we turn the crank on the backpropagation algorithm. The slide that you saw last time is basically backpropagation on scalar values, but we can generalize this to work on vector-valued or matrix- or tensor-valued quantities as well. The basic thing to remember is that your inputs are some tensors and your outputs are some tensors.
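The local forward/backward API just described can be sketched with the multiply gate from the previous lecture; the class name is illustrative:

```python
# Each node only knows two things: how to compute its output from its
# inputs (forward), and how to turn an upstream gradient into downstream
# gradients using locally cached values (backward).
class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y        # cache inputs for the backward pass
        return x * y

    def backward(self, dout):
        # local gradients: d(xy)/dx = y, d(xy)/dy = x  (the "swap")
        return dout * self.y, dout * self.x

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)        # forward: -12.0
dx, dy = gate.backward(2.0)          # upstream gradient 2.0
print(dx, dy)                        # -8.0 6.0
```

Any graph built from nodes obeying this interface can be differentiated end to end: backpropagation just calls backward on each node in reverse topological order.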
[00:10:27] Your upstream gradient is the gradient of the loss with respect to your outputs, and it always has the same shape as your outputs, right? Because the loss is a scalar. The gradient of a loss with respect to a tensor says, for each element in the tensor, if I were to wiggle that element a little bit, how much does the loss wiggle? And because the loss is a scalar, we just need to imagine wiggling each element of the tensor independently; that is the definition of the gradient. So it's very easy to remember: your upstream gradients always have exactly the same shape as your outputs, and your downstream gradients, the gradients with respect to your inputs, have the same shape as your inputs. Right?
[00:11:06] So then the backpropagation algorithm is basically just the chain rule, where I need to somehow compute my downstream gradients as a function of my upstream gradients and whatever function I was trying to compute. And you'll get some practice on later assignments writing down the gradient expressions for different kinds of operators in your neural networks. So basically this gives us our recipe for solving pretty much any problem in deep learning, right? This was intended to be quite a bit more general than just image classification, or just linear classifiers, or just fully connected networks. Now, if you have any kind of problem that you want to solve, you just need to encode it as tensors, write down some computational graph that computes your output tensors from your input tensors, and collect a data set of input-output tensors.
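As a taste of the kind of gradient expressions those assignments ask for, here is a hedged sketch (the shapes and names are made up for illustration) of the backward pass for a fully connected layer y = x @ w, checked against a numeric gradient on a scalar loss:

```python
import numpy as np

# Forward and backward for a fully connected layer y = x @ w.
# Shapes: x is (N, D), w is (D, M), y is (N, M); the upstream gradient
# dy has the shape of y, downstream gradients match x and w.
def fc_forward(x, w):
    return x @ w

def fc_backward(dy, x, w):
    dx = dy @ w.T    # (N, M) @ (M, D) -> (N, D), same shape as x
    dw = x.T @ dy    # (D, N) @ (N, M) -> (D, M), same shape as w
    return dx, dw

# Numeric gradient check on the scalar loss L = sum(y):
x = np.random.randn(2, 3)
w = np.random.randn(3, 4)
dy = np.ones((2, 4))                  # dL/dy for L = y.sum()
dx, dw = fc_backward(dy, x, w)

eps = 1e-6
i, j = 1, 2                            # perturb one element of x
xp = x.copy()
xp[i, j] += eps
numeric = (fc_forward(xp, w).sum() - fc_forward(x, w).sum()) / eps
assert abs(numeric - dx[i, j]) < 1e-4  # analytic matches numeric
```

The same wiggle-one-element numeric check works for any operator, which makes it a handy way to debug a hand-derived backward pass.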
[00:11:47] Write down a loss function for the kind of problem you want to solve, and then optimize that loss function using gradient descent, using backpropagation. And that's a really powerful recipe that basically powers all deep learning applications. Whether it's image classification, image generation, or large language models, pretty much anything involving a neural network is trained using this formula or some slight variant on top of it. So that leads us to the second part of the class, which is perceiving and understanding the visual world. Here is where we want to get a little bit more specialized and start talking not about the general framework of deep learning, but about how it applies to the problems that we want to solve in computer vision: processing images and doing interesting things with images.
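The five-step recipe above can be sketched end to end on a toy problem. Everything here (the data, the sizes, the learning rate) is made up for illustration; the point is just the shape of the loop: tensors in, computational graph, loss, gradient descent.

```python
import numpy as np

# 1) encode the problem as tensors, 3) collect input/output pairs
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # inputs
true_w = np.array([[1.0], [-2.0], [0.5]])
Y = X @ true_w                                # targets
w = np.zeros((3, 1))                          # learnable weights

for step in range(500):
    pred = X @ w                              # 2) the (tiny) graph
    loss = ((pred - Y) ** 2).mean()           # 4) mean squared error loss
    dw = 2 * X.T @ (pred - Y) / len(X)        # backprop through the graph
    w -= 0.1 * dw                             # 5) gradient descent step

assert loss < 1e-6                            # the loop converged
```

Swapping the one-matrix graph for a deep network, and the squared error for a classification loss, gives the same loop that trains essentially every model mentioned in the lecture.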
[00:12:30] And today we'll take a step towards that by talking a bit more about convolutional networks. Convolutional networks are actually a pretty small lift on top of this framework that we've already defined, right? We have this general paradigm of computational graphs, and of little operators that can live inside of our computational graphs. So we have this beautiful framework, but we actually haven't filled in a lot of its specifics. We've only seen two or three different kinds of nodes that can live inside of our computational graphs. We've seen fully connected layers, which are basically a matrix multiply.
[00:13:06] We've seen activation functions like our ReLU, and we've seen our loss functions themselves. Now, to build up from what we've seen already into convolutional networks, basically all we need to do is add a couple of new types of nodes that can fit into our computational graphs. In particular, there are really only two operators that we need to talk about to build much more powerful networks: the convolution layer, which we'll spend most of today's lecture talking about, and the pooling layer, which is another thing that we often use when processing images. So that's the road map for today. I want to talk a little bit about convolutional networks in general, and then we'll talk about these two particular computational primitives that we can use to build convolutional networks in our computational graphs.
[00:13:51] Okay, so here we want to step back a little bit and think about this problem of image classification again. We've already talked about how image classification is this super core problem in computer vision, where we want to take an input image and predict what is in that image; basically, predict one of K category labels. This image obviously is a cat, so we want to predict the cat label. And we've basically solved this problem in some sense already, by building linear classifiers and by building fully connected multi-layer perceptron neural networks. But these networks are basically operating in pixel space. Remember, we said the first step to solving a
deep learning problem is to formulate it in terms of input-output tensors. [00:14:35] Well, in this case our input tensors were the raw pixel values of our images. So when we write f(x) = Wx, that x input is just the literal values of all of our pixels, and then we go from those raw pixel values to our class scores. But there's another way to do it, which was common back in the dark ages before neural networks came about and saved us from all this tedium, maybe from the early 2000s up until 2010 or 2011-ish: this idea of feature representations. So here the idea is that you can actually choose what is going to be the input to your neural network.
[00:15:12] So you could have said: rather than feeding the raw pixel values of the image into our neural network, we could instead define some other kind of function which is going to extract features, converting those pixel values of our image into some other, meaningful representation that we, as the intelligent human designers of this system, believe captures some of the important facets of our input image. So then, if you're doing image classification on top of a feature representation, your step one would be to define a feature representation that converts your raw image pixels into this higher-level representation. And now that feature representation will be the X that feeds into your linear classifier.
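In code, the only change is what feeds the classifier. A minimal sketch, where phi is a deliberately made-up, hand-designed feature extractor standing in for the kind people used to build:

```python
import numpy as np

# Hypothetical hand-designed feature extractor: three summary statistics
# of the image stand in for a real engineered representation.
def phi(img):
    return np.array([img.mean(), img.std(), img.max()])

W = np.random.randn(5, 3)          # 5 classes, 3 hand-designed features
img = np.random.rand(32, 32)       # toy grayscale "image"
scores = W @ phi(img)              # scores = W * phi(x) instead of W * x
assert scores.shape == (5,)
```

The linear classifier is unchanged; only phi(x), not the raw x, is what it ever sees, and only W is learned.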
[00:15:55] And there was a ton of work in computer vision, really from the 2000s to the early 2010s-ish, that used this idea of feature representations for all kinds of tasks. And I don't really think it's useful to go into super great detail on any of these particular feature representations because, spoiler alert, they got deprecated like 10 years ago. But it's useful to have a flavor for what they might have looked like. So one example of a kind of feature representation that people sometimes used is this notion of a color histogram. Here, maybe we think that somehow the distribution of colors in our image might be a useful thing for a classifier to look at or care about, right?
[00:16:36] Because maybe you're building a fruit detector, an apple detector, and you want to know if it's ripe or not, maybe telling a red apple from a green apple. Knowing how much red or green is in the image might be something that we as humans think is useful for the network to know when making its classifications. So here we could try to build a feature representation that captures that intuition. What we might do is take the space of all possible colors, discretize that space into some set of buckets, and then, for every pixel in our image, map that pixel to one of the discrete buckets in our color space. Then our feature representation basically becomes something like a count of how many pixels in the image fall into each color bucket.
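The bucket-and-count idea sketched here fits in a few lines. This is an illustrative implementation, assuming an RGB image stored as an (H, W, 3) uint8 array; the bucket count of 8 per channel is an arbitrary choice:

```python
import numpy as np

# Color histogram feature: discretize each channel into `bins` buckets,
# count pixels per joint color bucket, normalize. All spatial structure
# is thrown away; only the color distribution survives.
def color_histogram(img, bins=8):
    # map each 0..255 channel value to a bucket index 0..bins-1
    idx = (img.astype(np.int64) * bins) // 256            # (H, W, 3)
    # combine the three channel buckets into one joint color bucket
    joint = (idx[..., 0] * bins + idx[..., 1]) * bins + idx[..., 2]
    counts = np.bincount(joint.ravel(), minlength=bins ** 3)
    return counts / counts.sum()                          # normalized

img = np.zeros((4, 4, 3), dtype=np.uint8)   # all-black toy "image"
feat = color_histogram(img)
assert feat.shape == (512,) and feat[0] == 1.0  # every pixel in bucket 0
```

Shifting content around in the image leaves this feature unchanged, which is exactly the spatial-structure-destroying behavior described next.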
[00:17:12] And now this is kind of an interesting representation, because it destroys all the spatial structure of the image and talks only about the color distributions. So if you had red in one corner versus red on the other side, those two images would look the same to this color-histogram feature, but they would look very different from the raw-pixel perspective. So the color histogram is one basic kind of feature extractor, or feature representation, that you can build on images, and it looks only at color, not at spatial structure at all.
[00:17:43] Another category of feature representations that people used to look at is sort of the dual to that: these histograms of oriented gradients. I don't think it's useful to talk too much about exactly how these are computed, but the intuition is that they basically throw away the color information and look only at the structure information. They basically want to ask, for every point in the image, what is the local direction of the edges in the image around that local region? So here you can see that for the leaves around this frog, it kind of extracts diagonal types of features, because they correspond to these diagonal structures over here. Or, around the frog's eyes, you can see it sort of captured those circular structures.
[00:18:22] So again, it's not super useful to see exactly how this is computed, but it's useful to know that these are the kinds of features that people designed for images maybe a decade or a decade and a half ago. And people combined these in all kinds of complicated ways. You might wonder: oh, what's the best feature representation? The usual answer was just to stack them all together. So a pretty common approach would be to take a bunch of different feature representations, extract them all from your image, and then concatenate them all into one big feature vector. And that becomes the feature representation for your image. And now you could imagine, once we have this feature representation, we can basically stick whatever kind of classifier we want on top of it.
[00:19:02] And it's interesting to then take a step back and contrast those two viewpoints of the whole system. So system A is a feature extractor plus a learned linear classifier on top of your features, and system B is end-to-end neural networks. And they actually don't look that different, if you take a step back and think about it in the right way. Both of these systems are ultimately inputting the raw pixels of the image and outputting some scores or predictions about the image. The difference is which part of the system is designed by humans versus which part is learned via gradient descent. In the feature-extraction-plus-linear-classifier paradigm, the feature extraction portion of the system is designed; that could be a bunch of really hairy C code or hairy MATLAB code.
[00:19:49] And you don't want to think about the details of what's going on inside of that. And then the part that you're learning via gradient descent, the part that you're learning from your training data, is just that classifier on top of the feature extractor. Whereas the neural network approach is basically saying: gradient descent is probably a better programmer than you, and lots of data probably knows more about your problem than you do. So the intuition of these neural network classifiers is that there's still ultimately going to be a system that inputs the raw pixel values and spits out your classification scores at the end. But the difference is that all parts of that system, from the raw pixels all the way to the final classification scores, will be tuned via gradient descent and will be learned from your training data set.
[00:20:32] So the intuition is that in this feature extraction paradigm there might be some bottlenecks. You as a human might get something wrong. You might have wrong intuitions about what parts of the problem are important and what things are not, or it might be really hard for you to write down the perfect feature extractor that solves your problem. And this end-to-end learning approach of convnets, and really of deep learning more generally, is just saying that data and compute can likely solve that problem better than you as a human designer can. And this paradigm has basically won out over the past decade and a half, for lots and lots of problems, repeatedly. Okay.
[00:21:08] So that kind of gives an intuition. So then the question is: for the particular problem of images, how should we design these end-to-end systems? It's not going to be a fully connected network all the way; that would be a little bit silly. We do still need to put a little bit of design into the system, right? But the difference between designing a neural network and designing a feature extractor is that in designing a neural network, you're not designing one particular feature-extractor function. You're kind of defining a whole category of functions, where the category is defined by the structure of your computational graph, by the sequence of operators that get run. But there's some flexibility in that system, because you're leaving the weights of the system free to be learned from data.
[00:21:46] But the role of the human designer still matters. You still need to decide: what is the architecture of your network? What is the sequence of operators that get stitched into a computational graph? What are the sizes of all the matrices involved at every stage of processing? So there still is a lot of room for the human to design parts of the problem in this deep learning era, but what you're designing is a little bit different. So this is basically where we start to see the deficiencies in the tools that we have so far for solving this problem, right? Because we've seen linear layers, we've seen fully connected networks, and the only kind of neural network architecture that we've seen is to flatten the pixels of our image into a big vector.
Do a matrix multiply, do a ReLU, do more matrix multiply, do more ReLU, and that's about it; that's all we know how to do at this point. [00:22:34] And one big problem with that is that it destroys the spatial structure of the images. There's this big problem, right? Images are actually not one-dimensional objects. Images are two-dimensional; they have two-dimensional structure, and that two-dimensional structure matters for the content of those images. And when you build a linear classifier on raw pixels by stretching them out into a big vector, you're basically ignoring that important aspect of your input data in the design of your neural network architecture. So when we think about designing neural network architectures for images in particular, we want to think: what are other designs for our network?
[00:23:09] What are other computational primitives we can slot into our computational graphs that better respect the two-dimensional structure of images? And that leads us to convolutional networks, right? Convolutional networks are basically a category of neural network architectures built from linear layers, nonlinearities, convolution layers, pooling layers, and sometimes a couple of others, stitched together into architectures that input raw pixel values and output some kind of prediction or scores for our images. The general structure is that they'll usually have some prefix, the body of the network, which is an interleaved sequence of convolution layers, pooling layers, and nonlinearities that can be thought of as extracting a useful feature representation for the image.
[00:23:54] And then on top of that there will usually be some kind of fully connected layers, sometimes as few as one but sometimes more than one, which you can think of as a multi-layer perceptron classifier that lives on top of and ingests the features from the convolutional portion of the network. But crucially, this whole system is tuned end to end via gradient descent by minimizing the loss on your training data set. And these networks actually have quite a long history.
[00:24:25] This particular convnet architecture that we've drawn on the screen actually comes from a paper back in 1998 by Yann LeCun, Léon Bottou, and others, who were building these convolutional neural networks all the way back in 1998 to perform the task of digit classification. And it actually worked pretty well, but it was really expensive: they didn't have GPUs, they didn't have TPUs, they didn't have the kind of compute resources we have today. But the underlying algorithm, the underlying network architecture, looks pretty similar in 1998 to the kinds of architectures that people were using well into the 2010s. And then zooming forward from 1998 up until 2012, that's when the AlexNet architecture came out, and this was kind of a big boom, a giant explosion of deep learning
[00:25:12] especially in computer vision. I think we talked about this in some earlier lectures, but the AlexNet architecture again doesn't look that different from that LeCun LeNet architecture from 1998: it's a bunch of convolutional layers and fully connected layers. It's bigger, there are more layers, and the layers have more units in them, but it's still trained end to end with backpropagation to minimize some fairly simple loss functions. But AlexNet was when things really started to take off. At this time they were able to train on GPUs, GPUs were available, and there was more data available because of the internet, because of ImageNet. So the era from about 2012 to around 2020 was an era where convolutional networks were basically dominating almost every problem in computer vision.
[00:25:57] Basically, for any kind of problem you wanted to solve with an image in that era, it was almost certainly going to be a convnet that had the best performance on that problem. This included tasks like detection, on the left, which is the task of not just classifying an image but drawing a box around all the objects in the image and putting a category label on each box. Segmentation is the task of assigning labels not at the box level or the image level, but at the pixel level: now we want to assign a category label to every pixel in our image. We'll talk more about architectures for these problems in future lectures, but these can be solved very effectively using convolutional networks.
[00:26:38] People used convnets for other kinds of problems involving language as well. For the task of image captioning, where we want to predict a natural language caption from an image, some of the first widely successful approaches were also built on convolutional networks. And the same goes even for some more recent tasks in generative modeling. Captioning is basically the problem of image-to-text, where we input an image and want to output a natural language sentence describing it. We can also think about the inverse problem of text-to-image generation, where we input a natural language description of something we're imagining in our head and have the system generate a new image from scratch that hopefully matches our input description.
[00:27:23] Some of the first really widely successful versions of this were also built on convolutional networks. This particular figure is from the Stable Diffusion paper that came out back in 2021. This technology has gotten a lot better in the last couple of years, and we'll talk more about that in some later lectures, but it's useful to point out that the first versions of this that started to work really well were also built on convolutional networks. So basically, convolutional networks were so important to the history of computer vision that the initial version of this class, which we started way back in 2015, was actually called Convolutional Neural Networks for Visual Recognition, because at the time convolutional networks were basically synonymous with computer vision.
[00:28:06] And computer vision was basically the biggest field benefiting from deep learning at that time. So in setting out to teach a class about deep learning, it actually made a lot of sense to focus entirely on convolutional networks for image problems. That's basically the inception of this class 10 years ago. But the field has evolved a lot since then, right? Convolutional networks have actually gotten replaced, and beyond visual recognition there are a lot of other interesting problems we can solve now. So you'll notice that the name of the class changed at some point along the way, to no longer focus so specifically on convolutional networks. And the reason for that is, you know, I said this was the era from 2012 to 2020.
[00:28:46] You might be wondering what happened in 2020, other than COVID, that could have displaced convolutional networks. It wasn't COVID; it was transformers. Transformers are this alternate neural network architecture that we'll talk about in a couple more lectures, but basically they started off in natural language processing, for processing documents, for processing text strings. The transformer architecture was first published in 2017, and for a couple of years after that it mainly stayed in the regime of processing text. But there was a really important paper in 2021 that took nearly the exact same transformer architecture that had been getting used to process text strings and instead used it to process images in nearly the exact same way.
[00:29:26] And since then, people have found that for a lot of the problems we just talked about that were previously solved using convolutional networks, you can replace the CNN with a transformer, keep everything else the same, and tend to get better performance on these problems. They scale up to more data, they scale up to more compute. So these are much more commonly used for more and more computer vision problems these days. We'll talk much more about transformers in lecture 8, but I thought it would be weird to be pitching convnets super hard when they actually don't get used quite as much nowadays as they did maybe five years ago. But I still think it's really useful to talk about convolutional networks.
[00:30:06] One, because there's a lot of historical significance. Two, these algorithms still do get used quite a lot in practice. Three, it helps you build intuitions about what's important for images. And four, they're actually not completely dead, right? A lot of times we're actually building hybrid systems: sometimes we use convolutions, sometimes we use transformers, sometimes we mix them together in various ways. So it's actually still super useful to know about this stuff. So then, basically, the rest of today we're going to talk more about convolutional networks. We said that a convolutional network is just a computational graph for processing images that's built from a couple of different primitives. We've already met the fully connected layer and the activation function.
[00:30:44] So we basically need to walk through two more layers: the convolution layer and the pooling layer. Quick recap of the fully connected layer, which we've already talked about in the context of linear classifiers. With our fully connected layer, basically what we do is take the pixels of our image. Our image is this three-dimensional tensor, 32x32x3: 32x32 are the two spatial dimensions, and 3 is the channel dimension for your RGB colors. So we take that 32x32x3 tensor and stretch it out into a long vector of 3072, because that's the number you get if you multiply those out in your head. Then you have this vector of 3072 numbers, and we have a weight matrix that's 3072 by 10 in this case, because 10 is the number of output classes we want.
[00:31:29] You do a matrix-vector multiply between those two, and you end up with a vector of 10 numbers giving us our class scores. And in trying to generalize this from fully connected layers to convolutional layers, it's useful to think a little bit more about the structure of what this fully connected layer is doing. The output vector contains 10 elements. Each of those elements is a single number, and each of those numbers is predicted by computing an inner product between one of the rows of your weight matrix and the entire input vector. So each entry you should basically think of as a dot product, and a dot product you should basically think of as a template match.
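The recap above can be sketched in a few lines of NumPy, with random values standing in for a learned weight matrix and bias. One layout note: the transcript describes the weight matrix as 3072-by-10; here it is stored as 10x3072 so that a plain matrix-vector multiply produces one score per row (per class template), which is the same computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# A CIFAR-style input: 32x32 pixels, 3 RGB channels.
image = rng.random((32, 32, 3))

# Fully connected layer: flatten the image, then one matrix-vector multiply.
x = image.reshape(-1)                # 32 * 32 * 3 = 3072 numbers
W = rng.standard_normal((10, 3072))  # one row (one "template") per class
b = rng.standard_normal(10)

scores = W @ x + b                   # each score: dot(row template, input)
print(scores.shape)                  # (10,)
```

Each of the 10 scores is literally `np.dot(W[k], x) + b[k]`, which is the template-matching view developed next.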
[00:32:07] Because the dot product between two vectors is high when the two vectors point the same way, and it's zero when the two vectors are orthogonal. So anything built on dot products is basically a kind of template matching. The way you should think about these fully connected layers is that we have a set of templates, each template has the same size as our input, and the output is the template-matching score between each one of our templates and the entire input. So once we think about it that way, there's actually a nice way we can generalize this from fully connected layers into convolutional layers. And that's by saying: we're still going to have this notion of template matching.
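The aligned-versus-orthogonal claim is easy to check with two hypothetical 2-D vectors:

```python
import numpy as np

template = np.array([1.0, 0.0])
aligned = np.array([2.0, 0.0])     # points the same way as the template
orthogonal = np.array([0.0, 3.0])  # at 90 degrees to the template

print(np.dot(template, aligned))     # 2.0 -> high score, strong match
print(np.dot(template, orthogonal))  # 0.0 -> no match at all
```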
[00:32:42] We're still going to have this notion of learning a bank of filters, but what we're going to change is that those templates are no longer going to have the same shape as the input. Instead, our filters will only look at a small subset of the input. So, more concretely, rather than stretching out our image into a big vector of 3072 numbers, we're going to maintain the 3D spatial structure of our image. It's going to be a three-dimensional tensor: three channels (sometimes called the depth or channel dimension), width 32, height 32. And now one of our filters is going to be a tiny little sub-image, a tiny low-resolution image, in this case a 5x5 pixel image. And importantly, that small filter needs to have three channels.
[00:33:30] The channels of the filter are always going to span the same number of channels as the input, but the spatial size will be smaller. And now what we're going to do is compute dot products. We think about that small filter as a little image-chunk template, and we're going to slide it everywhere across the image and say, for every point in the image, how much does that sub-part of the image match this template that we're learning in our convolutional filter? So we'll plop that convolutional filter down at some chunk of the image. That 5x5x3 filter will line up with some 5x5x3 chunk of the image at that spatial location. We'll compute an inner product between those two, and that will give us one single scalar number telling us how much that chunk of the image aligns with our template.
[00:34:13] And now we'll repeat that process and slide that template everywhere in our image. Every place we plop down that template, we'll again compute this template-matching score that says how much that piece of the image aligns with that one template. And as we slide that filter everywhere on the input image, we're going to collect all of those template-matching scores into a plane, right? That plane will be two-dimensional: every point in the plane corresponds to how much the corresponding piece of the input image aligned with our filter. But of course, this is deep learning. We want a lot of compute, and how do we get more compute? We have more filters.
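Before adding more filters, the single-filter sliding described above can be sketched as an explicit (deliberately naive) NumPy loop, with random data standing in for a real image and a learned filter:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))        # height x width x channels
filt = rng.standard_normal((5, 5, 3))  # one 5x5x3 filter (the template)

# Slide the filter over every valid position; each stop produces one
# scalar template-matching score (an inner product).
out_h = out_w = 32 - 5 + 1             # 28 valid positions per dimension
response = np.empty((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        chunk = image[i:i + 5, j:j + 5, :]     # 5x5x3 chunk under the filter
        response[i, j] = np.sum(chunk * filt)  # inner product -> one scalar

print(response.shape)                  # (28, 28): one plane of scores
```

Real implementations vectorize this, but the loop makes the "plop down, dot product, slide" picture literal.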
[00:34:55] So now we'll add a second filter and repeat the whole process again with another filter. We had a 5x5x3 filter that we colored in blue; now let's imagine a second filter, colored in green. Our second filter will still be 5x5x3, and we'll repeat the exact same procedure of sliding that green filter everywhere on the image, computing template-matching scores between the green filter and little sub-pieces of the image, and then collecting all of those scores in a second plane telling us, for every point in the image, how much it responded to the green filter. And now we can basically iterate this and add as many filters as we want. So in this case we are drawing six filters, each of them 3x5x5.
[00:35:43] So then we can collect all of those filters into a single four-dimensional tensor. That four-dimensional tensor has six as its leading dimension, because we have six filters, and then 3x5x5 is the shape of each image template, the chunk that we're learning. And now the convolution layer takes as input our three-dimensional image and our four-dimensional bank of filters, slides all the filters everywhere in the image, and gives us these response planes. Once we collect all those response planes and stack them up along a third dimension, our output has size 6x28x28, where 28x28 should be interpreted as spatial dimensions and the six is a channel dimension.
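The sliding and stacking just described can be sketched directly as an unoptimized loop; this is an illustrative sketch, not the lecture's code, with random values standing in for learned filters:

```python
import numpy as np

image = np.random.randn(3, 32, 32)      # a 3x32x32 RGB input image
filters = np.random.randn(6, 3, 5, 5)   # bank of six 3x5x5 filters

# Slide every filter over every spatial position (stride 1, no padding).
out = np.zeros((6, 28, 28))
for f in range(6):
    for i in range(28):
        for j in range(28):
            # Template-matching score: inner product of the filter with
            # the 3x5x5 chunk of the image underneath it.
            out[f, i, j] = np.sum(filters[f] * image[:, i:i+5, j:j+5])

print(out.shape)  # (6, 28, 28): six response planes stacked on the channel axis
```

Real frameworks compute the same thing with fast vectorized kernels; the triple loop is only there to make the sliding explicit.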
[00:36:27] And of course, just as we do with linear layers, we'll often add a learnable bias vector to our convolutional layers. In a linear layer, the bias is one scalar per row of the weight matrix; correspondingly, in a convolutional layer we'll typically have one scalar bias value for every one of our convolutional filters. That means we'll have a six-dimensional bias vector in this setting. Yeah, the question was clarifying that the three is the RGB channels; that's correct. The question is how do you get the filters: back to the miracle of gradient descent and backpropagation. The idea is that we're defining this operator.
[00:37:08] This operator is going to take an input image and a set of filters, but no human is going to define what those filters are. Instead, we're going to initialize those filters randomly, and then they'll be learned via gradient descent on whatever problem you're trying to solve. That's actually a really important thing to keep in mind, and it's what gives these layers their power: we're defining this fairly computationally expensive layer, but we're expecting that it'll be filled in with the data and compute from our training. The question is how do you set the five? That's a hyperparameter. We talked about hyperparameters and cross-validation a couple of lectures ago; these would be architectural hyperparameters that you would typically set via cross-validation in some way.
[00:37:47] Yeah, good question: does it make sense to have different sizes of filters? As we'll see in the CNN architectures lecture next lecture, I think you're going to talk about Inception, and sometimes you actually do have that. But there's kind of a nice API design problem here: what is a primitive in your computational graph, versus what is an emergent structure built out of primitives? We usually define a single convolutional layer as having a fixed filter size, because that makes it easier to compute and to write efficient GPU kernels.
[00:38:18] But you can effectively have multiple filter sizes by stitching together a computational graph that combines convolution layers with different filter sizes in a larger network structure. So yes and no is the answer to your question. The question is: what are we learning? It's very important to distinguish between a parameter and a hyperparameter. A hyperparameter is something that we set before we start training the network. In this case, the hyperparameters would be the number of filters and the size of those filters, because those set the shapes of our tensors. A parameter is a value that we're going to set and optimize over the course of gradient descent. So the number of filters, the number of output channels, and the size of those filters will be hyperparameters.
[00:39:01] We set those once before we start training. At the beginning of training, we'll randomly initialize the filters; that gives us a fixed-shape tensor, and then the values inside that tensor will float around and change over the course of optimization. Those are parameters, because they get set via gradient descent. Yes, the question is: what gradient are we computing? Whenever you do backpropagation, you're always computing the gradient of the loss with respect to things inside the network. In this case, we'll be computing the gradient of the loss with respect to our convolutional filter weights.
[00:39:35] So the gradient is saying: for every individual scalar inside every one of our filters, if we wiggle that scalar a little bit, how much is the loss going to change? We're always computing the gradient of the loss with respect to our convolutional filters. The question is basically: what do we do with the bias? The bias gets added to each of our inner products. We always compute the inner product of one of our filters against a chunk of the image, and then add the corresponding scalar from the bias. The bias is a vector, but the number of entries in the vector is equal to the number of filters, so each entry in the bias gets broadcast across the entire spatial dimension of the output.
[00:40:17] But each bias only gets used for one filter. Conceptually: you slide one filter everywhere, and that gives us a two-dimensional plane of activations; if you have a second filter, you get a second plane of activations. Those are independent operators: step one, slide the first filter everywhere; step two, slide the second filter everywhere. Every filter gives rise to a plane that we call an activation map, and then we stack all of those up, and that's the operation of the convolution layer. The question is, yeah, basically every time we do gradient descent, it's going to change the filters, right?
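The "wiggle that scalar a little bit" picture can be checked numerically with a finite difference; the filter, patch, and loss below are made up for illustration and are not the lecture's network:

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal((3, 5, 5))   # one 3x5x5 chunk of an image
w = rng.standard_normal((3, 5, 5))       # one convolutional filter
target = 1.0                             # made-up regression target

def loss(w):
    # One template-matching score, squared error against the target.
    score = float(np.sum(w * patch))
    return (score - target) ** 2

# Analytic gradient of the loss w.r.t. every scalar in the filter.
grad = 2.0 * (float(np.sum(w * patch)) - target) * patch

# Wiggle a single scalar by epsilon and see how much the loss changes.
eps = 1e-6
w_plus = w.copy()
w_plus[0, 0, 0] += eps
numeric = (loss(w_plus) - loss(w)) / eps

print(abs(numeric - grad[0, 0, 0]) < 1e-3)  # True: the two slopes agree
```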
[00:40:55] So whenever you imagine training a neural network, it's always this loop: while true, get a batch of data, send your data through the network in a forward pass, compute the loss, do a backward pass to compute the gradient of the loss, and then make a gradient step using your optimizer. It's always data, forward, loss, backward, step, and every time you do a step, it makes a change to the filters.
[00:41:16] All right, so I swung the other way: I said more questions, and I got too many questions. But that's good; we'll equalize in here. Okay, so we talked about the convolution layer. It's actually pretty common to run the convolution layer in batched mode. Rather than working on one input image, we'll actually work on a batch of input images. This is kind of nice, because it makes everything four-dimensional.
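That data / forward / loss / backward / step loop can be sketched on a one-parameter toy model (purely illustrative; a real network would use an autograd framework rather than a hand-written gradient):

```python
# Toy training loop: fit w so that w * x approximates y = 3 * x.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # made-up dataset
w = 0.0          # randomly-initialized "filter" (here just one scalar)
lr = 0.02        # learning rate for the gradient step

for epoch in range(200):
    for x, y in data:              # get a batch of data
        pred = w * x               # forward pass
        loss = (pred - y) ** 2     # compute loss
        grad = 2 * (pred - y) * x  # backward pass: dloss/dw
        w -= lr * grad             # gradient step: changes the "filter"

print(round(w, 3))  # 3.0: the loop has driven w to the right value
```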
[00:41:37] Now we have a four-dimensional tensor of inputs, which is a set of input images. We have a four-dimensional tensor of filters, which is a set of filters, each of which is a three-dimensional chunk of an image. And the output is a four-dimensional tensor which is a set of outputs, one output per image; each image's output is a three-dimensional tensor giving a stack of feature planes. You have to start to think in lots of dimensions when you build neural networks, and that's actually kind of fun. So here's the general formulation of a convolution layer. In general, you take as input a four-dimensional tensor of shape N x C_in x H x W, which is a set of N images, each of which has C_in channels.
[00:42:19] For an RGB image that'll be three, but in general we might have more than three channels; it can be arbitrary. H and W are the spatial size of our input images. Our convolutional filters will be a four-dimensional tensor of shape C_out x C_in x KW x KH. C_out is the number of filters, the number of output channels, and the rest is a set of three-dimensional filters: each three-dimensional filter has shape C_in x KW x KH, where KW and KH are the kernel width and kernel height. We have C_out such filters, collected into a four-dimensional tensor. As output, we produce a four-dimensional tensor again, whose shape is N for the number of images, one output per image, by C_out: each output consists of C_out feature planes, one per filter.
[00:43:11] Each of those planes is going to be H' x W'. And this is the general formulation of a convolution layer.
[00:43:19] A convolutional network is then just a computational graph that includes a bunch of convolution layers. In practice, we'll tend to stack up a bunch of convolutional operators one after another, and that stack will be a convolutional network. So this was a simple convnet: we start with an image that's 3x32x32, then we have a convolution layer with six filters, each of which is 5x5x3. The first convolution gives us a new three-dimensional set of activations for that one image: six channels, matching the six filters, by 28x28, because the spatial size changed a little bit through the convolution.
[00:43:58] Then we have another convolution with 10 filters, each of which is 5x5x6. The 10 gives us the number of output channels in the next layer, and the six is the number of input channels, which needs to match the channel dimension of the convolution's input. So you can see that you can stack up a bunch of these convolution layers and perform a lot of computation. But there's actually a problem in exactly this network architecture design, and can anybody spot it? Sizing? That's a problem, but not the one I had in mind. They're local? That's another good problem, but not the one I had in mind. Actually, those two we'll be able to fix pretty easily in a couple of slides, but I had a different problem in mind. A lot of memory?
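The shapes in the example above follow from the stride-1, no-padding output-size rule H' = H - K + 1; the lecture uses these numbers but doesn't state the formula, so it's spelled out here as an assumption:

```python
# Shape bookkeeping for the convolution layer described above, assuming
# stride 1 and no padding.
def conv_shapes(n, c_in, h, w, c_out, kh, kw):
    # Input:   (n, c_in, h, w)        -- a batch of n images
    # Filters: (c_out, c_in, kh, kw)  -- c_out three-dimensional templates
    # Output:  (n, c_out, h', w')     -- one stack of c_out planes per image
    return (n, c_out, h - kh + 1, w - kw + 1)

# The simple convnet above: a 3x32x32 image through six 5x5x3 filters,
# then ten 5x5x6 filters.
print(conv_shapes(1, 3, 32, 32, 6, 5, 5))   # (1, 6, 28, 28)
print(conv_shapes(1, 6, 28, 28, 10, 5, 5))  # (1, 10, 24, 24)
```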
[00:42:40] That is a problem, but not one we can fix; you've just got to buy a bigger GPU. The number of filters increases? I don't think that's necessarily a problem; that's okay. Ah, everything's linear. Yes, that is a problem. We said that convolution was dot products; the dot product is a linear operator, and the composition of two linear operators is still a linear operator. That means that if we have two convolution layers stacked directly on top of each other, they actually have the same representational power as a single convolution layer, because of the linearity of the operator. There's actually a very simple fix: add an activation function. Exactly. It's the same bug that we saw in multi-layer neural networks, and the same fix: we need to add a nonlinear activation function in between our convolutional layers.
[00:45:23] This introduces nonlinearity into the network architecture and increases the representational power of the network that we're learning. So in general, convnets are going to be some stack of convolution layers, nonlinearities, and other kinds of layers in our computational graph. There was a question earlier about what the convolutional filters learn. We can view this by analogy with what we already did with linear classifiers. In linear classifiers, we had the intuition that each row of the learned weight matrix could be visualized, could be thought of, as a template that has the same shape as the whole input image.
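The linearity problem just described can be demonstrated with plain matrices, since convolution is itself a linear operator and the same algebra applies; a small illustrative numpy sketch, with matrices standing in for conv layers:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 4))   # stand-in for the first conv layer
W2 = rng.standard_normal((4, 4))   # stand-in for the second conv layer
x = rng.standard_normal(4)

# Two stacked linear layers collapse into the single linear layer W2 @ W1,
# so stacking them adds no representational power.
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True

# With a ReLU in between, the composition is no longer a single matrix:
relu = lambda v: np.maximum(v, 0.0)
nonlinear = W2 @ relu(W1 @ x)  # differs from one_layer whenever W1 @ x has negatives
```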
[00:45:58] Now with a convolutional filter, you can think of it the same way, but each filter, rather than extending over the entire spatial size of the input image, is just a small sub-chunk of an image. So we can actually visualize the first-layer convolution filters of a trained neural network. These are the first-layer convolution filters learned by an AlexNet architecture that was trained for image classification on ImageNet. Each of these is basically a little chunk of an RGB image; these are the little templates that get slid around the input image in the first layer of the AlexNet architecture.
[00:46:34] And, you know, the fact that this was AlexNet, the fact that this was trained on ImageNet, the fact that this was classification: it turns out just about all convolutional networks end up learning filters that look something like this, on almost all problems, datasets, and tasks, as long as they're reasonable tasks. The thing we see is that we often learn two kinds of filters here. One kind tends to be looking for colors, especially opposing colors: you'll see this one is looking for a contrast between green and red, and we also see colored blobs, like pink and green blobs. The other category of filter we tend to see is looking for the spatial structure of the images: this one is looking for a vertical edge, this one a horizontal edge.
[00:47:16] Some of these are looking for diagonal edges. So they tend to look for colors and edges in these little local neighborhoods of our input images. We can play this trick on the first layer of the convolutional network and just visualize the filters directly as images. It gets a little trickier to visualize the higher layers in the network, and I'm just going to present this figure without too much explanation. But higher layers of the network tend to learn larger spatial structures of our input image. Here the visualization is: each row represents a filter in a learned network, and each column represents some piece of an input image that that filter was responding strongly to. So the visualization here is a bit different from the previous slide.
[00:48:01] So these are all basically chunks of input images that a filter was responding to. And here you can see that in this sixth-layer convolution, one of these filters feels like it's responding maybe to eyes. This one looks like maybe it's responding to pieces of text. This one looks like maybe it's responding to wheels, or circles, or top halves of circles, something like that. And again, this all gets driven by training on your large-scale data sets via gradient descent. Nobody is sitting down and designing these filters by hand. And like I said, visualizing these higher-layer filters is a bit tricky and more involved. The question was: if you look at all the responses to the filters, can you reconstruct the original image? Actually, it turns out you can do that.
[00:48:43] And the way that you do that is also gradient descent. Gradient descent is really powerful, and that's something we'll talk about in a couple more lectures, some mechanisms that do that. Oh, that's a good question: how do the filters get differentiated? That actually comes down to the random initialization, right? So it's really important that the way you initialize your filters is random, and crucially that you have a different initialization for each filter when you start training your network, because that's going to break the symmetry between the filters. If all the filters are exactly the same and the loss is the same, then the gradient is going to broadcast back and be the same on all the filters. So if you initialize them the same, they will stay the same.
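The symmetry argument above can be made concrete with a tiny toy model (an editor's sketch, not code from the lecture): two "filters" whose responses are summed receive identical gradients, so identical initialization leaves them identical forever, while random initialization keeps them distinct.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)      # one fixed toy input
target = 1.0

def step(w1, w2, lr=0.01):
    # Toy model: output is the sum of two filter responses, squared-error loss.
    y = w1 @ x + w2 @ x
    grad_y = 2.0 * (y - target)
    # Both filters see the same upstream gradient and the same input,
    # so their gradient updates are identical.
    return w1 - lr * grad_y * x, w2 - lr * grad_y * x

# Identical initialization: the filters stay identical after training.
a1 = a2 = np.ones(5)
for _ in range(20):
    a1, a2 = step(a1, a2)
print(np.allclose(a1, a2))      # True: symmetry never breaks

# Random (different) initialization: the filters remain distinct.
b1, b2 = rng.standard_normal(5), rng.standard_normal(5)
for _ in range(20):
    b1, b2 = step(b1, b2)
print(np.allclose(b1, b2))      # False
```

Note that the updates themselves are always identical here; only the starting points differ, which is exactly why the initialization carries the symmetry breaking.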
[00:49:19] But if you initialize them to be different, then you'll break the symmetry and they can learn different features. Yeah, basically the human designer of the network needs to write down the sequence of operators and the sequence of channels, and that's the question of neural network architecture design that we'll talk a little bit more about in the next lecture. Good question: why do the deeper layers visualize larger structures? That actually has a bit to do with receptive fields, which we have a slide on in a little bit, so maybe we'll get there and some of these questions will get answered. So one thing that already came up is how we look at the spatial dimensions of these convolutions.
[00:49:55] So I wanted to take a closer look at exactly how we compute the spatial dimensions of our convolutions. In this case, we've taken this picture of a convolution, rotated it 90 degrees, and dropped the channel dimension. So now the channel dimension is going into the board, and we have our 7x7 spatial dimensions. So here we're looking at an input that's 7x7 in spatial size, and we have a 3x3 conv kernel. Then the question is: how big is our output going to be? Well, 1, 2, 3, 4, 5, right? So our output is going to be 5x5, because we can slide that filter and plop it down in five different places. And then we can generalize: if our input has length W and our conv filter has length K, then our output is going to be W - K + 1.
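As a quick sanity check of that formula (an editor's sketch, not code from the lecture), we can count the valid filter positions directly:

```python
# Output spatial size of a convolution with no padding and stride 1:
# a length-K filter fits at W - K + 1 positions along a length-W input.
def conv_out_size(w: int, k: int) -> int:
    return w - k + 1

# Brute-force count of valid placements agrees with the formula.
def count_positions(w: int, k: int) -> int:
    return sum(1 for start in range(w) if start + k <= w)

print(conv_out_size(7, 3))      # 5, the 7x7 input / 3x3 kernel example
print(count_positions(7, 3))    # 5
```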
[00:50:38] And you can sit down and convince yourself that that's the right formula. But there's kind of a problem that a couple people already pointed out: your feature maps are going to shrink in spatial size as you go through this convolution. That's kind of annoying. You could actually work with that, and there are some neural network architectures that deal with it. But sometimes we're lazy and we just want to keep the same size for everything, because that's basically simpler for human designers to think about. And one trick we use there is something called padding. So here it's common to add additional virtual data around your true input data, basically extra zeros, before you compute the convolution operator.
[00:51:21] And this basically lets us solve the shrinking-feature-map problem, because now if we add padding of P, in this case padding P = 1, so we're adding one pixel of zeros all around everywhere, then we basically add 2P to our output size. So in particular, if you have a 3x3 conv and you add padding of one, then your feature map is going to stay the same size, and that's convenient.
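In terms of sizes, the padding trick looks like this (a sketch assuming P zeros of padding on every side):

```python
# With P zeros added on every side, the effective input length is W + 2P,
# so the output size is (W + 2P) - K + 1. Choosing P = (K - 1) // 2 for an
# odd kernel size K gives "same" output size.
def conv_out_size(w: int, k: int, p: int = 0) -> int:
    return w + 2 * p - k + 1

print(conv_out_size(7, 3, p=1))    # 7: a 3x3 conv with padding 1 keeps the size
print(conv_out_size(32, 5, p=2))   # 32: same idea with a 5x5 kernel
```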
[00:51:47] Now, if you've taken signal processing, there actually are some problems here, right? This can lead to weirdness from a signal processing perspective, but we'll ignore that and just look at the sizes and shapes of the tensors, because that's a little bit easier to comprehend. But be aware: why are we putting in zeros, and is that going to cause problems? Yes, it is going to cause problems on the borders, but it seems to be okay in a lot of cases. Okay, so then, like I said, a pretty common setting is to have K be an odd number and then have P be (K - 1) / 2, because that's going to mean that your spatial size after the convolution is the same as the spatial size before the convolution. Okay. Then the next interesting thing to think about is this notion of receptive fields.
[00:52:30] Someone was asking over here why the deeper layers learn larger structures. That's actually sort of inherent in the way that convolutions are built. So thinking about a single convolution, each output is looking at a local region of the input. So by design, the output of one convolution at the first layer can only be looking at a piece of the image which is the same size as the convolutional kernel you're learning. But if we build a convnet that stacks multiple convolutions on top of each other, then these receptive fields get magnified through the network. So in this case we're looking at a network with three convolution layers, and we see that in the final layer of activations, each entry depends on a local region in the layer before it.
[00:53:16] But each one of those entries depends in turn on a local region in the layer before it, which depends in turn on a local region in the layer before that. So even though each individual convolution is looking at a local neighborhood in the layer before it, as you stack up convolutions in a bunch of layers, the effective size of the original input that each of those convolutions is looking at grows over the course of the network. And in particular, we call this the effective receptive field. So the effective receptive field of a convolution is basically how many pixels in the original image had the opportunity to influence one activation of the network later on downstream.
[00:53:56] And you'll notice that this effective receptive field basically grows linearly with the number of convolution layers. And there's a potential problem here, because ultimately, when we make classification decisions at the end of our network, we would like those decisions to aggregate global information across the entire image, but you need a lot of conv layers to do that. So a trick there is basically to add some way to increase effective receptive fields more quickly. One way we can do this in convolution is by introducing something called a stride. So here what we're saying is that rather than placing the filter everywhere in the image, we're going to skip some positions. Instead of moving the receptive field by one, we're going to stride it by two instead.
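To see why striding helps, here's a rough sketch (an editor's illustration assuming 3x3 kernels, not from the slides) comparing receptive-field growth with and without stride-2 downsampling at every layer:

```python
# Receptive field (RF) of one output pixel, in input pixels.
# Stride 1: each 3x3 layer adds K - 1 = 2 pixels, so growth is linear in depth.
def rf_stride1(layers: int, k: int = 3) -> int:
    return 1 + layers * (k - 1)

# Stride 2 at every layer: the step between neighboring outputs doubles
# each layer, so the RF grows roughly exponentially in depth.
def rf_stride2(layers: int, k: int = 3) -> int:
    rf, jump = 1, 1                 # jump = input pixels between adjacent outputs
    for _ in range(layers):
        rf += (k - 1) * jump
        jump *= 2                   # stride 2 doubles the step each layer
    return rf

print([rf_stride1(n) for n in (1, 2, 4, 8)])   # [3, 5, 9, 17]
print([rf_stride2(n) for n in (1, 2, 4, 8)])   # [3, 7, 31, 511]
```

With only eight stride-2 layers, the receptive field already covers hundreds of input pixels, which is the point the lecture makes next.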
[00:54:39] So now in this case, we go back to our 7x7 input, 3x3 conv, and do a stride of two. Now what's the output size? 1, 2, 3: 3x3. And then in general, if we have our input W, filter size K, padding of P, and stride S, then we get this kind of ugly formula for the size of the output: (W - K + 2P) / S + 1. Bigger kernels shrink the input, padding adds back some of the missing size, the stride divides the input shape, and then plus one because of some fence-post math. Okay. So strided convolutions are interesting, because if you go back to this picture, when we do a strided convolution it's effectively downsampling the image inside the neural network.
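That "ugly formula" is easy to write down and check (a sketch; floor division handles the fence-post arithmetic):

```python
# General output size with kernel K, padding P, stride S:
# floor((W - K + 2P) / S) + 1.
def conv_out_size(w: int, k: int, p: int = 0, s: int = 1) -> int:
    return (w - k + 2 * p) // s + 1

print(conv_out_size(7, 3, p=0, s=2))   # 3: the 7x7 input, 3x3 conv, stride-2 case
print(conv_out_size(7, 3, p=0, s=1))   # 5: stride 1 recovers W - K + 1
```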
[00:55:27] So when we have a strided convolution, each conv layer is effectively dividing the shape of the feature map, usually by two, and when we stack these, that means you can get exponential growth in the effective receptive field. So if you stack a bunch of conv layers and each of those layers is downsampling by a factor of two, then if you run through a similar exercise, you'll see that the effective receptive field is now growing exponentially in the depth of the network. So that means that with relatively few layers, we can build up a very large effective receptive field that looks at the entire input image. Okay, so here let's work through just one example to make sure that we're all on the same page about convolution. So let's think about an input volume of 3x32x32, and a convolution layer with 10 filters.
[00:56:06] Each of those filters is 5x5, with stride one and pad two. What's the size of the output? I color-coded it because there are a lot of numbers here to keep track of. So here it's 10x32x32. This 32 is actually a different 32 than that 32; that's why they're different colors of blue. This 10 is the number of output channels, and the number of output channels has to match the number of filters. And the spatial size is computed using the formula we just saw. So the input spatial size comes down here; the padding adds to the spatial size; the kernel size of five shrinks the spatial size; the stride is one, so that's trivial; and then add one. And this just so happens to come out to 32.
[00:56:51] So in this case, this follows the same pattern we talked about a couple slides ago, where it's an odd-sized convolutional kernel, in this case five, and the padding is 2. So if the kernel size is 2k + 1, then padding of k means we maintain the same spatial size. Number of learnable parameters: maybe I'll just go through these, because we have a couple more slides to get through. So in this case, the number of learnable parameters is 760, because each filter is basically 3x5x5, plus one for the bias, so we have 76 learnable parameters per filter; we have 10 filters, so it's 760 learnable parameters here. We can also compute the number of multiply-add operations: how much compute does this convolution operator take? So here it's a lot. Well, is it a lot?
[00:57:40] I don't know. You may or may not have a lot of intuition for what is a lot of computation. But in this case, the way I think about computing how many flops, how much compute, a convolution operator takes: we think about the output volume size, which is 10x32x32. And we know that each entry in that output volume was computed via a dot product, in particular between one of our filters and a chunk of our input. So in this case we know the total flops, because the number of outputs is 10x32x32, which is about 10,000, and each of those outputs is computed via a dot product of a 3x5x5 filter with a 3x5x5 chunk of the image, which is 75 elements. So multiplying those together means it takes about 768,000 floating-point multiply-add operations. Okay.
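The whole worked example can be checked in a few lines (a sketch of the arithmetic, not lecture code):

```python
# Worked example: 3x32x32 input, 10 filters of size 3x5x5, stride 1, padding 2.
c_in, w_in = 3, 32
n_filters, k, stride, pad = 10, 5, 1, 2

# Output spatial size: (W - K + 2P) / S + 1.
w_out = (w_in - k + 2 * pad) // stride + 1
print(w_out)                                   # 32: "same" padding preserves size

# Parameters: each filter has c_in * k * k weights plus one bias.
params = n_filters * (c_in * k * k + 1)
print(params)                                  # 760

# Multiply-adds: one dot product of length c_in * k * k per output element.
macs = (n_filters * w_out * w_out) * (c_in * k * k)
print(macs)                                    # 768000
```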
[00:58:27] So then here's the one-slide summary of convolution. I'm not going to walk through this; it's more for you to look at later, but it summarizes all the hyperparameters and formulas associated with convolution layers. If you look in PyTorch, the deep learning framework that a lot of people use, you'll see this convolution layer has all these hyperparameters that we talked about. There are a couple of other interesting hyperparameters that we didn't talk about, called groups and dilation. Dilation isn't really used so much anymore; groups still get used sometimes. Maybe we'll talk about those in a later lecture. You can have other kinds of convolutions too. So we talked about 2D convolution; we can also do 1D convolution,
[00:59:09] where rather than having a two-dimensional signal that we slide a filter over, we now have a one-dimensional signal that we slide a filter over with one degree of freedom; or a three-dimensional convolution, where we have a three-dimensional signal and a three-dimensional filter, and you can slide that filter everywhere in 3D space to convolve with the input signal. So this idea of a convolution really extends beyond just two-dimensional images. Okay, that's basically all about convolution. The last one is pooling. Thankfully, pooling is pretty simple. So pooling layers are basically another way to downsample inside of your neural network. We saw that strided convolution is one way that we can downsample inside of a neural network.
[00:59:48] And downsampling is useful because it lets us build up receptive fields more quickly as we go through the depth of the network. But convolution actually still costs quite a lot of computation; convolution is where most of the flops, most of the compute, happens in a convolutional network. And pooling layers are basically a way to downsample that's very cheap, that doesn't cost a lot of compute. The idea in a pooling layer is: given our three-dimensional tensor, in this case 64x112x112, you should think about that as a three-dimensional volume of features where the spatial size is 112x112 and we have 64 planes, 64 channels of activation. And each one of those planes is a 112x112 image.
[01:00:30] What we're going to do is take each one of those individual feature planes, pull it out from our input tensor, downsample them independently, and then restack them to compute the output. So for this input of 64×224×224, we're going to pull out each of those 224×224 planes independently, downsample it, and then restack them to give the same number of channels but a changed spatial size. What is the method we use for downsampling? Great question. That's actually a hyperparameter; there are a couple of different mechanisms of downsampling that we use. One of the most common ones is called max pooling. In max pooling, what we're going to do is take our single depth slice and divide it up into non-overlapping regions.
[01:01:13] In this case, these are 2×2, and we use the same terminology to talk about these as we do with convolution. So we could say this is a kernel size of 2×2 with a stride of two, because that divides our input into these non-overlapping 2×2 tiles. Then within each of those non-overlapping 2×2 tiles, we take the max entry; in this case that's 6, 8, 3, 4. So you take the max entry inside each of those, and that gives us our spatial compression. And you can imagine a whole set of hyperparameters here: you can change the kernel size, you can change the stride, and you can also change the function that we use for downsampling. Max pooling is pretty common; you'll also see average pooling, and you'll also see anti-aliased downsampling sometimes.
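The 2×2, stride-2 max pooling just described can be sketched in a few lines of plain Python (the input values below are the worked example from the slide, whose maxima are 6, 8, 3, 4):

```python
def max_pool2d(x, k=2, stride=2):
    """Max-pool a single 2D feature plane (a list of lists) with a
    k x k window and the given stride, no padding."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h - k + 1, stride):
        row = []
        for j in range(0, w - k + 1, stride):
            # Max over one non-overlapping k x k tile.
            row.append(max(x[r][c] for r in range(i, i + k)
                                   for c in range(j, j + k)))
        out.append(row)
    return out

plane = [[1, 1, 2, 4],
         [5, 6, 7, 8],
         [3, 2, 1, 0],
         [1, 2, 3, 4]]
print(max_pool2d(plane))  # [[6, 8], [3, 4]]
```

A real implementation would apply this to each of the 64 channel planes independently and restack them, exactly as described above.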
[01:01:57] These are all just ways that you can downsample these feature maps one at a time. Good question: do we make use of padding? Typically you do not use padding inside of pooling layers. There's nothing mathematically preventing you from doing so, but in the case of max pooling it would be kind of silly; it's basically equivalent to a ReLU, so whenever you're using max pooling, if you're also using a ReLU, that would be redundant. So typically we don't use padding in pooling layers. I'm actually not sure if PyTorch has a flag for padding in pooling layers. Yeah, so the stride would be another one of these architectural hyperparameters, but usually you don't tune these things too much.
[01:02:34] Usually the intuition behind a pooling layer, honestly the most common one, is "I want to downsample everything by a factor of two"; that is by far the most common operation. So the most common thing to do would be 2×2 with stride 2. Sometimes you'll do 4×4 with stride two, but basically the most common setting by far is downsampling everything by a factor of exactly two. Oh, that's a very good question: do images all have to be the same input size? In all the language that we're talking about so far, yes. You're going to run into big problems if your input images are not the same size. So the things that you'll typically do to fix that are: one, you resize all your images to the exact same size before you batch them to feed to the network.
[01:03:17] Sometimes you'll also pad your images out with zeros or some other value to make them all the same size, but now padded rather than warped. Or you basically need to run these layers independently for images of different aspect ratios. Another thing that you'll see sometimes in more sophisticated training setups is what's known as aspect ratio bucketing: from your training data, you bucket the images into different aspect ratios, and then each forward-backward pass of the network is on a batch of images of the same resolution and aspect ratio, but each iteration you might grab images with different resolutions or aspect ratios. That's something you'll see in some of the larger production systems. Yeah, so the question is: where do you put these pooling layers?
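Stepping back to the aspect-ratio bucketing just mentioned: a minimal sketch of the grouping step. The bucket values and image shapes here are made up for illustration, not from the lecture:

```python
from collections import defaultdict

def bucket_by_aspect(shapes, buckets=(0.5, 1.0, 2.0)):
    """Toy aspect-ratio bucketing: assign each image's (height, width)
    to the nearest predefined width/height aspect-ratio bucket, so a
    batch can later be drawn entirely from one bucket."""
    groups = defaultdict(list)
    for h, w in shapes:
        aspect = w / h
        nearest = min(buckets, key=lambda b: abs(b - aspect))
        groups[nearest].append((h, w))
    return dict(groups)

shapes = [(256, 256), (240, 480), (512, 260), (300, 300)]
groups = bucket_by_aspect(shapes)
print(groups)
```

A real training loop would also resize every image within a bucket to a common resolution before batching, as described above.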
[01:03:54] These are usually interspersed with the convolution layers. A pretty common pattern for convnets is to intersperse convolution and pooling. For example, you'll see something like conv, pool, conv, conv, pool, conv, conv, pool, fully connected, fully connected; that's kind of a prototypical convolutional network. Yes, that's an excellent question: does this introduce nonlinearity? It depends on the type of pooling operation that you're using. If you're doing max pooling, that's a nonlinearity, so in some networks, if you have max pooling, you may not use a ReLU around that convolution, because max pooling is a nonlinearity itself. If it's average pooling, that's a linear operator, so if you do average pooling, you probably still would want a ReLU there. Okay, so here's my quick one-slide summary of pooling.
[01:04:41] It's basically the same hyperparameters as convolution, except you've got this extra pooling function, which is the mechanism you're using to do the downsampling. Then the last thing I wanted to mention is this notion of translation equivariance. What the hell is that? I said at the beginning of the lecture that we wanted operators that respect the spatial structure of our images, right? And we have this notion that flattening our images into big vectors is somehow not respecting that spatial structure. There's a really interesting property shared by both convolution and pooling, which is one way to formalize this notion of them respecting the 2D spatial structure of images, and that's this notion of translation equivariance.
[01:05:25] It sounds pretty crazy, but the idea is that we can imagine two different branches. Along one branch, we take our image, do a convolution or pooling operator to get an updated image, and then translate the result by shifting that feature map to the side. Then you could imagine changing the order of these two things instead: first translate the image, and then do our convolution or pooling operator on top of the translated image. And it just so happens that in this case the order doesn't matter. If you translate and then convolve, you get the same result as if you had done convolution and then translation, subject to some boundary conditions, blah blah blah.
[01:06:07] Like, in the limit of infinitely large images, ignoring some of these technical conditions. It's really interesting that you can actually swap the order of translation in space versus performing these downsampling or convolution operators. And that bakes in an important intuition about images, which is that when we're processing images, the features that we extract from an image should only depend on the content of the image, and should not depend on the absolute location in the image that content came from. So that means that, you know, if I'm looking this way, it looks like people and benches; if I'm looking this way, it looks like people and benches.
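The swap-the-order claim is easy to check numerically. Here is a toy 1D check (not from the lecture); zero-filled shifting stands in for the "infinitely large image" boundary condition, since the nonzero signal stays away from the edges:

```python
def conv1d(x, w):
    """'Valid' 1D cross-correlation of signal x with filter w."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def shift(x, s):
    """Translate a signal right by s samples, zero-filling on the left."""
    return [0] * s + x[:len(x) - s]

x = [0, 0, 1, 2, 3, 0, 0, 0, 0]
w = [1, -1]

a = shift(conv1d(x, w), 2)   # convolve, then translate
b = conv1d(shift(x, 2), w)   # translate, then convolve
print(a)
print(b)  # identical: the two branches of the diagram commute
```

If the signal ran right up to the boundary, the two branches would differ at the edges; that is exactly the technical condition being waved away above.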
[01:06:46] And whether it's over here on my right or over here on my left, I want to process that data in the exact same way. That's an important intuition, an important structure of images and of the kind of 2D data that we're processing. This notion of translation equivariance is basically a way to mathematically describe how that structure is baked into these operators. So this is kind of interesting: it's a way that we can build our intuition about how images ought to be processed into the design of our operators, not into the design of our feature extraction methods, as we saw at the beginning. The question is: why do you do a translation? You don't; this is not something you're actually going to do. This is basically a mathematical curiosity, right?
[01:07:27] To be clear, you should not generally do this inside of your neural networks. It's interesting to note that this happens to be true, but you would not do this inside of your neural networks. And if you were a mathematician, you would call this a commutative diagram, and mathematicians love those things. Okay, so that's basically the summary of today. We talked about convolutional networks, we talked about why they're interesting, and we talked about these two new operators of convolution and pooling. Next lecture we'll see how to stitch those together into CNN architectures. See you next time for that.

================================================================================ LECTURE 006 ================================================================================

Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 6: CNN Architectures

Source: https://www.youtube.com/watch?v=aVJy4O5TOk8

--- Transcript

[00:00:05] Hi everyone. My name is Zane.
[00:00:09] I realized I actually didn't introduce myself in the first lecture I gave, which was lecture three, but I'm one of the co-instructors for the course. My name is Zane Durante. I'm co-advised by Ehsan and Fei-Fei, and I'm a fourth-year PhD student at Stanford. In this lecture today, lecture six, we'll be talking about training convolutional neural networks and also CNN architectures. I would say this lecture is really broken up into two different components. The first one is telling you how to piece together all of the different building blocks that we've learned, like convolutional layers and linear (fully connected) layers, to create a CNN architecture. We'll go through some examples, and then we'll talk about how you actually train these and all the steps involved there.
[00:00:56] So as I mentioned before, we'll have basically two different topics. The first one is how to build CNNs, and by this I mean how you actually define your CNN architecture to set it up to be trained; the second set of topics today is how you train CNNs. Starting with the first set of topics, we'll go through the layers in convolutional neural networks. If you recall from last lecture, we learned about the key layer in these models, which is the convolution layer. The way these layers work is that they have filters; you have a predefined number of filters per convolution layer, in this case six. They match the depth of your input data. So in this case we have a 32×32 RGB image, so we have three depth channels. Each of these filters slides across the image and calculates a score at each point.
[00:01:43] At that location in the image, you take the dot product of the values in the filter with the values in the image: you multiply these values together, sum them up, and then add a bias term. That's how you calculate each value in your output activation map on the right. So you have these sliding windows that go across the image, they calculate a score at each position, and that's how you get these activation maps, with one per filter. Normally we'll apply a ReLU or some other nonlinearity activation function at the end. This is from last lecture, so I won't spend too much time on it. The question is: for images the depth is equal to the number of channels, RGB, but here the depth of the output is six. So if we had a second convolution layer afterwards, its filters would need to go across all six of these activation maps.
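The dot-product-plus-bias step can be written out directly. A minimal sketch in plain Python; the patch, filter, and bias values are made up just to show the shapes:

```python
def conv_score(patch, filt, bias):
    """One output value of a convolution layer: the dot product of a
    filter with the image patch under it, plus a bias term. Both patch
    and filt are [depth][height][width] nested lists of equal shape."""
    s = bias
    for pc, fc in zip(patch, filt):       # over depth channels
        for pr, fr in zip(pc, fc):        # over rows
            for p, f in zip(pr, fr):      # over columns
                s += p * f
    return s

# Hypothetical 2-channel, 2x2 patch and matching filter.
patch = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
filt  = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
print(conv_score(patch, filt, bias=0.5))  # (1 + 4) + (0 + 1 + 1 + 0) + 0.5 = 7.5
```

Sliding this computation to every spatial position yields one activation map per filter, as described above.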
[00:02:32] So the next layer would have a depth of six. Okay. And then the second layer we talked about, which is much simpler than the convolution layer, is this idea of a pooling layer. Here it's still a sort of filter that we slide across the image, a 2×2 filter with stride two, so we're skipping over; we're not doing every single location. And here it's a max pooling, so we're just taking the max of each of these areas, and that's the value we get. Or you could do an average pooling. These are both commonly used, I would say, and if you're creating a new architecture, you would probably just try both of them and see what performs better. But the basic idea is to consolidate along the height and width dimensions of your image. Okay.
[00:03:15] So at this point in the course, we've basically gone over the whole top row here: convolution layers, pooling layers, and also the fully connected layers. Those were the first layers that we talked about, in the neural networks lecture, where it's basically one matrix multiply followed by an activation function. For the rest of this lecture, I'll talk about the remaining layers that you see in CNNs, at least the commonly used ones. These include normalization layers, which I'll go into; then dropout, which is a regularization technique that's used in the model architecture itself; and then finally we'll revisit the activation functions, and I'll tell you about the most commonly used ones, both historically and in the modern era of deep learning.
[00:03:56] So starting out with normalization layers, the basic idea here is that we're going to calculate statistics like the mean and standard deviation for our input data, use those to normalize the data, and then learn what the optimal distribution is for the model at that point. Very concretely, we learn parameters that will scale and shift our input data by a learned mean and a learned standard deviation. All of these normalization layers work in two steps. The first is to normalize the data coming in to be a unit Gaussian: mean zero, standard deviation one. Then we scale and shift it: multiply by some value to increase or decrease the standard deviation, and then shift it to change where the mean is. All normalization layers do this.
[00:04:47] But the way that they differ is in how they calculate the statistics: how are you calculating the mean and standard deviation, and which values are you applying these calculated statistics to? All normalization layers are doing this same high-level process. So I'll talk about layer norm, which is, I would say, the most commonly used normalization layer in deep learning today, and it's really commonly used in transformers specifically. So you can imagine you have some data coming in, X, which is a batch of size N. So we have N samples coming into our model, and each of these is a vector of dimension D. What layer norm does is calculate a mean and standard deviation for each of our samples separately. So we're calculating the mean along the depth, or the dimension D here, and likewise the standard deviation.
[00:05:39] Then we learn parameters, and these are learnable parameters, learned via gradient descent in our model, to then apply to each sample. So after we calculate our statistics in this way, treating each sample separately to calculate the mean and standard deviation, we apply these learned scale and shift parameters. So we subtract the mean and divide by the standard deviation within our input data to normalize it, and then we apply the scale with multiplication, and then the shift. So this is the idea behind layer norm, and at a high level, all of these different normalization layers are computing very similar things; the main difference is how they compute the mean and standard deviation.
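A minimal NumPy sketch of the two steps just described, for the (N, D) vector case; the function and parameter names, and the small `eps` added for numerical stability, are illustrative rather than from the lecture:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm over inputs of shape (N, D): each of the N samples is
    normalized with its own mean and standard deviation along D, then
    scaled by gamma and shifted by beta (learnable, shape (D,))."""
    mu = x.mean(axis=-1, keepdims=True)    # per-sample mean, shape (N, 1)
    var = x.var(axis=-1, keepdims=True)    # per-sample variance, shape (N, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 1: unit Gaussian per sample
    return gamma * x_hat + beta            # step 2: learned scale and shift

x = np.random.randn(4, 8) * 3.0 + 2.0                # N=4 samples, D=8 features
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```

With gamma = 1 and beta = 0 each output sample is (approximately) zero-mean and unit-variance; during training, gradient descent then moves gamma and beta away from these initial values.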
[00:06:30] So this is a really nice visualization from a paper called group normalization that introduces a new way to normalize. It's not so commonly used these days, I would say, but this is actually a really great way to gain intuition about how these different normalization layers differ. So for layer norm, I described the really simple case where we just have vectors that we're normalizing, but in the case of convolutional neural networks, we have a channel dimension, or the depth, and we have the height and the width, or the spatial dimensions of the image. So what layer norm does is, for each sample, we're still processing it separately, and we're calculating the mean across all of the channels, all of the heights, and all of the widths.
[00:07:11] So if we look back at this diagram here, you would basically be calculating one mean and one standard deviation over all of these values. So for each of our input data points, we're calculating one mean and one standard deviation across all of the channel, height, and width dimensions. This is what layer norm is doing. But you could feasibly imagine calculating these statistics differently. With batch norm, you're taking each channel, so each channel gets one mean and one standard deviation, and you're applying it just to that channel; you're averaging across all the data in your batch. Instance norm is even more granular, and then there's group norm.
[00:07:53] So I just want to point out that all these layers are trying to do the same thing, where you're basically normalizing your data and then having these learnable scaling and shifting parameters, but the way they do it differs because they're calculating the statistics using different subsets of your input data. Yeah. So the question is: for layer norm, are we calculating one mean and one standard deviation for each image or input data point? Yes, they're all calculated separately. But for batch norm, that would not be the case in this example here. For batch norm, it's actually within the mini-batch: when you're doing gradient descent, you have a small batch of data you're looking at, you feed it into your model, and you're calculating the per-channel mean and standard deviation based on all of the data in your batch.
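One way to make these "different subsets" concrete is by which axes of an (N, C, H, W) activation tensor each layer averages over; a sketch of the statistics step only (group norm omitted, and the variable names are mine):

```python
import numpy as np

x = np.random.randn(2, 3, 4, 4)  # (N, C, H, W): batch, channels, height, width

# Layer norm: one mean per sample, computed over (C, H, W)
ln_mean = x.mean(axis=(1, 2, 3), keepdims=True)  # shape (N, 1, 1, 1)

# Batch norm: one mean per channel, computed over (N, H, W), i.e. the whole batch
bn_mean = x.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, C, 1, 1)

# Instance norm: one mean per (sample, channel) pair, computed over (H, W)
in_mean = x.mean(axis=(2, 3), keepdims=True)     # shape (N, C, 1, 1)
```

The standard deviations are computed over the same axes, and the resulting statistics broadcast back over exactly the values they were computed from.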
[00:08:41] Yeah, I think if you can understand this diagram, you understand what all of the different normalization layers are doing. So it might be worthwhile after lecture, if you still don't fully understand it, just to go through and make sure you understand it: shaded in blue are the values we're both calculating our statistics over and then applying the mean and standard deviation to. Yeah, one final question and then we'll go on: is channel the same as the layers? So channel here is the depth, so the number of values you have at each spatial location. Okay, cool. So we've talked about normalization layers.
[00:09:27] The key idea is that you're calculating these statistics, applying them to your input data, and then learning a scale and shift parameter that you then apply. So the next type of layer we'll talk about is called dropout, and this is a regularization layer in CNNs. This is the final layer that you'll need to learn before we can start going through all of the different CNN architectures that people have created over the years. So with dropout, the basic idea is to add randomization during the training process that we then take away at test time, and the goal is to make it harder for the model to learn the training data so that it will generalize better. So this is a form of regularization. The way we do it concretely is that in each forward pass of our layer, we'll randomly zero out some of the outputs, or activations, from that layer.
[00:10:22] And the main parameter you have for this dropout layer, which is just a fixed hyperparameter, is the probability of dropping out the values; 0.5 is probably the most common, and 0.25 is also commonly used. So you're just dropping out a fixed percentage of the values. And so, going forward to the next layer, these would be zero, and you don't really need to calculate those values. Basically, all of those outputs are zero at this point, so there are some tricks you can do with masking so that you don't even need to compute them, because zero times any value will be zero. So you might ask, why does this work? And I would say this is more of an empirical thing than something well understood from a theoretical standpoint.
[00:11:13] But there are actually some ways you can view what dropout is doing to gain intuition for why it might be useful. It basically forces your network, you can imagine, to have redundant representations. So suppose we have a list of features that we're learning at a given layer, say the layer right before the output of our model, and we have a CNN that is extracting each of these features, so it can detect if there are ears in the image, or if there's a tail, if it's furry, if it has claws, and you want your model to output the probability of a cat score. One of the things that's useful about this is that, because some of these values might randomly be dropped out during training, your model can't over-rely on certain features being present
[00:11:55] in some of the classes, and it actually needs to learn a broader set of correspondences between your features and your output classes. So the model can't just hard-focus on: okay, well, if it has an ear and is furry, and it just so happens that these are always cats, or if it has claws and it has an ear, it'll almost always be a cat in your dataset. So it'll actually help you generalize better to new features, despite the fact that in your dataset there might be really strong correlations between the co-occurrence of certain features and your output class. By having dropout, you're essentially making it so the model can't rely on these during the training phase, because it won't always see the pairs of features together.
[00:12:41] So this is an example for cat, and the question is: if we had something like tree instead, how would you determine which features to drop out? So the dropping-out part is actually completely random; we're not making any choices about this. In this case, 50% of your features at any given step will be dropped out and set to zero. So yeah, you don't have to make choices about it, which is kind of nice, but it is completely random. How would the model know if you're only seeing a subset of the features, like tail and claw here? The point is that you will actually do worse on the training data, because you're only seeing a subset of the features. So it does make the model worse by not having all the information, but then it does better at test time; that's the idea.
[00:13:24] So, worse at training time and better at test time, because at test time you no longer have this dropout. So the final component here, which maybe I should have explained first before fielding questions, is that at test time you're no longer dropping out any of the values. This is randomness that we're adding during the training phase only. At test time, we never mask any of the output activations, and we remove this dropout idea altogether. Now, one thing we need to note is that because we were dropping out 50% of the activations during training, at test time you basically have twice as many values being input to each of your layers. And this can cause issues if you don't scale it.
[00:14:07] So what you need to do is multiply by the probability of keeping a value, so that the magnitude of the values coming into each layer is preserved between training and test time. Otherwise, if you're dropping 50% of the values during training and then at test time you just include all of them, you'll get really weird behavior, because you'll be seeing a much larger magnitude of inputs than before. Yeah. So what about backward prop? For backprop, when you have these zeroed values, you don't need to traverse that path of your directed graph anymore. It's very similar to ReLU: if you have a zeroed value at that point, the gradient becomes zero, so anything further back in your computational graph gets no gradient calculated through that path.
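A minimal sketch of the train/test recipe described here, with `p_keep` as the probability of keeping a value (the names are mine; note that many libraries instead implement "inverted dropout", dividing by `p_keep` at training time so that test time needs no scaling):

```python
import numpy as np

def dropout_train(x, p_keep, rng):
    """Training time: randomly zero each activation with probability 1 - p_keep."""
    mask = rng.random(x.shape) < p_keep  # True where the activation survives
    return x * mask                      # zeroed paths also get zero gradient in backprop

def dropout_test(x, p_keep):
    """Test time: keep every activation but scale by p_keep, so the expected
    magnitude of the inputs to the next layer matches training time."""
    return x * p_keep

rng = np.random.default_rng(0)
x = np.ones(1000)
train_out = dropout_train(x, p_keep=0.5, rng=rng)  # about half the values are zero
test_out = dropout_test(x, p_keep=0.5)             # everything kept, scaled by 0.5
```

On average E[train_out] = p_keep · x, which is exactly what test_out computes, so the two regimes agree in expectation.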
[00:15:00] If you're dropping out certain values or activations, the weights associated with those specific activations will not be updated during gradient descent. Yeah. So the question is, and maybe I'll reframe it: what are we doing at test time? At test time, we are using all of the output activations; we're not dropping them out anymore, but we need to scale by the keep probability. So we multiply each of our output activations by this p value, because now we're using all of them. Otherwise, you can imagine that each node sees a significantly higher number of inputs at test time than it did during training.
[00:15:37] So you need to multiply by this p value to maintain the same magnitude of your inputs coming in, and the variance stays the same, and all these different properties work out very nicely if you do it like this. Yeah. So the question is: can you just add noise to the image instead? The answer is yes, and we'll go over how to do that in future slides. Yes, that's a great idea, to add noise to your image. Okay, some specific code here. I won't go over this because we already mentioned it, but you're dropping a p percentage of your activations here, and then you multiply here at test time. Okay. The next topic I'll talk about is activation functions. So you've all basically learned all of the key layers in CNNs now, and next we're going to talk about these activation functions.
[00:16:21] If you remember, the whole point of these activation functions is to introduce nonlinearities into our model. Right now, with these convolution operators, the kernel sliding across the image, and the fully connected layers without activations, they're all just linear operations, because they're multiplications and additions. And the whole point of the activation function is to add nonlinearity. So, historically, sigmoid was a really commonly used activation function, but there's actually a key problem with sigmoid that is the reason why it's no longer used today. Sigmoid, if you graph it, looks like this; you can see the equation in the top right of the slide here.
[00:17:04] And the main issue is that, empirically, what happened was that after many layers of sigmoids, you would get smaller and smaller gradients as you compute backprop. So starting from the end, the gradients are fairly large in magnitude, and as you undergo multiple layers of backpropagation toward the initial, early layers of your model, you would get smaller and smaller gradients. So I'll actually open this question up to the class: this is a phenomenon we see occur with sigmoid, so in what regions of our graph does sigmoid have a really small gradient? Yeah, very negative and very positive values is correct, and this is actually a huge issue.
[00:17:44] I mean, you can visually see here in the graph that the gradient is very flat: if you take the derivative here, it's very small. So basically, for almost all of our input space, from negative infinity to positive infinity, you have very small gradients, and it's only this narrow range in the middle where you have something that's nonzero. It approaches zero very quickly on both extremes, and this means that if the values coming into sigmoid are very large or very small, then your gradient will be very small. This is one of the main reasons why ReLU became super popular, because now in the positive region we don't have any of this behavior; it's just a derivative of one there. But in practice, you still have this flat portion on the left, where your gradient is zero. So now we basically have half of our input domain here:
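To see the vanishing gradient concretely, here is a small sketch of sigmoid and its derivative; the identity σ'(x) = σ(x)(1 − σ(x)) is standard, and its maximum is only 0.25, at x = 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))    # 0.25, the largest the gradient ever gets
print(sigmoid_grad(10.0))   # ~4.5e-5: a large input gives an almost-zero gradient
print(sigmoid_grad(-10.0))  # same for a very negative input, by symmetry
```

Since each sigmoid layer multiplies the upstream gradient by at most 0.25, stacking many of them shrinks gradients geometrically on the way back to the early layers.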
[00:18:37] we get a gradient of one, and the other half is zero, which is better than almost all of it being zero or very close to zero except for a small region in the middle. So in practice these work better, and it's also much cheaper to compute a max operation between zero and your input value than to compute the sigmoid function. For those two reasons, ReLUs became super popular. But you still have the issue that for any negative input, you get a zero gradient. So more recently there have been popular activation functions that avoid this by having a non-flat section of the activation function in the neighborhood near zero. This is GELU, and there's also SiLU, which I'll show on a slide but won't go over the formula. They look very similar.
[00:19:25] The basic idea is to smooth out the non-smooth jump in ReLU's derivative, from 0 to 1 at x = 0. ReLU is a very sharp, non-smooth function there, but the nice part about GELU is that we actually have nonzero gradients here, and in the limit as x approaches infinity or negative infinity it converges to ReLU as well, while giving smoother behavior in the middle. Specifically, what GELU calculates is x · Φ(x), the Gaussian error linear unit, where Φ is the cumulative distribution function of a standard Gaussian. So if you imagine the area under the curve of a Gaussian, that's what Φ(x) is at any point x.
[00:20:11] So if you have a really negative value, Φ(x) is close to zero, which is why GELU converges to ReLU's zero there; and at a very positive value Φ(x) gets very close to one, the full area under the curve, so GELU converges to x. So this is GELU: it has these nice properties, it converges to ReLU at the extremes, and it is the main activation function used in transformers today. [00:20:38] If you look at all of these and squint, a lot of them look the same. The basic idea is something relatively flat that, in the limit, approaches f(x) = x and becomes a linear line. SiLU is actually x · sigmoid(x), which also has the property that for a very negative value the sigmoid factor is close to zero and for a very positive value it's close to one. So it behaves similarly to the cumulative distribution function Φ of the unit Gaussian, which is why the shapes look really similar too. [00:21:15] Okay. So you might ask where these activations are used in CNNs, and the general answer is that they're placed after linear operators. Almost any time we have a feed-forward layer, a linear layer, or a fully connected layer (these are all words for the same layer: a matrix multiply followed by an activation function), or a convolutional layer, that's where we place the activation function: after the convolutional layer or after these linear layers.
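The three activations discussed above fit in a few lines each; a sketch (editor-added, using the exact error-function form of GELU rather than the tanh approximation some libraries use):

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # GELU(x) = x * Phi(x); Phi is the standard-normal CDF, written via erf
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

def silu(x):
    # SiLU (a.k.a. Swish): x * sigmoid(x); sigmoid plays the role of Phi here
    return x / (1.0 + math.exp(-x))

for x in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"x = {x:5.1f}   relu = {relu(x):8.4f}   gelu = {gelu(x):8.4f}   silu = {silu(x):8.4f}")
```

Both GELU and SiLU match ReLU at the extremes but stay smooth, with nonzero gradient, near zero.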
[00:21:45] Okay, so you've now learned about all the components of CNNs, and I'll go through some examples of how we put them together and how people have created state-of-the-art convolutional neural network architectures. [00:21:58] I think this is a really neat slide because it plots two different values. On one hand we have the error rate, the blue bars, over time, for different models people have trained on ImageNet; and the orange triangles represent the number of layers those models have. You can see that at the same point where we get a significant drop in error, where we actually surpass human performance for the first time, we see a huge increase in the number of layers. We'll go over in class today how they were able to achieve this and what the design challenges and goals were for how they did this.
[00:22:36] Historically, AlexNet was the first CNN-based paper that worked really well on ImageNet, and they were able to train it by using GPUs. We talked about this earlier in lecture, so I won't spend too much time on AlexNet from a historical lens, but I do want to compare it to another architecture called VGG, which was a really standard and commonly used architecture in the 2010s. I can plot the two CNN architectures side by side here. [00:23:06] In general in AI, we like to draw our model architectures as block diagrams, where each block represents a different layer or a group of layers stacked together. It also helps you gain intuition about the general differences at an initial glance. The orange blocks, which are the common ones here, are 3x3 convolution layers.
[00:23:29] So these are convolution layers with 3x3 filters sliding across the input. Their stride is one, so they visit every location in the image, not skipping anything, and they add padding of one around the outside so that we're not shrinking the activation as we apply these convolution layers. They also add max pooling layers throughout. [00:23:53] And you'll notice that, for all of these, after the last pooling layer they do two fully connected layers of dimension 4096, followed by one of dimension one thousand. The reason we have a thousand at the end is that ImageNet has a thousand different image categories, so we need scores for each of those categories. The final layer is always equal to the number of classes in your image classification problem.
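The claim that 3x3/stride-1/padding-1 convolutions preserve spatial size follows from standard convolution arithmetic. A small sketch (editor-added; the 2x2/stride-2 pooling size is the usual VGG choice, assumed here since the transcript doesn't state it):

```python
def conv_out_size(in_size, kernel, stride=1, pad=0):
    # standard convolution arithmetic: floor((in + 2*pad - kernel) / stride) + 1
    return (in_size + 2 * pad - kernel) // stride + 1

# 3x3 conv, stride 1, padding 1: spatial size is preserved
print(conv_out_size(224, kernel=3, stride=1, pad=1))   # 224
# 2x2 max pool with stride 2 halves it (assumed VGG-style pooling)
print(conv_out_size(224, kernel=2, stride=2, pad=0))   # 112
```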
[00:24:22] And so you can see VGG actually looks extremely similar; it's sort of a scaled-up version of AlexNet with more layers, and they're now doing three groups of convolutions at a time followed by pooling, rather than two convolution layers per pooling stage, or even one. It's actually pretty remarkable that there are basically only three different types of layers in these models, yet they performed extremely well compared to anything people had tried before that point. These are, I would say, the simplest models we're going to discuss today. [00:24:54] But you might ask why they're doing 3x3 convolutions: how did they pick that value? There is actually some intuition behind how they chose 3x3, and specifically why they have groups of three or even four of these. So I'll ask you all a question.
[00:25:12] What is the effective receptive field? We looked at receptive fields last time, but it's basically the parts of your input image that a particular value in your activation map has seen: which input values have been used to compute that value after many layers of your model. So we have three of these layers that are all 3x3 convolutions with a sliding filter of stride one. What is the effective receptive field of each value in activation map A3 here, after the third layer? [00:25:46] I'm showing one of the layers here. You can see that each value in A3 is computed by looking at a 3x3 grid of values in A2; then for each value in A2, a 3x3 grid in A1; and for each of those, a 3x3 grid in the input. I'll let you all think about this for a little bit.
[00:26:05] Maybe it will help to see the next layer here; it is actually really helpful to visualize this. At A1, each of the corner values is calculated from its own new 3x3 grid. So, from our input, how large is the overall square? 7x7, yes, exactly. This first one is 3x3, the next is 5x5, and the next is 7x7, and we can visualize that here pretty easily. [00:26:34] The nice thing about the 3x3 convolution with stride one is that you're always adding two to your receptive field at each layer, because each point looks one position to the left, to the right, above, and below. So after many blocks of those, you're just adding two each time.
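The +2-per-layer rule can be sketched directly (editor-added; stride-1 stacks only, as in the lecture's example):

```python
def receptive_field(num_layers, kernel=3):
    # for stacked stride-1 convolutions, each layer adds (kernel - 1)
    # to the receptive field: 3x3 layers add two per layer
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf

print([receptive_field(n) for n in (1, 2, 3)])   # [3, 5, 7]
```

Three stacked 3x3 layers see the same 7x7 input window as a single 7x7 convolution.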
[00:26:56] Okay. So we basically just showed that a stack of three of these 3x3, stride-one convolution layers has the same effective receptive field as one 7x7 layer. [00:27:08] Yeah, so the question is how much of this is justification after the fact versus intuition that they then used to design the experiments. I think it probably depends on the architecture. For some of them it's more intuition-focused; the one we're going to cover next really was a whole research direction spawned by an empirical finding and a thought experiment about it. That's ResNets, and there I think there's actually pretty good intuition that led the whole investigation into what would work well.
[00:27:35] But for this one, I can't speak for the authors on whether it's justification after the fact, based on empirical findings, or whether it was involved in the design choices, because I haven't seen them speak publicly about that. For ResNets, though, I do know it was the hypothesis that led to the creation. [00:28:00] But this is actually a really nice property: three of these 3x3s have the same effective receptive field as one 7x7 layer, and they actually have fewer parameters too. Imagine the channel dimension staying the same at C. Each 3x3 filter spans your input channels, so it has 3 · 3 · C values; and if we have C of these filters, that's 3 · 3 · C · C, or 9C², per layer, and we have three layers total, so 27C².
[00:28:30] So if we look at it through this lens, it's actually fewer parameters than one 7x7 layer's 7 · 7 · C · C = 49C², and we're building a more complex, more nonlinear model, since each 3x3 layer gets its own activation function. So: fewer parameters, and it can model more complex relationships among your input data. This is maybe why stacking these 3x3 layers together can be better than sliding one larger filter across. [00:29:03] Okay. I'll now talk about ResNets, which very much brings up the thought experiment someone just asked a question about.
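The parameter comparison above can be sketched as follows (editor-added; the channel width C = 64 is a hypothetical example, and biases are ignored):

```python
def conv_params(kernel, in_ch, out_ch):
    # each filter has kernel*kernel*in_ch weights; one filter per output channel
    return kernel * kernel * in_ch * out_ch

C = 64  # hypothetical channel width; the slide keeps C fixed across layers
stacked = 3 * conv_params(3, C, C)   # three 3x3 layers: 27 * C^2
single = conv_params(7, C, C)        # one 7x7 layer:    49 * C^2
print(stacked, single)               # the stack has fewer parameters
```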
[00:29:12] So there was actually an empirical finding that spawned a lot of the conversation and thought around designing ResNets, and the idea was this: if you keep stacking deeper layers on a plain CNN, something like this, just adding layers and building it larger and larger, what happens? What they found is that the 20-layer model will actually have lower test error than a 56-layer model. You might think this is because of overfitting, but it's actually not, because if we look at the training error, the training error of the 20-layer model is also lower. Lower training and lower test error basically means the smaller model is doing better on all accounts. So why is this 56-layer model performing worse than a 20-layer model?
[00:30:01] It might be confusing, and we know, as I mentioned right before, it's not caused by overfitting. These deeper models have more representational power, and theoretically they should be able to represent any model that a shallower network can. The set of possible mappings between your inputs and outputs for the larger network is a superset of the set for the smaller network, because theoretically you could imagine setting some of the layers to be the identity function, layers doing nothing; if you set half your layers to do nothing, you have exactly the same representational power as a model half the size.
[00:30:49] So the idea is not that these models are worse in terms of representational power; it's that they're actually harder to optimize, because the set of possible models for the deeper network is larger and contains all of the possible models the shallower network could learn. [00:31:11] I sort of hinted at it before, but how specifically could the deeper model learn to be at least as good as a shallow model? If we have a one-layer model on the left and a two-layer model on the right, and we set one of the layers to essentially be an identity matrix, just an identity function, then the two-layer model should be at least as good as the shallow model. So how do we actually build this intuition into our models?
[00:31:44] We want them to be able to be just as good as a shallower model, if they want to be, during optimization. The way we do this is by fitting what's called a residual mapping, instead of directly trying to fit the desired underlying mapping. What this looks like is: we take the value x and copy it over, past our convolution layers, so that the value at this point receives x, our original input, as well as the output of our stack of two convolutions. [00:32:19] So at this point, F(x), which is called the residual mapping here, could just learn zero values for all the conv filters; the output here would be zero, we would add x along the skip path, and we would get x.
[00:32:36] So it gives the model a very simple way to bypass these layers if it doesn't need to learn anything for them. What this means is you can now really easily learn the identity function we talked about earlier, just by learning zero filters, filters filled entirely with zero values. [00:32:55] Or, more practically, the layers just need to learn very small values, because instead of learning the entire mapping from x to H(x), they just need to learn the difference, F(x). So you're now just learning the difference between your desired output and the copied-over input. This is called a residual block, or residual connection: you copy values from an earlier layer into a later layer of your model and then add them to the values at that point.
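A toy sketch (editor-added) of the residual computation y = F(x) + x, with hypothetical stand-in layers; in the real block, F would be two 3x3 convolutions with activations:

```python
def residual_block(x, layer1, layer2):
    # y = F(x) + x: F is the stack of two layers; the skip adds x back unchanged
    fx = layer2(layer1(x))
    return [f + xi for f, xi in zip(fx, x)]

# Hypothetical stand-in for conv layers whose filters have all learned zero:
zero_layer = lambda v: [0.0] * len(v)

x = [1.5, -2.0, 3.0]
# With all-zero filters, F(x) = 0, so the block reduces to the identity:
print(residual_block(x, zero_layer, zero_layer))   # [1.5, -2.0, 3.0]
```

This is exactly the easy-identity argument from the lecture: the block passes x through untouched when its layers learn nothing.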
[00:33:32] So I talked a bit about the intuition, which was this observed phenomenon that these larger networks were achieving worse training and worse test error because they were harder to optimize. So the intuition was: we need to build a model that can really easily model the shallower networks, so it can be at least as good as a shallower model. The way they did this was by adding a residual connection, so that you can just copy over the values easily, building that into the architecture itself rather than trying to learn some identity mapping among the convolutional layers. And empirically this was shown to work extremely well too. Yeah. So what does the residual block carry? So we have our input x, we pass it through two different convolutional layers, and we get our output F(x). x is literally just copied over here.
So this is [00:34:17] exactly the same as x, and we add it to the output of these two blocks, which is F(x). Yeah, x is the output of one of the previous layers, or, if it's the very first layer of the model, it would be the image. [00:34:28] Yeah. So the question is: maybe you just don't have enough data, and if you added enough data, then maybe you could train a model without these blocks? I think these blocks actually do help you a lot with learning from more data. I think the issue was really an optimization problem. So transformers use residual blocks for exactly the same reason, because I think it actually helps you model these more complex models and it actually enables you to use more data.
So I think [00:34:56] it's very good: residual blocks help you use more data more efficiently, because you're able to more easily model a greater number of functions. [00:35:05] Yeah. So the question is that maybe if we just trained for longer, the performance would eventually converge to the value of the smaller network, and maybe it's just harder to optimize because it takes longer to train a larger model. And I think the answer is no: these were not converging to the performance of the smaller model, regardless of how long you trained it. And the reason is because it's getting stuck in, essentially, local minima, and when you add these residual connections you're avoiding these. The actual explanation for why this is the case is still, I would say, a more active area of research.
It's [00:35:51] really hard to understand exactly what causes these models to get stuck in local minima and not find an optimal solution, or what causes them to not train and find better solutions. And oftentimes this is really an empirical finding, but there's some intuition behind it. And in this case the intuition was that we want to enable our models to do at least as well as the shallower models, which we know were performing better at the time. So it's not that you could just train it for longer and it would do better. It was actually a limitation: it was just completely unable to do as well as the shallower models. [00:36:34] Okay. So here's the overall ResNet architecture. We have these stacks of residual blocks now. So that's what these two blocks here together mean: it's a residual block.
So we have a [00:36:48] 3x3 convolution with a ReLU, followed by another 3x3 convolution, and we're copying over this x value here, adding it to the outputs here, and then we have a ReLU afterwards. So each of these pairs of blocks is one of these residual blocks, and that's why you see this line skipping over here: because the value is getting added forward. [00:37:06] The cool thing about ResNets also is that they basically created a lot of these different depths. So they created a whole family of models, some smaller and some larger. And they showed that as they increased the number of layers, their performance kept increasing, albeit the difference in performance became smaller as you got to the larger and larger models.
So it was sort of reaching a [00:37:32] point where, given the dataset, they weren't able to scale any significant amount by adding more layers beyond that; but they saw significant improvements in performance among especially the earlier models, and then from 101 to 152 layers is where the performance wasn't really changing; it was marginally better, but the performance change was maybe only about 1% at that point. [00:37:50] Yeah. How did they get the number of 152? I actually don't know how they got the number of 152. I think they wanted to try different values here, and you can see that, I mean, they're not exactly doubling, but there's sort of a significant increase each time. I don't know how they picked 152. That's a good question. Maybe they showed it somehow worked better than others; I actually don't know though.
So generally, when you're trying multiple [00:38:15] different numbers of layers for your model, say these are the numbers of layers you want to try, what you'll do is first train the smallest model and see its performance, then add more layers and see if your performance increases, and so on. So that's probably why they stopped at 152: because performance wasn't increasing as much anymore. And also there are GPU memory limitations. So as you get larger and larger models, it becomes harder to train from a hardware perspective, because you need to fit more parameters into your GPU memory. So there is a limit, given your compute setup, on how large a model you can train. I think you need to train these models separately, though. So you have one model run for 18 layers and one for 34, etc.
So the question is [00:39:01] how to think of the intuition of CNN blocks given we're using these residual connections. You can still think of it as higher levels of abstraction, and this is shown to be true in the layers. So within the block itself, instead of just learning the higher-level features, you're learning the delta from the original image to get the higher-level features. That's what you're learning in the block. So you're learning the delta, but you're still achieving these higher-level representations at each step. So that part is the same, but the actual functional way of doing it is that you learn this F(x) that you add your previous input to. So it's like you're learning the delta. [00:39:42] The question is: if you do addition, does that require you to have the same tensor size? The answer is yes.
[00:39:51] And it's part of the reason why it's really a nice property that all of these have this 3x3 convolution with stride one and padding one, so that you maintain the same size at every layer going forward. So after, say, a pooling layer, I mean, you could maybe come up with a way to do it where you sort of unpool the values, but after a pooling layer, for example, you couldn't do a naive addition anymore, because the sizes of your tensors are different. So these are done before a pool, at least the regular ones. I mean, you could get around it by just having each value be spread out into multiple values, for example. [00:40:31] Okay. So these are basically the main takeaways for ResNets.
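The shape bookkeeping behind this answer follows from the standard convolution output-size formula; a small sketch (the specific sizes below are illustrative, not from the lecture):

```python
def conv_output_size(n, k, stride, pad):
    # standard conv/pool output-size formula: (n + 2*pad - k) // stride + 1
    return (n + 2 * pad - k) // stride + 1

# 3x3 conv, stride 1, padding 1: spatial size is preserved, so the skip
# connection's elementwise addition lines up with the branch output
same = conv_output_size(56, 3, 1, 1)   # stays 56

# 2x2 pooling with stride 2 halves the size, so a naive addition with
# the pre-pool tensor would no longer have matching shapes
halved = conv_output_size(56, 2, 2, 0) # becomes 28
```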
One other neat trick they do is [00:40:38] that periodically, after a certain number of these blocks, they'll double the number of filters and downsample the spatial dimension. So basically you can imagine that if you start with a really flat image, as the activations get pushed through the network, they become smaller spatially, but then the depth is larger. So this is how to think of it, and at the very end it just becomes a vector that you then use for classification. So that's how you should be visualizing what's happening to the values in the network itself, and the shape of them. [00:41:10] And then one other thing that's somewhat unique to ResNets, though other architectures do this too: before all these layers with the residual blocks, they have this relatively larger convolution layer, and it was just empirically shown that it did better if they added this here.
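The halve-spatially, double-the-filters progression can be sketched like this (the stage sizes below are illustrative examples, not the exact table from the paper):

```python
# Sketch of how a ResNet-style network trades spatial size for channel
# depth: each downsampling stage halves height/width and doubles filters.
shape = (56, 56, 64)          # (height, width, channels) after the stem
for _ in range(3):            # three downsampling stages
    h, w, c = shape
    shape = (h // 2, w // 2, 2 * c)

print(shape)                  # small spatially, deep in channels
# pooling/flattening then turns this into a vector for the classifier
```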
So [00:41:26] this one is a purely empirical finding. [00:41:29] Okay. Basically, to highlight: these larger models did extremely well. It was the first time they were able to successfully train models with 100-plus layers. So it was a really big deal, and ResNets were then used in a huge variety of computer vision tasks. Almost every task in computer vision was using a ResNet at the time, because they performed so well thanks to these residual connections. [00:41:54] Okay. So we talked about some CNN architectures, the main one being ResNet, and then also VGG historically. So we talked about why the smaller filter size is useful, and why having many layers of these is useful.
So the final thing I'll talk about, [00:42:09] in terms of how we actually construct the CNNs and prime them to be ready for training, is how you actually initialize the weight values of the individual layers. [00:42:19] So depending on what values you choose, you could either put in values that are too small or too large, which would cause significant issues for your model during training. So here it's basically a six-layer network with 4096-dimensional features, just six fully connected layers, and we initialize them here. We take unit-Gaussian random values and then multiply them by 0.01, to get very small values close to zero, and we have a ReLU at each layer too.
So if you [00:43:00] plot the forward pass of this model, you would actually see that at the beginning, because it's ReLU, all the means are going to be positive, and you'll have a mean and standard deviation that is relatively high; but as each layer progresses, because we had a really small weight initialization, the mean and standard deviation become smaller and smaller. And really, ideally, we would want basically all of these to be the same for each layer, because that makes our optimization problem much nicer to solve. [00:43:28] So if we instead use 0.05 rather than 0.01, can anyone imagine what the issue might be if we set it to too large a value? So when it's too small, it goes to zero, basically. What happens if it's too large? Yeah. Basically the activations get larger and larger at each layer.
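The shrinking-activations demo described here can be reproduced with a short numpy sketch (1024 dimensions instead of the lecture's 4096, purely to keep it fast; the effect is the same):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deep fully connected ReLU net, weights = unit Gaussian scaled by 0.01,
# mirroring the lecture's demo (1024 dims here as a fast stand-in).
dim, layers = 1024, 6
x = rng.standard_normal(dim)
stds = []
for _ in range(layers):
    w = 0.01 * rng.standard_normal((dim, dim))
    x = np.maximum(0.0, w @ x)      # linear layer followed by ReLU
    stds.append(float(x.std()))

# with too-small weights, the activation statistics collapse toward
# zero layer by layer
print(stds)
```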
So if you plot [00:43:51] it here, you can see that by the end there's just a massive mean and standard deviation, and if you're training a 152-layer ResNet, you can imagine this becomes quite an issue very quickly. So how do you actually handle this? In this case, maybe the optimal value is 0.022 or something, but how would you actually know that, and how would you do this more generally, for any layer? [00:44:14] There are a few different ways you can initialize weights, and I'll go over the most commonly used one today in class, but know that there are other ones. And generally, what they're a function of is the dimension of your values here. So you'll have a different value for a 4096-dimensional fully connected layer versus a 2048-dimensional one.
And the specific formula [00:44:39] we'll go through is called Kaiming initialization; it's actually from the same person who created ResNets. So, Kaiming He: he was, and still is, a very famous computer vision researcher. I think he's one of the most widely cited computer scientists of the last 10 or 15 years, maybe the most. So he's extremely well known in the computer vision community, and he also came up with this idea of initializing the values with the square root of two over your input dimension size. I won't go over all the details of how they derived this and showed that, with a ReLU activation, this keeps the standard deviation and mean relatively constant throughout the layers. But if you do plot it, you see it does have this effect.
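The same toy network from the lecture's demo, but with Kaiming initialization, i.e. weights drawn with standard deviation sqrt(2 / fan_in) (again a sketch with 1024 dimensions rather than 4096):

```python
import numpy as np

rng = np.random.default_rng(0)

# Kaiming/He initialization for ReLU nets: w ~ N(0, 2 / fan_in).
dim, layers = 1024, 6
x = rng.standard_normal(dim)
stds = []
for _ in range(layers):
    w = rng.standard_normal((dim, dim)) * np.sqrt(2.0 / dim)
    x = np.maximum(0.0, w @ x)
    stds.append(float(x.std()))

# the activation std now stays roughly constant across layers, instead
# of collapsing to zero or blowing up
print(stds)
```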
So you can almost think [00:45:22] of this as a magic formula: if you plug it in, you get the desired properties. And if you want to know the derivation, we link the paper here, so feel free to look into that, but you can just take our word for it. I won't go through the details here, but it does have this desired effect where the mean and standard deviation are unchanging. And you can also imagine that for any given setup, you could also just try, through testing, to find what the value is here. [00:45:46] Okay. So we discussed how you initialize weights, how you combine these different layers together to form a CNN architecture, which activation functions people use, and then all the different layers in CNNs. So we've already covered quite a few topics.
So I think I'll pause very [00:46:06] briefly to see if there are any questions about these. The second part of the lecture is actually much less dense than the first part; we'll mainly be going over a lot of nice practical tips for when you're training these models. [00:46:17] So the question is: how do you do weight initialization for CNNs? You still use this same initialization, but the dimension in here is the size of your kernel. So if you have a 3x3 kernel with, say, six channels, it would be 3 times 3 times 6. It's the same idea; you just calculate your dimensions differently depending on the type of layer. Yeah. [00:46:39] You can think of it as roughly the number of values in each operation, but it does depend on the layer, and some layers use different weight initializations; this is specifically how Kaiming initialization applies to CNNs. Yeah.
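A tiny sketch of the fan-in computation just described, for a fully connected layer versus a conv layer (the 4096 and 3x3x6 numbers are the ones mentioned in the lecture):

```python
import math

def kaiming_std(fan_in):
    # He initialization scale for ReLU layers: sqrt(2 / fan_in)
    return math.sqrt(2.0 / fan_in)

# fully connected layer: fan_in is just the input dimension
fc_std = kaiming_std(4096)

# conv layer: fan_in = kernel_height * kernel_width * input_channels,
# e.g. a 3x3 kernel over 6 input channels as in the question
conv_fan_in = 3 * 3 * 6
conv_std = kaiming_std(conv_fan_in)

print(conv_fan_in, conv_std)
```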
So the question is: why do your [00:46:57] activations explode if you have too large an initialization value? So imagine that at each layer of your initialized network you have a set of randomly initialized values, and if they're very large, then the ReLU activation afterwards doesn't actually cap the outputs of your layer, right? You can go to infinity with ReLU. So if you have too large a set of values, you're essentially repeating the same operation, because you're initializing all the weights from the same random distribution. Then at each layer, you'll be multiplying one set of large values by a set that's been initialized too large, so it becomes larger at each iteration afterwards.
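The blow-up described in this answer is the mirror image of the earlier shrinking demo from the lecture; scaling the same toy network's weights too large makes the per-layer standard deviation grow (1024 dimensions and a 0.1 scale here, both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy ReLU network as the lecture's demo, but with weights scaled
# too large (0.1 is an illustrative value for this layer width).
dim, layers = 1024, 6
x = rng.standard_normal(dim)
stds = []
for _ in range(layers):
    w = 0.1 * rng.standard_normal((dim, dim))
    x = np.maximum(0.0, w @ x)
    stds.append(float(x.std()))

# now the standard deviation grows at every layer instead of shrinking
print(stds)
```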
[00:47:46] I mean, you could think of it as sort of like a recurrence relation, because they're all initialized randomly at the start, where something is being multiplied by a value at each step, and in a simple recurrence relation you'd want that value to be one, right? But because we have a vector of values being multiplied by our matrix, the average output depends on the dimension of the vector, and on what happens after the ReLU: you have basically a standard deviation for your activations, then you remove all the negative ones, and you're left with your outputs at that point. And if you have really large values, you have a really large standard deviation, so when you remove the bottom half of it, your output keeps moving more and more positive. Did that make sense?
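A rough numerical sketch of this recurrence argument (numpy; the function name, depth, and layer width are mine): push random inputs through a stack of linear + ReLU layers and watch the activation scale either blow up or stay stable depending on the weight scale.

```python
import numpy as np

def forward_stds(weight_std, depth=10, dim=512, seed=0):
    """Push random inputs through `depth` linear+ReLU layers whose
    weights have standard deviation `weight_std`, recording the
    activation std after each layer."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(100, dim))
    stds = []
    for _ in range(depth):
        w = rng.normal(0.0, weight_std, size=(dim, dim))
        x = np.maximum(x @ w, 0.0)   # ReLU never caps the positive side
        stds.append(x.std())
    return stds

too_big = forward_stds(0.1)                  # larger than the Kaiming scale
kaiming = forward_stds(np.sqrt(2.0 / 512))   # Kaiming scale for fan_in = 512
```

With the too-large scale, each layer multiplies the activation scale by a constant factor greater than one, exactly the recurrence described; with the Kaiming scale the std stays roughly flat across depth.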
[00:48:32] It's sort of, I didn't have slides to show it, but okay. Mostly, sorry. Yeah, you can see more details in the paper; it's actually not too bad to read, I think. So the conclusion of the discussion here is that normalization will solve this issue of the activations blowing up, but it still might be harder to optimize. Maybe we should do a follow-up post on Ed explaining this in more detail, but I think it's a really good question actually. Yeah. It would solve this particular issue, I think, but maybe it's still hard to do something like this, as in the discussion. Yeah, it's a good question. Okay, cool. So I'll talk about these steps now. How do you actually train your model? The nice thing about data preprocessing is that it's really easy for images.
[00:49:21] So if you have your giant image data set, the standard way to do it is you calculate the average red, the average green, and the average blue pixel values, along with the standard deviations, and you take your input image, subtract the mean, and divide by the standard deviation. And this is how you do data normalization for images; it's actually very straightforward. It does require you to precompute the means and standard deviations for each pixel channel. So sometimes what people will do is use means that have already been calculated; a very common one is to use the ImageNet means and standard deviations and apply those to your input images, even if you're training a model not on ImageNet.
[00:50:03] So it is very data-set dependent, that's the way to think of this, and different models will use different values here depending on their data set, but the most commonly used choice is just the mean and standard deviation from ImageNet. Yeah. So for any input image, you apply this operation before the model sees it. Okay. So yeah, that one was really quick. And then in terms of data augmentation: someone had a suggestion earlier in the class, why don't we just add noise to our image? And that's a great idea, and we'll talk about the different ways you can add noise to your image here. This helps with regularization and helps prevent your model from overfitting.
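As a concrete sketch of the normalization step (numpy; the constants are the widely used per-channel ImageNet statistics that common libraries ship for images scaled to [0, 1], and the function name is mine):

```python
import numpy as np

# Per-channel statistics of the ImageNet training set, as shipped with
# common libraries, for RGB images already scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(img, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """img: (H, W, 3) float array in [0, 1]. Subtract the per-channel
    mean and divide by the per-channel std (broadcast over H and W)."""
    return (img - mean) / std
```

Swapping in statistics computed from your own data set is just a matter of replacing the two constant arrays.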
[00:50:46] So we talked about it before, but this is sort of a common pattern with regularization, where during training time you add some kind of randomness, and then at testing time you average out the randomness. Sometimes this is approximate, but for example, for dropout we saw that during training time we'll randomly drop, say, 50% of the activations, and then at testing time we'll use all the activations but scale them down by the dropout probability p. This is a really common pattern, and it's also used for data augmentation. So you can imagine this cylinder here is your data set: you load an image and a label. So we have a cat label, and we have our original image from the data set, before we actually pass it into our model.
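A minimal sketch of the dropout half of this pattern (numpy; function names are mine). Note one detail: here the test-time scaling uses the keep probability 1 - p so the expectations match; at p = 0.5, as in the example, scaling by p or by 1 - p is the same number.

```python
import numpy as np

def dropout_train(x, p, rng):
    """Training: zero each activation independently with probability p."""
    mask = rng.random(x.shape) >= p
    return x * mask

def dropout_test(x, p):
    """Testing: keep every activation but scale by the keep probability
    (1 - p), so the expected value matches training time."""
    return x * (1.0 - p)
```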
[00:51:34] It's extremely common; basically, in modern deep learning, people will always use data augmentation for training computer vision models. The basic idea is to apply some transformations to the image to make it look different but still recognizable as the category class, and then pass that to your model, where you're computing the loss. One of the nice benefits of this is that you can effectively increase the size of your data set, because instead of seeing each image multiple times, the model will see different versions of the image with different transformations that all still have the same category label. So you can basically get more data, and therefore it increases your generalization capabilities, but your training loss will be higher, because you're not just seeing the same example over and over again. So it makes it harder for the model to just memorize.
[00:52:18] So how do we know the weight initialization is just right? We know it's right in this case because the means and the standard deviations are relatively constant throughout the layers of the network; in one case we saw the activations collapse to zero, and in the other case they were blowing up to infinity as we increased the number of layers. So the way you can ensure it always works is by using the formula; this will always initialize them well, and in practice that's how people do it. If you were creating a new layer that maybe does some different kind of operation that no one's done before, then yeah, you probably would need to try a bunch of different weight initialization schemes and see what works best.
[00:53:05] But generally, for these linear layers or the convolutional layers, you can use this formula here, which is called the Kaiming initialization. Yeah. Okay. So back to data augmentation. What are the different augmentations you can do specifically? One of them is horizontal flipping. This depends on the problem: if you want a model that reads text, this would be a very bad augmentation to use, because now it's like you're looking at the text through a mirror and you can't read it properly. But for everyday objects it's usually pretty good, because most objects are symmetrical, so this property actually works pretty well. And then you could also imagine, if you're looking at images from a microscope or from overhead, that you could also do a vertical flip, and that would make sense.
[00:53:53] But for everyday objects, vertical flipping actually doesn't really make sense, because a cat is almost always seen in this position. But maybe if you had a data set where cats were in all different orientations, you could imagine that flipping or rotating or all these things would make sense for your data set. Another type of augmentation is this resizing and cropping idea. What ResNets and many different image models in deep learning do is basically take a random crop of the image and then resize that to be your image size; they might even take another crop afterwards. So the most common strategy is you pick the length of what is basically the short side of your image.
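A tiny sketch of the flipping augmentation (numpy; the function name and the probability parameter are mine), with vertical flips off by default to match the point about everyday objects:

```python
import numpy as np

def random_flip(img, rng, p=0.5, horizontal=True, vertical=False):
    """Apply each enabled flip with probability p. Vertical flips are
    off by default, since they rarely suit everyday objects."""
    if horizontal and rng.random() < p:
        img = img[:, ::-1]   # mirror left-right
    if vertical and rng.random() < p:
        img = img[::-1]      # mirror top-bottom
    return img
```

For overhead or microscope imagery, where orientation carries no meaning, you would pass `vertical=True` as well.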
[00:54:41] So if your model's input image size is 224 x 224 pixels, you would first pick a value L larger than this (these are commonly used values), and, sorry, you don't crop, you resize the image to that scale. So say this is an 800 x 600 image and we used 256 here: we resize the short side, so 600 would become 256 and 800 would be scaled correspondingly. We scale the short side to L, and then we crop a random patch of 224 x 224 pixels from that image. So you're scaling the image by first preserving the relative resolution, making it smaller or larger to fit this L, and then you take a random crop of that.
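A sketch of this resize-short-side-then-crop recipe using the lecture's 800 x 600 example (numpy; function names are mine, and the nearest-neighbor index resize stands in for a real image-resize routine):

```python
import numpy as np

def resize_short_side(img, L):
    """Scale so the short side equals L, preserving aspect ratio.
    Nearest-neighbor resize via index sampling (a stand-in for a
    proper image-resize routine)."""
    h, w = img.shape[:2]
    scale = L / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    rows = np.clip((np.arange(nh) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(nw) / scale).astype(int), 0, w - 1)
    return img[rows][:, cols]

def random_crop(img, size, rng):
    """Take a random size x size patch."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

rng = np.random.default_rng(0)
img = np.zeros((600, 800, 3))          # the 800 x 600 example, L = 256
resized = resize_short_side(img, 256)  # short side 600 -> 256, long side -> 341
patch = random_crop(resized, 224, rng)
```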
[00:55:35] And this is by far the most commonly used; "random resized crop" is what it's called in most libraries. It's used in most problems because it works pretty well and it preserves the relative resolution of your images. And then there's another neat trick you can do with augmentation, called test-time augmentation. If you really just want to get the best performance possible, you can take a bunch of these different crops and resizes, run them all through your model, and then average your predictions at the end. For ResNets, people will often try a bunch of different scales, a bunch of different crop locations, and maybe even flip the image. Usually you'll start getting diminishing returns, but you can actually get a pretty good 1 to 2% performance boost by using this sort of test-time augmentation.
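A sketch of the test-time-augmentation loop just described (numpy; the function name is mine, and `model` is assumed to be any callable mapping an image patch to a vector of class probabilities):

```python
import numpy as np

def predict_tta(model, img, n_crops, crop, rng):
    """Average predictions over several random crops and their
    horizontal flips; `model` is any callable mapping an image patch
    to a vector of class probabilities."""
    preds = []
    h, w = img.shape[:2]
    for _ in range(n_crops):
        top = rng.integers(0, h - crop + 1)
        left = rng.integers(0, w - crop + 1)
        patch = img[top:top + crop, left:left + crop]
        preds.append(model(patch))
        preds.append(model(patch[:, ::-1]))   # horizontally flipped view
    return np.mean(preds, axis=0)
```

Adding more views (extra scales, the full set of corner crops) follows the same averaging pattern, with diminishing returns as noted.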
[00:56:17] So if you're in a setting where it really matters and you're trying to eke out every last bit of percentage points, then this is actually a really great trick you can use for almost any computer vision problem. Okay. So, a final few augmentations. One is color jitter. Here we're specifically randomizing the contrast and brightness and scaling the image correspondingly, so maybe the colors look more muted or brighter. These are very traditional image processing techniques. And usually, with all these different augmentations, you'll try different values and see which ones make your images still look in-distribution and normal to you as a human.
[00:56:59] And that's a pretty good way to judge what values you should pick for how much jitter you should have, how much brightness variance, etc. So normally, when I'm starting a problem, I'll try a bunch of these different augmentations and see what makes the data look different from the original data but still recognizable to me and still very easy to recognize. That's generally a good set of augmentations to use. The final one: you can imagine just cropping out parts of the image, where you're basically putting a black or a gray box over it. I think this one's maybe less commonly used, but it shows you how you can get creative with the augmentations depending on your problem. Say you're in a setting where things will get covered, where the camera will be occluded, so it won't be able to see the objects fully.
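A sketch of the box-erasing augmentation described here (numpy; the function name is mine, and it is the same idea as the "cutout" / random-erasing family):

```python
import numpy as np

def cutout(img, box, rng, fill=0.0):
    """Blank out a random box x box square of the image, simulating an
    occluded view; returns a copy, leaving the original untouched."""
    out = img.copy()
    h, w = img.shape[:2]
    top = rng.integers(0, h - box + 1)
    left = rng.integers(0, w - box + 1)
    out[top:top + box, left:left + box] = fill
    return out
```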
[00:57:40] This could be a really neat trick to make your model more resilient to stuff blocking parts of your objects. So you could almost imagine, for your given setting: what augmentations make sense? In what ways can you transform your input data so that it's still recognizable to you as a human, but it's harder for the model to memorize the training examples? Okay. So the final set of topics here is basically extremely practical. When you're, say, doing a project or training a model for your course project, I think you should basically do the exact things we're going to be describing in the coming slides. But this also applies outside the course, to any computer vision domain you could be practicing in. So in practice, many times we don't actually have that much data. You know, ImageNet, the original version, had a million images.
[00:58:30] Maybe you don't have a million images for your problem, and almost none of us do, unless you've been collecting vast amounts of data with a huge team. So if you don't have a lot of data, can you still train CNNs? The short answer is yes, you can, but you need to be a little bit smart about how you do it. So I think maybe it was last lecture where we showed how the different filters in your CNN are extracting different types of features. This goes back to what someone asked about the hierarchy of features in convolutional neural networks: at the beginning it's more just edges or patterns or really small shapes.
[00:59:09] And then at the highest level, you can imagine that if we put an image into our CNN and we get this final vector right before the class scores, and we compare that to other images in our data set, you'll actually see that these vector values are really close for similar images. So you can think of this as sort of like the nearest-neighbors thing we did before, but instead of using the pixels of the image, we're looking at the vector at the very end of your CNN, right before the classification layer. So this would be, like, the 4096- or the 2048-dimensional layer. And the difference we look at here is the L2 distance.
[00:59:53] You'll find that for a given image, if you put it into your model and you look at the other images that are close to it in this vector space, right here after you go through all the layers except the last one, the images are extremely close to each other when the items are in the same category. Intuitively, what this basically means is that these features are actually really good: you could build a linear classifier on top of them, or a k-nearest-neighbor classifier, and be able to classify objects extremely well. So how could you use this in practice? What you would do is first train your model on ImageNet, or just grab a model someone else has trained on ImageNet or one of these really large internet-scale data sets, and you can just freeze all of these layers.
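A minimal sketch of the nearest-neighbor lookup in feature space (numpy; the function name is mine, and the features stand in for the penultimate-layer vectors described above):

```python
import numpy as np

def nearest_neighbors(query_feat, bank_feats, k=5):
    """Return the indices of the k feature vectors in `bank_feats`
    closest to `query_feat` in L2 distance, e.g. using the 2048-d
    vector taken right before the classification layer."""
    d = np.linalg.norm(bank_feats - query_feat, axis=1)
    return np.argsort(d)[:k]
```

The only change from the pixel-space nearest neighbors seen earlier in the course is what vector you feed in.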
[01:00:42] So you don't train any of them; you keep them exactly the same as before, and you replace this final layer. Instead of it outputting, in the case of ImageNet, a thousand classes, you replace it with the number of classes you have in your data set. And then, when you're training the model, you only train this layer here. So if we think about it, we talked about how, in the old paradigm of computer vision, you had feature extractors, which were a predefined set of operations to get stuff like color histograms and other predefined features. You can almost think of the frozen model as doing this: it's a predefined feature extractor that we're not changing in any way, but we're using it to calculate features that we then train a model on top of. It's actually extremely similar under that paradigm, because you're not training it here.
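A sketch of training only that final layer on frozen features, i.e. a linear probe (numpy; the function name and the plain gradient-descent softmax regression are my stand-in for whatever optimizer you'd actually use):

```python
import numpy as np

def train_linear_probe(feats, labels, n_classes, lr=0.1, steps=200):
    """Train only a new final layer on frozen backbone features:
    softmax regression by gradient descent. The backbone that
    produced `feats` is never updated."""
    n, d = feats.shape
    w = np.zeros((d, n_classes))            # the single layer we train
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ w
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        w -= lr * feats.T @ (p - onehot) / n          # cross-entropy gradient
    return w
```

If the pretrained features cluster by category, as the L2-distance observation suggests, even this simple probe separates the classes well.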
[01:01:28] And if you have a larger data set, what tends to work best in practice is to actually train the whole model, but initializing it from these values that were pre-trained, say, on ImageNet or some other really large internet-scale data set. So I think for pretty much all of the problems I ever work on, I'm doing this step three here, because I have maybe a million or 10 million training examples. So I'll start with a model that was trained on billions of examples that I don't have the compute for, and then I'll fine-tune the model on my relatively smaller data set, and I'll get better performance than if I just tried to train a model myself, because the model has basically seen more data. That's created a better feature extractor, and then when I fine-tune the whole thing, it can still be specific enough to my problem.
[01:02:14] You're basically taking, say, let's use a very concrete case where we're training a model on ImageNet. We're taking this model and we're replacing the final layer, so that it's no longer outputting a thousand classes; it's outputting, you know, the number of classes in your data set. And we're initializing this randomly, using the Kaiming initialization we talked about before, but the rest of these layers are maintaining the values that they had before. So we're not changing these values, and during gradient descent we're never changing these values. So these values are unchanged.
[01:02:47] We basically take our image, we pass it through our model, and now it's almost like you're just training a linear classifier, where your inputs are these 4096-dimensional vectors for each image that we calculate by passing it through the whole model. Then we have our vector of 4096, and we're just mapping that to the number of classes, and we're only training this mapping at the end. Yeah. So the question is, will you have some bias in your model because it's trained on ImageNet? The answer is definitely. If you do this strategy two, this way of training, then it will do best on data sets that look very similar to ImageNet.
[01:03:28] So these would be like pictures of everyday things like laptops, or maybe a classroom, or a person, things like this, where ImageNet is everyday objects; but if it was, say, photos of Mars, it would do a lot worse. So there's definitely bias based on the training data of the pre-trained model, and you want to get something that is in the same type of distribution, where you're seeing the same kinds of objects or locations or things like that. So the question is, what do you do when your data set is out of distribution? I actually have a slide here to cover some of that, so it's a great question. If you have a very similar data set but very little data, you can use the linear-classifier strategy we just mentioned. If you have a similar data set and quite a lot of data, you'll get the best performance by fine-tuning all the layers.
[01:04:09] These are strategies two and three on the slide that I mentioned earlier. But what about when you have a very different data set? If you have a lot of data, you might just want to start from scratch, or you might get better performance if you still initialize from the pre-trained weights; you would test, but there's no guaranteed way to know whether performance would be better or worse. And then, yeah, if you have very little data and a very different data set, you probably want to try to find a model that's trained on something close. There are specific techniques that researchers have looked into for out-of-domain generalization, this basic idea that you train a model on one domain and you're trying to learn a new domain that's different in some ways.
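The four cases just described can be summarized as a rule-of-thumb lookup (a paraphrase of the slide's quadrants, not its exact wording):

```python
def transfer_strategy(similar_data: bool, lots_of_data: bool) -> str:
    """Rule of thumb for transfer learning, keyed on how similar the new
    data set is to the pre-training data and how much of it you have."""
    if similar_data and not lots_of_data:
        return "freeze the backbone; train a linear classifier on top"
    if similar_data and lots_of_data:
        return "fine-tune all layers, initialized from pre-trained weights"
    if not similar_data and lots_of_data:
        return "train from scratch, or test fine-tuning anyway and compare"
    return "hardest case: look for a pre-trained model on closer data"
```

The upper-right quadrant (different data, little of it) is the one the lecture flags as having no reliably good answer.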
[01:04:47] So this is an active area of research, but I wouldn't say there's a general technique that always works; it's a bit problem-dependent in that setting. Whereas for everything except the upper-right quadrant here, this works pretty well in practice. So there are actually techniques for this, and it's a pretty active area of research, and certain models generalize better; like, I think language models are pretty good at learning a lot of different domains, for example. But yeah, it's definitely the worst scenario to be in, where you have a completely different problem than anyone's ever worked on before and you don't have a lot of data; it's by far the hardest setting to train a model in. So the question is, do you ever do anything between training one final layer and all layers?
[01:05:22] Yeah, people have actually done a lot of work looking into training a subset of the layers. There's also a technique called LoRA, which we might go into in the transformers lecture; I'm not sure if it'll make it this year, but the basic idea is to fine-tune all the layers in a way where you're not changing all the values exactly; instead you're learning basically low-rank differences between the layers, where you're sort of fine-tuning the differences from the original layers rather than fine-tuning the layers themselves. So, yeah, there are techniques: you could use LoRA, and it would need more explanation, but the basic idea is that instead of fine-tuning the actual values, you're fine-tuning these differences between the layers. Sort of like how in a ResNet, you're learning the difference.
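A minimal sketch of that low-rank idea (illustrative only, not the original LoRA paper's or any library's implementation; the layer width `d` and rank `r` are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 512                               # layer width (arbitrary for the sketch)
W = rng.normal(size=(d, d))           # frozen pre-trained weight matrix
r = 8                                 # LoRA rank, much smaller than d

A = rng.normal(size=(r, d)) * 0.01    # trainable
B = np.zeros((d, r))                  # trainable; zero init makes the
                                      # initial update exactly zero

def effective_forward(x):
    # Acts like (W + B @ A) @ x without ever materializing B @ A.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
# Before any training, the low-rank branch contributes nothing,
# so effective_forward(x) equals W @ x.

full_params = d * d                   # fine-tuning W directly: 262144
lora_params = 2 * d * r               # fine-tuning A and B:      8192
```

Here only `A` and `B` would be trained, roughly a 32x reduction in trainable parameters for this layer while still letting every output of the layer change.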
[01:06:07] LoRAs are like that, but you do it with a very small number of parameters. I think the question is, how did they basically decide how many layers to pick? Why did they pick a large number of layers; specifically, why are there two convolution layers of each size instead of one? So it's actually really similar to the example we showed earlier with VGG, where if you have three of these 3x3 convolutions, you're able to have the same receptive field as a 7x7 convolution, but you're able to model more nonlinear relationships, because you have three activation functions rather than just one activation on the 7x7 filter. So basically the stack of 3x3s is more expressive, but you're still looking at the same set of values, as long as you have enough of them. So a larger set of smaller filters is more expressive than a smaller set of larger filters. Okay. Um, okay.
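The receptive-field arithmetic behind that answer is easy to check: for stride-1 convolutions, each k×k layer widens the receptive field by k−1. A tiny helper (illustrative, not from the lecture):

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions:
    each k x k layer widens the field by k - 1."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Three 3x3 convolutions cover the same 7x7 window as one 7x7 conv,
# but with three nonlinearities and 3*(3*3) = 27 weights per channel
# pair instead of 7*7 = 49.
three_small = receptive_field([3, 3, 3])  # 7
one_big = receptive_field([7])            # 7
```

Same window on the input, fewer weights, more nonlinearity: that is the VGG argument in one line of arithmetic.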
[01:06:58] So we'll go on. Um, yeah, basically: try to find a large data set that has similar data, get a model that was trained on that, and then fine-tune it on your own data. Some good links: PyTorch Image Models has a bunch of models that are trained on ImageNet and other data sets, and also just in the PyTorch vision GitHub repo you'll find some too. Okay. I'll talk very briefly at the end about hyperparameter selection. So if you're having difficulty training your model and it's not working right away, I think the best thing you can do is to overfit on a small sample. This is like the default debugging strategy in deep learning, where you just have one data point and you want to see your training loss basically go to zero.
[01:07:36] Your model should be able to memorize one training example, and if it's not able to do that, you have a bug somewhere in your code, or you're not picking the right kind of model to model your problem. So this is a really good training problem, and it'll also tell you what learning rates work and which ones don't, and you'll get a rough idea of the neighborhood of learning rates you should explore. So this is a good way to make sure your model is correct, make sure your learning rate is reasonable, and also make sure you don't have any other bugs that could be impacting your code. So this is always step one if you're having issues just running some code; this is how you debug. The second thing you would want to do after you get this is maybe to try a very coarse grid of hyperparameters.
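That sanity check can be demonstrated on the smallest possible "model": a linear regressor and a single (input, target) pair. Everything here (sizes, learning rate) is an arbitrary stand-in for your real network and a batch of size one:

```python
import numpy as np

# The "overfit one example" sanity check, in miniature.
rng = np.random.default_rng(0)
x = rng.normal(size=3)        # one training input
y_target = 2.0                # its label
w = np.zeros(3)               # model parameters

losses = []
for _ in range(1000):
    pred = w @ x
    loss = (pred - y_target) ** 2
    losses.append(loss)
    grad = 2.0 * (pred - y_target) * x   # exact gradient of the loss
    w -= 0.05 * grad                     # plain gradient descent

# A correct model/loss/gradient drives this loss to ~0. If it does
# not, suspect a bug before touching any other hyperparameter.
```

If the loss plateaus far from zero here, the problem is the code (model, loss, gradient, or a wildly wrong learning rate), not the data.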
[01:08:16] So I would first try different learning rates and see, if you train the model with different learning rates, what the training losses look like. You want the one that has the most sustained decrease in the training loss over maybe one epoch; that's a pretty good estimate, but you can train for longer. Once you get a good set of learning rates, you could then look into other hyperparameters too. And specifically, besides the loss, you'll also want to look at the accuracy curves. So you have your training accuracy and your validation accuracy. If they're both still going up, it means you want to keep training; pretty reasonable. But you might have a scenario where the training accuracy is going up while your validation accuracy is going down: this is overfitting. So we need to either increase the regularization, or, if we can get more data, that could also work.
[01:09:01] But you need to do one of the two in order to improve your performance further beyond the peak right here, which I guess would be the best model you have so far. If you're seeing very little of a gap here, then you can probably train the model for longer, because generally you do want to get to the point where your validation accuracy has been maximized. So if you could just keep training, you know, you could keep training. So even if there's not a significant gap here or anything, if you see the validation accuracy is similar to the training accuracy, you can probably keep training until your training accuracy starts diverging from your validation accuracy, and you can basically repeat this process over and over again.
[01:09:45] Um, one final note in terms of hyperparameter search: normally people think, you know, you have two or more hyperparameters that you're searching over; should you try every combination of the hyperparameters, or what is the best way to do it? I think in practice, a random search over the hyperparameter space works a lot better than a grid search where you're trying every combination of a predefined set. And the reason is mainly because you can imagine you have one axis which is an unimportant hyperparameter, where depending on the value your performance will be roughly the same, versus an important one. If you use random values across all of these, you'll actually search the space of your important hyperparameter much more thoroughly, whereas with the grid search you're sort of wasting time rechecking multiple values of an unimportant hyperparameter.
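The grid-versus-random contrast can be made concrete: with the same budget of nine runs, a 3×3 grid tests only three distinct values of each hyperparameter, while random sampling tests nine. The ranges below are illustrative, sampled log-uniformly since these hyperparameters span orders of magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid search with a budget of 9 runs: only 3 distinct values of each
# hyperparameter ever get tried.
lrs = [1e-4, 1e-3, 1e-2]
wds = [1e-5, 1e-4, 1e-3]
grid_trials = [(lr, wd) for lr in lrs for wd in wds]

# Random search with the same 9-run budget: 9 distinct values of each,
# so the important axis gets probed far more thoroughly.
random_trials = [(10 ** rng.uniform(-4, -2),    # learning rate
                  10 ** rng.uniform(-5, -3))    # weight decay
                 for _ in range(9)]

distinct_grid_lrs = len({lr for lr, _ in grid_trials})      # 3
distinct_random_lrs = len({lr for lr, _ in random_trials})  # 9
```

If the learning rate matters and the weight decay barely does, the random trials have effectively spent all nine runs exploring learning rates, while the grid spent them on three.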
[01:10:26] So in practice, you should define the ranges you want to try and then just randomly sample hyperparameter values from those ranges; that's probably the best way to do it, and you just keep running until you get the best model. Okay, that's it. So we talked about layers in CNNs, activation functions, CNN architectures, and weight initialization: how you actually define and build these models. And then we talked about how you actually train them: how you change your data to be input to the model, how you augment it; transfer learning, which is a really neat trick for improving performance; and then how you pick the best hyperparameters. So, yeah, we covered a lot in lecture today. Thank you all so much. Uh, yeah.
================================================================================ LECTURE 007 ================================================================================ Stanford CS231N | Spring 2025 | Lecture 7: Recurrent Neural Networks Source: https://www.youtube.com/watch?v=kG2lAPBF7zA --- Transcript [00:00:05] So hello everyone, welcome to lecture 7. Um, I also wanted to go over some clarifications from last time. When I gave lecture last time, there were two Ed posts that I think were good that you all might want to check out, but in case you haven't seen them, I'll just go through them really quickly. I think when describing dropout, how to scale probabilities at test time, there was a bit of confusion during lecture, and basically what I said and the slide had a sort of mismatch. So in each forward pass for dropout, we have this hyperparameter p, which is either the fraction of neurons you're dropping out or the fraction of neurons you're keeping, depending on which implementation of dropout you're using.
[00:00:48] Generally it's the fraction you drop out; in most libraries, that's what p means. But the basic idea is that at test time you want the expected output to be the same as at training time. So this means that if you dropped 25% of your activations during training, at test time you would scale by 0.75 so that the expected output is the same. And I think there was a bit of confusion because in this slide, the implementation uses p as the probability of keeping a unit active, so there's a bit of a mismatch there; just to clarify. There was also a question in class from last time about how normalization can be useful and maybe resolve the issues that arise when you have weights that are initialized incorrectly. We had this toy setting where we have 2D inputs to our model and a two-layer neural network with ReLU.
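The 25%-drop / 0.75-scale arithmetic can be checked numerically. A sketch using the convention from the lecture, where p is the fraction dropped (note that most modern implementations instead use "inverted dropout", scaling by 1/(1−p) at training time so that test time needs no scaling at all):

```python
import numpy as np

rng = np.random.default_rng(0)

p_drop = 0.25                     # fraction of activations dropped
acts = rng.random(100_000)        # toy non-negative activations

# Training time: zero out a random 25% of the activations.
keep_mask = rng.random(acts.shape) >= p_drop
train_out = acts * keep_mask

# Test time: keep every unit but scale by the keep probability 0.75,
# so the expected output matches training time.
test_out = acts * (1.0 - p_drop)

# Averaged over many units, the two agree up to sampling noise.
gap = abs(train_out.mean() - test_out.mean())
```

The point of the scaling is exactly this equality of expectations: without the 0.75 factor, every unit's test-time output would be about a third larger than what the next layer saw during training.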
[00:01:41] It's outputting basically this quadrant function: if the point lies in the top right, it'll output one, or two or three or four, depending on which quadrant the point lies in. And we plot the different training losses and test losses for good initialization, using the Kaiming initialization we discussed last time, and bad initialization, where the standard deviation is too high. The blue plot here represents bad initialization, and the green represents bad initialization with layer norm. So you can see it actually does resolve a lot of the issues, but to get the best performance you still need good weight initialization, which is what the two lines afterwards show. So you can go dive in; and also, whether or not layer norm helped depends on the problem. So in this quadrant example, you can imagine that you don't need to know the exact 2D position of each point.
[00:02:26] So layer norm was actually helping, but for some of the other functions that are in the code you can check out, where you need to know the exact coordinate in order to get the right output, layer norm actually hurts performance, because you lose some information about the exact spatial location of your input when you're doing this subtraction of the mean and dividing by the standard deviation. So just some notes here: basically, at a high level, it does help with the issue, but a gap remains, so you can't get past this weight initialization issue with just normalization. And as I mentioned, it may not always make sense, depending on what you're trying to model. So I think, just to recap from last time, we've been mainly talking about these sort of vanilla, standard, non-recurrent neural networks so far. So this is a fixed-size input and a fixed-size output.
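The earlier point about losing exact spatial location can be seen in two lines. This toy applies the normalization directly to a 2D input (the lecture's layer norm sits inside the network, but the information-loss mechanism is the same): subtracting the mean and dividing by the standard deviation maps distinct inputs to the same output.

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    # Normalize one example across its feature dimension.
    return (v - v.mean()) / (v.std() + eps)

# Two points in *different* quadrants of the plane...
a = layer_norm(np.array([1.0, 2.0]))    # quadrant 1
b = layer_norm(np.array([-1.0, 2.0]))   # quadrant 2

# ...normalize to (almost exactly) the same vector [-1, 1]: the
# relative order of the coordinates survives, but the absolute
# positions do not.
```

This is why normalization can hurt on tasks where the exact coordinates carry the answer, while it is harmless (or helpful) when only relative structure matters.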
You have this one-time setup where [00:03:14] you set your activation functions. You [00:03:16] do data pre-processing according to some [00:03:18] fixed mean and standard deviation for [00:03:20] the image channels. You [00:03:22] have your weight initialization and [00:03:25] normalization functions that you use, as [00:03:28] well as transfer learning. So if you [00:03:31] pre-train on one data set like ImageNet [00:03:33] or some other large-scale internet data [00:03:35] set, you can get better results if you [00:03:36] initialize your weights to those values. [00:03:38] We also talked about training dynamics, [00:03:40] how you can babysit the learning process [00:03:42] by choosing a good learning rate, how [00:03:44] you want to update your different [00:03:45] hyperparameters and also how to optimize [00:03:47] those based on the validation [00:03:49] performance, as well as test-time [00:03:51] augmentation to improve performance [00:03:53] further.
So, a really good tool for points two and three here is [00:03:57] something I use in basically all my [00:03:59] projects, called Weights & Biases. So, [00:04:00] you might find this useful. It's a [00:04:02] really neat way that you can [00:04:04] essentially look at different runs you set [00:04:07] with different [00:04:08] hyperparameters. In this case, they [00:04:10] show a dropout column here. So, these [00:04:12] are all the different values of dropout. [00:04:14] The color coding is really nice. So, [00:04:15] you can see that generally the lower [00:04:17] values of dropout will achieve higher [00:04:19] accuracy. And so you can visualize [00:04:22] these different hyperparameters based on [00:04:26] validation set performance, and you [00:04:29] can, based on [00:04:32] many runs, get an idea of which [00:04:34] hyperparameters work best.
So I always use this; I think it's great, especially [00:04:37] if you have the compute where you can [00:04:38] just run something over and over again [00:04:40] to improve performance more. This is a [00:04:41] really neat way of visualizing it. I [00:04:42] think they do it well. There are other [00:04:43] tools like TensorBoard, but this is [00:04:46] personally the one that I like. [00:04:49] Okay. So for the rest of lecture [00:04:52] today, we'll be discussing sequence [00:04:54] modeling. So this is in contrast to a [00:04:57] fixed-size input to our [00:05:00] model. What if we have a sequence of [00:05:01] variable length? And also we'll be [00:05:04] discussing the sort of simple [00:05:06] neural networks that people used before [00:05:08] the era of transformers, which mainly [00:05:10] consist of RNNs and some variants of [00:05:12] RNNs. And then I'll also relate, in one [00:05:15] slide, how RNNs actually are similar to [00:05:18] and inspired a lot of the modern [00:05:21] type of language models that you see [00:05:23] called state space models.
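As a rough illustration of the kind of sweep such a tool visualizes, here is a toy sketch. The `val_accuracy` function is a made-up stand-in for a real training run (its lower-dropout-scores-higher shape just mirrors the pattern described above); in practice you would call `wandb.log` inside each training run rather than collecting a list by hand.

```python
def val_accuracy(dropout):
    # Synthetic stand-in for a real training run: here, lower dropout
    # happens to score higher, mirroring the dropout column described above.
    return 0.9 - 0.3 * dropout

# Try several dropout values and record (hyperparameter, score) pairs --
# this is the table a tool like Weights & Biases would plot and color-code.
runs = [{"dropout": d, "val_acc": val_accuracy(d)}
        for d in [0.0, 0.1, 0.3, 0.5, 0.7]]

# Pick the configuration with the best validation performance.
best = max(runs, key=lambda r: r["val_acc"])
```
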
So you might have heard of Mamba; there are some other [00:05:26] ones too that we'll talk about in the [00:05:28] slide, but the basic ideas, the key [00:05:30] concepts from RNNs, are still being used [00:05:32] today. They're not just a thing of the past, [00:05:34] and they have a lot of nice advantages [00:05:36] over transformers that we'll go into. [00:05:39] Cool. So to specifically formulate this [00:05:42] sequence modeling task, you can imagine [00:05:44] we have a vanilla neural network where [00:05:46] we have one fixed-size input to one [00:05:48] fixed-size output, which is what we've [00:05:50] discussed in the course so far. In [00:05:52] contrast, you could have a one-to-many [00:05:54] sequence modeling task. So here we [00:05:57] still have a fixed-size input, like say [00:05:58] an image, but we want to output a [00:06:00] sequence of variable length. So one [00:06:02] common example is image captioning.
So we input an image and we want to output [00:06:06] a sequence of words or characters or [00:06:08] however you're modeling the language [00:06:11] or encoding it, but the goal is to have a [00:06:13] variable-length caption output for [00:06:15] what's happening in the image. You could [00:06:18] also have a many-to-one sequence [00:06:20] modeling task. So here we could imagine [00:06:22] our inputs are, say, a video, and we're [00:06:25] trying to classify what this is a video [00:06:27] of. So we give it a sequence of video [00:06:30] frames and the output is one single [00:06:32] class label, similar to the image [00:06:34] classification case, but now we have [00:06:35] multiple frames as input rather than [00:06:37] just a single image. So this is an [00:06:39] example of many-to-one. Then you also [00:06:41] have many-to-many. So the number of [00:06:46] inputs and outputs in the sequences [00:06:48] don't need to match.
So your input could be a variable number of [00:06:52] frames and your output in this case [00:06:54] could be a caption of variable length, [00:06:56] and they don't necessarily need to [00:06:57] match. But they could match. So you [00:06:59] could have, for every single input, [00:07:01] one output. And for discussing [00:07:03] RNNs, we'll mainly be focusing on this [00:07:05] setting on the far right, but there are [00:07:06] basically a lot of small changes you can [00:07:09] make to reformulate the [00:07:11] problem to apply to the other settings. [00:07:13] But this is sort of the most [00:07:14] straightforward one: every time there's [00:07:15] an input, there's an output. And we'll be [00:07:17] using it for the beginning of class to [00:07:19] talk about how RNNs work, [00:07:22] and a canonical example problem here [00:07:24] would be video classification where [00:07:25] you're classifying every single frame. [00:07:28] Okay, so what is an RNN? The basic [00:07:31] idea is you have an input sequence X and [00:07:35] an output sequence Y.
And what makes an RNN an RNN is this recurrent nature. So [00:07:41] often people will diagram it by this [00:07:43] sort of arrow that's feeding back into [00:07:45] the block. This is how you know it's [00:07:47] sort of like a recurrent layer when [00:07:48] you're reading different diagrams. But [00:07:50] what it actually means is that RNNs [00:07:52] have this internal state, or hidden [00:07:55] state as it's often called, that is [00:07:57] updated as a sequence is processed. So [00:07:59] every time there's a new input to the [00:08:01] model, we process that and we calculate [00:08:03] a new hidden state or internal state. So [00:08:06] there's a hidden state; it updates, and [00:08:07] it depends on the new inputs as well as [00:08:10] the previous internal or hidden state. [00:08:13] I think this diagram is sometimes a [00:08:16] bit confusing when you're trying to [00:08:17] think about how the gradients are [00:08:18] actually calculated and what the [00:08:19] order of operations is. So people will [00:08:21] often do this diagram of an unrolled [00:08:25] RNN.
And so here it's basically the same as before, but we're explicitly showing [00:08:30] that the current hidden state [00:08:32] calculation is dependent on our input at [00:08:34] that time step as well as the previous [00:08:37] RNN state. So we're more explicitly [00:08:40] modeling what is exactly needed to [00:08:42] calculate each output and each RNN state, and you [00:08:44] move backwards in the computational [00:08:46] graph. [00:08:48] So I've been speaking in words so far. [00:08:51] So let's formulate this with [00:08:52] mathematical equations now. So the basic [00:08:54] idea is we're trying to process the [00:08:58] sequence of vectors x, and we're applying [00:09:00] this recurrence formula at every single [00:09:02] time step. So we have our new hidden [00:09:05] state as a function of the old hidden [00:09:08] state and the input vector at some time [00:09:10] step, and we have a function, [00:09:13] normally with an activation function, [00:09:14] along with some parameters W.
So you can think of this as very similar to [00:09:19] the sort of initial neural network [00:09:21] layers we were learning, where it's a [00:09:22] weight matrix multiply, and then [00:09:25] you follow it up with an activation [00:09:27] function. This is the same thing here. [00:09:29] The only change is that it's now a [00:09:30] recurrence formula. So we're using [00:09:34] the same set of Ws and the same [00:09:38] activation function each time we're [00:09:40] computing the hidden state. [00:09:43] So basically, as I mentioned, this [00:09:46] is a recurrence formula. And to get [00:09:49] the actual output, how do we calculate [00:09:50] this blue block? We have a separate [00:09:53] function that depends on a separate set [00:09:56] of parameters that converts our [00:09:59] hidden-dimension state into the dimension of [00:10:03] our output, and it also is a set of [00:10:05] weights to convert the hidden state to [00:10:06] the output. So this does two things in one.
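Written out, the two functions described above are usually given as follows (using the conventional names W_hh, W_xh, and W_hy for the three weight matrices, and σ for the activation function; this is the standard vanilla-RNN notation rather than something stated verbatim here):

```latex
h_t = f_W(h_{t-1}, x_t) = \sigma\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right),
\qquad
y_t = W_{hy}\, h_t
```

Here the same W_hh and W_xh are reused at every time step of the recurrence, and the separate matrix W_hy maps each hidden state to an output.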
It sort of changes the dimension of our [00:10:10] vectors from the dimension size of our [00:10:11] hidden state, which can be whatever we [00:10:12] want, to the dimension size of our output, [00:10:15] and then also it provides a [00:10:16] transformation there. So, W_hy is a [00:10:20] weight matrix that you will multiply by [00:10:22] your hidden state to get the output. So it does [00:10:25] two things. It converts your hidden [00:10:26] state to the dimension of your output. [00:10:28] So, your hidden state and output could [00:10:29] be different dimensions. And then also [00:10:31] it's a weight matrix [00:10:34] that you learn. So, not only does it do [00:10:36] this dimension change, but also it [00:10:37] applies a transformation to your hidden [00:10:40] state. So, it's how you convert your [00:10:41] hidden states to your outputs. That's what W_hy [00:10:44] is. So the previous slide was how we [00:10:46] calculate the new hidden state.
So it's essentially the same idea, [00:10:51] where you're doing this recursively [00:10:53] with the same set of parameters, but we [00:10:54] have one set of parameters and one [00:10:56] function for calculating the hidden [00:10:57] state. We have another set of parameters [00:10:59] and another function for calculating the [00:11:00] output, depending on what type of task it [00:11:02] is and how we want to model the RNN. [00:11:06] Yeah. So they still share the same [00:11:08] weights for each time step. But [00:11:10] there are two different things here. One [00:11:11] is to calculate, and [00:11:13] maybe it'll be more clear as we go [00:11:14] through more concrete examples, but [00:11:16] how do you actually calculate the new [00:11:18] hidden state, which is this internal [00:11:20] state of the RNN, and then how do you [00:11:21] convert that hidden state to the output, [00:11:23] which is this slide. [00:11:26] Okay. So looking through this [00:11:28] unrolled diagram here, we can see [00:11:32] that you sort of need to initialize your [00:11:35] hidden state to some value.
So we usually call this h0, and you can [00:11:40] initialize it to whatever you want in [00:11:42] principle; usually this is a learned [00:11:45] input vector. But now we'll [00:11:48] specifically go into each step of this [00:11:51] unrolled RNN and actually [00:11:53] go through a concrete example for what [00:11:55] it looks like when you're doing the [00:11:56] forward pass. [00:11:58] So one thing to note, that already [00:12:00] came up with some of the questions, is [00:12:02] that we're processing the sequence of [00:12:04] vectors x and we're applying this [00:12:06] recurrence formula at each time step. So [00:12:09] really do notice how the same [00:12:11] function and the same set of parameters [00:12:12] are used at every time step when [00:12:14] computing the hidden state, and a [00:12:16] separate function and a separate set of [00:12:17] parameters are always used at each time [00:12:19] step when predicting the output from the [00:12:21] hidden state. Yes. So can old values of [00:12:24] y affect the new hidden state?
Under some formulations, yes. And we'll [00:12:27] actually go through one example of why [00:12:29] that's used. It's most commonly used if [00:12:30] you want to predict the next value, like if [00:12:34] you're doing a language modeling or [00:12:36] autoregressive modeling task where you're [00:12:37] trying to predict one value given the [00:12:39] previous values. People will just use [00:12:41] the previous values as the input. So [00:12:43] that's generally how people do that [00:12:45] explicit formulation of how y can [00:12:47] affect the next hidden state. What is [00:12:49] the difference between h and x at the [00:12:51] first time step? So they use [00:12:54] basically different weights. So [00:12:58] h0 is using the [00:13:01] weights that are used to update every [00:13:03] hidden state to the next one. Whereas [00:13:06] we'll go through exactly what the [00:13:07] weights look like, but basically [00:13:09] they're using different weights, is the [00:13:10] short answer.
Okay, so when people say vanilla RNN, they usually are almost [00:13:15] exactly referring to this type of model, [00:13:18] where we have our hidden state h_t, [00:13:21] which uses tanh, or hyperbolic [00:13:24] tangent, as an activation function. This [00:13:26] is nice because it's bounded between one [00:13:28] and negative one. So as you do the [00:13:31] operation over and over again, your [00:13:33] values will stay within this range. [00:13:35] So this is a nice property to have. It's [00:13:37] also zero-centered, and you can represent [00:13:38] both positive and negative values. This [00:13:40] is why people use tanh. Also, we [00:13:45] sometimes have an output function f_y [00:13:48] here, but in the simplest case your [00:13:50] output y_t could just be a matrix [00:13:51] multiply with your hidden state. [00:13:53] So this is really the most simple [00:13:55] formulation of an RNN. And what we'll [00:13:58] specifically go through in our concrete example [00:14:00] today in lecture is this idea of just [00:14:04] manually creating a recurrent [00:14:06] neural network.
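A minimal NumPy sketch of this vanilla RNN forward pass (the dimension sizes and the Gaussian weight scale here are arbitrary illustration choices, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h, D_out = 4, 3, 2           # input, hidden, output sizes (arbitrary)

# One set of weights, reused at every time step, as emphasized above.
W_xh = rng.normal(size=(D_h, D_in)) * 0.1   # input -> hidden
W_hh = rng.normal(size=(D_h, D_h)) * 0.1    # hidden -> hidden (recurrence)
W_hy = rng.normal(size=(D_out, D_h)) * 0.1  # hidden -> output (readout)

def rnn_forward(xs, h0):
    """Process a sequence of input vectors, emitting one output per step."""
    h, ys = h0, []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)   # recurrence: new hidden state
        ys.append(W_hy @ h)                # simplest readout: matrix multiply
    return ys, h

xs = [rng.normal(size=D_in) for _ in range(5)]
ys, h_final = rnn_forward(xs, np.zeros(D_h))
```

Because tanh is bounded, every entry of the hidden state stays in (-1, 1) no matter how long the sequence is, which is the stability property mentioned above.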
So, we're not going to [00:14:07] learn this through gradient descent or [00:14:09] all these different methods. I'm just [00:14:11] going to show you how you [00:14:13] could construct one by hand, and we'll go [00:14:15] through it and you'll understand the [00:14:17] forward pass, what each of the different [00:14:19] weight matrices is doing, as well as how [00:14:21] the output is calculated. [00:14:23] So, in this really toy example, because [00:14:25] it needs to be pretty simple if we're [00:14:26] just going to be going through all [00:14:27] the different weights, you're given a [00:14:29] sequence of zeros and ones, and [00:14:32] your goal is to output a one when [00:14:34] there are two repeated ones in a row. So [00:14:36] you're basically detecting repeated [00:14:38] ones, and you'll output a zero [00:14:40] otherwise. So you can see this input [00:14:42] sequence coming in: 0 1 0 1. So far [00:14:45] there have been no repeated ones. But now [00:14:47] we have a repeated one.
Then we have another repeated one, because there are two [00:14:50] in a row here, and so on. So this is the [00:14:53] type of model we're building; it's [00:14:54] trying to do this task. This is [00:14:56] specifically the many-to-many sequence [00:14:58] modeling task, where we have one output [00:14:59] for every input. [00:15:02] And so we've kind of been talking [00:15:03] at a high level so far, but if you're trying [00:15:05] to create an RNN to do this, what [00:15:07] information should be captured in the [00:15:10] hidden state? So you have this internal [00:15:12] state of your model; what information [00:15:14] needs to be captured there in order to [00:15:15] do this task? [00:15:17] Yeah. So the input at the previous time [00:15:19] step. And if our output is only dependent [00:15:21] on the hidden state, what else do we [00:15:23] need to know? And the current value, yeah. [00:15:24] Yeah, exactly. So this is the [00:15:27] information that we need to capture in [00:15:28] our hidden state: the previous input [00:15:31] and the current value for x, so 0 or [00:15:33] 1.
And the way I'll do this is I'll just set the hidden state h_t to be a [00:15:37] three-dimensional vector. The reason why [00:15:39] it's three is this one will come in [00:15:40] handy when we're trying to do the [00:15:42] output-stage calculation, but you could [00:15:45] probably construct one without a one [00:15:47] here. This is just to make the math [00:15:49] easy and simple for the purposes [00:15:51] of the lecture today. And the other [00:15:52] information is the current value. So [00:15:54] this will either be zero or one, along [00:15:56] with the previous value, 0 or 1, and [00:15:58] we'll initialize it to be 0 0 1, so that [00:16:01] we're basically assuming it's [00:16:03] seen two zeros in a row before [00:16:04] this point. Yeah, so these are [00:16:07] the type of [00:16:11] variables we're trying to track in our [00:16:13] hidden state, and this is how we'll [00:16:14] initialize h0. So I talked about how you [00:16:15] can initialize it with various different [00:16:17] strategies, or you could learn it. This [00:16:19] is what we'll initialize it to.
Okay, [00:16:22] now let's walk through the code, and I'll do it step by [00:16:25] step. So, I'm just putting it on screen [00:16:27] here right now. [00:16:30] I guess, sorry, one other thing I [00:16:32] missed on this slide is that we're [00:16:33] setting our activation function to be [00:16:34] ReLU, just to make the math easy. So, [00:16:36] it'll just be the max of zero or whatever [00:16:39] the value is. We're only dealing [00:16:40] essentially with zeros and ones in this [00:16:41] case, so it makes it pretty simple to [00:16:43] think about. Yeah, you probably could [00:16:46] construct it so that it works with tanh, [00:16:48] but this is just something that I [00:16:49] created as an example for how to run it. [00:16:52] And so, just to make the math really [00:16:53] easy, we'll just do ReLU. But yeah, [00:16:55] you could conceivably make a model that [00:16:57] could do this with tanh. Yeah, [00:17:02] cool. So we have ReLU. We have two [00:17:05] specific weights here.
We have the first weight which um converts our uh previous [00:17:12] weight which um converts our uh previous hidden state. Uh it it applies a [00:17:15] hidden state. Uh it it applies a transformation to the previous hidden [00:17:16] transformation to the previous hidden state onto the sort of to calculate the [00:17:19] state onto the sort of to calculate the next one. And then we have this weight [00:17:21] next one. And then we have this weight here which um converts our input x to [00:17:25] here which um converts our input x to the dimension of our hidden state as [00:17:26] the dimension of our hidden state as well as applies a transformation. So we [00:17:29] well as applies a transformation. So we are setting this second one. So our our [00:17:31] are setting this second one. So our our our current hidden state is a function [00:17:32] our current hidden state is a function of the previous hidden state as long [00:17:35] of the previous hidden state as long along with the current time step. And so [00:17:37] along with the current time step. And so when we're trying to calculate this uh [00:17:40] when we're trying to calculate this uh hidden state at time step t, we're [00:17:43] hidden state at time step t, we're looking to calculate this current value [00:17:46] looking to calculate this current value first. So we'll use the x value here. [00:17:49] first. So we'll use the x value here. 
We'll set the weight to be a 3x one [00:17:52] We'll set the weight to be a 3x one column vector um with values 1 0 0 such [00:17:56] column vector um with values 1 0 0 such that when x is zero and we do the matrix [00:18:00] that when x is zero and we do the matrix multiply we get 0 vector and when x is 1 [00:18:04] multiply we get 0 vector and when x is 1 we'll get 1 0 0 and we'll add this to [00:18:07] we'll get 1 0 0 and we'll add this to another term but basically this is going [00:18:08] another term but basically this is going to be calculating what is the current [00:18:10] to be calculating what is the current value here. So it'll be either uh zero [00:18:13] value here. So it'll be either uh zero on top or a one on top and it's [00:18:15] on top or a one on top and it's calculated based on this first operation [00:18:18] calculated based on this first operation here. [00:18:20] here. Okay. Um, so that's how we're [00:18:22] Okay. Um, so that's how we're calculating the current value based on [00:18:25] calculating the current value based on the uh input. Now we'll talk about uh [00:18:29] the uh input. Now we'll talk about uh you know how are we doing this hidden [00:18:31] you know how are we doing this hidden state transformation. So we want to just [00:18:33] state transformation. So we want to just use the current value for this top value [00:18:35] use the current value for this top value here. So in our weight matrix we'll just [00:18:37] here. So in our weight matrix we'll just have zeros in the top row. This means [00:18:39] have zeros in the top row. This means that when we multiply it with the [00:18:40] that when we multiply it with the previous hidden state we'll get a zero [00:18:42] previous hidden state we'll get a zero value here for the top. So it'll be 0 [00:18:44] value here for the top. So it'll be 0 plus whatever value the right hand side [00:18:46] plus whatever value the right hand side contains. 
So that's how we're going to [00:18:48] contains. So that's how we're going to maintain this not changing based on the [00:18:50] maintain this not changing based on the previous hidden state. And we'll set it [00:18:52] previous hidden state. And we'll set it to be 1 0 0 for the next row. Why we do [00:18:56] to be 1 0 0 for the next row. Why we do this is you can imagine we have the [00:18:58] this is you can imagine we have the hidden state from the previous time step [00:19:00] hidden state from the previous time step here. And we want to set the uh now [00:19:04] here. And we want to set the uh now previous to be the former current time [00:19:06] previous to be the former current time step. So we have a 1 0 0. What this will [00:19:08] step. So we have a 1 0 0. What this will do is it'll multiply by htus one. We'll [00:19:11] do is it'll multiply by htus one. We'll set the current value over to now the [00:19:14] set the current value over to now the previous value for this time uh step. So [00:19:17] previous value for this time uh step. So basically this term will be a zero on [00:19:19] basically this term will be a zero on top and it will be whatever the previous [00:19:21] top and it will be whatever the previous time step uh input value was as the [00:19:24] time step uh input value was as the second term and then this final bit here [00:19:26] second term and then this final bit here just maintains the one so that we're [00:19:28] just maintains the one so that we're keeping this one across all [00:19:29] keeping this one across all calculations. Um so just to recap we [00:19:32] calculations. Um so just to recap we have zeros here because we want the [00:19:34] have zeros here because we want the right hand side term to be tracking this [00:19:37] right hand side term to be tracking this uh current value. We have a one here to [00:19:40] uh current value. 
We have a one here to copy over the current from the former [00:19:42] copy over the current from the former time step to be the previous uh sorry [00:19:44] time step to be the previous uh sorry the the to to copy the current of the [00:19:46] the the to to copy the current of the former time step to be the previous of [00:19:48] former time step to be the previous of the current time step. Uh so we're just [00:19:50] the current time step. Uh so we're just doing you know h uh maybe it's easy in [00:19:53] doing you know h uh maybe it's easy in the code but uh you know ht previous is [00:19:56] the code but uh you know ht previous is equal to ht then we want to also move [00:19:58] equal to ht then we want to also move the corresponding value down one here [00:20:01] the corresponding value down one here and then this is just a copy of the one. [00:20:03] and then this is just a copy of the one. Um so how do we actually get our output [00:20:05] Um so how do we actually get our output now? So we tal we basically talked about [00:20:07] now? So we tal we basically talked about how we can track these values given the [00:20:10] how we can track these values given the weight matrices I talked about. So whhh [00:20:13] weight matrices I talked about. So whhh and w xh. So if we have a weight matrix [00:20:16] and w xh. So if we have a weight matrix to convert our hidden state into the [00:20:19] to convert our hidden state into the output dimension we want it to be uh [00:20:22] output dimension we want it to be uh 1x3. So it's uh single value that's [00:20:25] 1x3. So it's uh single value that's being output when we have this hidden [00:20:26] being output when we have this hidden dimension as input. And this is sort of [00:20:29] dimension as input. And this is sort of like a dotproduct between the values [00:20:31] like a dotproduct between the values here and the values here. So what this [00:20:33] here and the values here. 
So what this will correspond to is the current plus [00:20:36] will correspond to is the current plus the previous minus1 minus one because we [00:20:39] the previous minus1 minus one because we multiply the minus1 here. This is where [00:20:40] multiply the minus1 here. This is where the one became useful and uh the current [00:20:45] the one became useful and uh the current associated here with a one and then also [00:20:47] associated here with a one and then also the the previous associated here with a [00:20:49] the the previous associated here with a one as well. Um so that's how we [00:20:52] one as well. Um so that's how we actually do it. And if you if you think [00:20:53] actually do it. And if you if you think about it um this general formula will [00:20:57] about it um this general formula will work. So uh if we have say we're looking [00:21:00] work. So uh if we have say we're looking here we have the current plus the [00:21:02] here we have the current plus the previous is 2 minus one is one for this [00:21:05] previous is 2 minus one is one for this uh left hand term inside the ru so the [00:21:07] uh left hand term inside the ru so the max of one and 0 is one and if these are [00:21:10] max of one and 0 is one and if these are both zero you'll have a minus one so [00:21:12] both zero you'll have a minus one so we'll get zero these are a one and a [00:21:14] we'll get zero these are a one and a zero then you'll still get zero so these [00:21:17] zero then you'll still get zero so these are how you can construct these weight [00:21:19] are how you can construct these weight matrices but I actually wanted to pause [00:21:21] matrices but I actually wanted to pause briefly um and talk about if there were [00:21:24] briefly um and talk about if there were any questions about any step among this [00:21:26] any questions about any step among this calculation [00:21:27] calculation because this is the only example we'll [00:21:30] because this is the only 
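As a sanity check, the whole toy construction above fits in a few lines of numpy. The weight values are exactly the ones just described; the variable names (including calling the output weight W_hy) are my own labels, not from the slide:

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

# Hidden state layout: [current value, previous value, constant 1]
h = np.array([0.0, 0.0, 1.0])           # h0: as if two zeros were seen

W_xh = np.array([[1.0], [0.0], [0.0]])  # puts the input bit in the top slot
W_hh = np.array([[0.0, 0.0, 0.0],       # top row zeros: current comes only from x
                 [1.0, 0.0, 0.0],       # copy last step's current -> previous
                 [0.0, 0.0, 1.0]])      # carry the constant 1 along
W_hy = np.array([[1.0, 1.0, -1.0]])     # output = current + previous - 1

ys = []
for x in [1, 1, 0, 1]:
    h = relu(W_hh @ h + W_xh @ np.array([float(x)]))
    y = relu(W_hy @ h)                  # fires only when both bits are 1
    ys.append(int(y[0]))

print(ys)  # [0, 1, 0, 0]
```

The output is 1 exactly at the step where the current and previous input bits are both 1, matching the hand calculation above.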
[00:21:30] This is the only example we'll go through in class where we're literally doing all the matrix and vector multiplications; the rest will be higher-level explanations of how people tend to put these layers together. So I just want to pause and see if there are questions about how the matrices and vectors are tracked, multiplied, and updated.

[00:21:48] Yeah — so the question is how you go about constructing the weight matrices, which is a really great question, and I thought to put it on the slide here. How would you actually do this? The same way we always find the weight matrices in this class: gradient descent. We'll talk about how you do gradient descent when you have multiple time steps, and maybe losses computed at each time step as well. That'll be a lot of what we go into next, so it's a great question and very relevant to the lecture.

[00:22:16] This is just an example so you can see how all of the weight matrices are multiplied. If you were to initialize with these weights and then train on another task, that would be something like transfer learning. But in practice I don't think it would work very well at all, because this hidden state is really small and people normally use much larger hidden states; I just wanted something I could fit on the slide.

[00:22:44] Okay, I'll go over the second row again. Imagine h_{t-1} as a column vector here. When you do the matrix multiply to get the left-hand term, the second row is, in effect, rotated and dotted with that vector, so the second entry of the result equals the top entry of h_{t-1}. That is the step that moves the current value down into the previous slot: the end result of this matrix multiply is that the second value is the current value from time t-1. And note that this operation and this operation both give vectors the size of our hidden state, so we add them together.

[00:23:44] Yeah — the left term does the carryover of the previous value, and the right term handles the current input. That's also how it works for RNNs beyond this toy example: one weight matrix is multiplied by the current input, and the other is multiplied by the previous hidden state. So that's what these weight matrices represent more generally, not just in this specific problem.
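Beyond the toy example, that generic step — one matrix for the previous hidden state, one for the current input, the same weights reused at every step — is commonly written with a tanh nonlinearity. A minimal sketch (the sizes and random weights here are made up for illustration):

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh):
    """One vanilla RNN step: carry over the previous hidden state,
    mix in the current input, squash with tanh."""
    return np.tanh(W_hh @ h_prev + W_xh @ x)

rng = np.random.default_rng(0)
H, D = 4, 2                              # hidden size, input size (arbitrary)
W_hh = 0.1 * rng.normal(size=(H, H))
W_xh = 0.1 * rng.normal(size=(H, D))

h = np.zeros(H)                          # one of several init strategies
for x in rng.normal(size=(10, D)):       # a sequence of 10 input vectors
    h = rnn_step(h, x, W_hh, W_xh)       # same W_hh, W_xh at every step
```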
[00:24:06] Okay, so how do you actually compute the gradients? Let's look at the computational graph, drawn a little more explicitly than before. We have x1 coming in, then x2 — a whole sequence of x's. We're calculating a hidden state at each time step, and we're specifically using the same weight matrices for each of these calculations, which we need to keep in mind when we think about how the gradients are computed.

[00:24:35] Let's start with the many-to-many scenario, where we have an output for each input. In this scenario you can often also calculate a loss for each output — how correct the output is at that stage. So in this setting you have a loss at each step, and you can sum them all together to get your total loss across the entire input sequence.

[00:24:59] When we do backprop, once we compute this final loss we can work with the loss per time step as well, depending on the formulation: if you're calculating a loss per time step you can treat them independently, and sometimes you have an overall loss built from the per-step losses. We can also get the final gradients for each of these W's: you calculate the gradient for each time step separately, and then you sum them all together. That's how it works in practice. If there were different W's at each time step, you could probably see how the computational graph would be structured so that you calculate a different gradient for each of them. So for computational purposes we're essentially treating our single W as a set of different W's, but at the end we merge all the gradients together, because it's really the same weight matrix being multiplied each time. Conceptually, you calculate the gradient for each time step — almost treating it in your head as if a different W were used — and then, because the weights are shared, you just sum all the per-step gradients together.

[00:26:17] In the many-to-one scenario, you'll just have a single loss calculated here. Sometimes you'll only use the final hidden state to calculate the value, depending on the problem setting. For something like video classification, it may make sense to use the hidden state from every step, since there can be relevant information throughout the entire course of the video; you can do some pooling, like average pooling or max pooling, to compute your y value.
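That "treat the shared W as separate copies, then sum the per-step gradients" rule can be checked numerically on a stripped-down recurrence. Everything here is a simplification of mine — a scalar weight, no activation, and a loss equal to the sum of the hidden states — chosen so the backward pass fits in a few lines:

```python
import numpy as np

def forward(w, xs, h0=0.0):
    """Scalar linear recurrence h_t = w*h_{t-1} + x_t; loss = sum of h_t."""
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs, sum(hs[1:])

w, xs = 0.7, [1.0, -2.0, 0.5, 3.0]
hs, loss = forward(w, xs)

# Backward pass: at each step add that step's contribution dL/dh_t * h_{t-1}
# to a single shared gradient for w, then push dL/dh through the recurrence.
grad_w, grad_h = 0.0, 0.0
for t in range(len(xs), 0, -1):
    grad_h += 1.0                 # h_t appears directly in the loss
    grad_w += grad_h * hs[t - 1]  # this time step's share of dL/dw
    grad_h *= w                   # flow into h_{t-1} via h_t = w*h_{t-1} + x_t

# The summed gradient matches a numerical derivative of the unrolled loss.
eps = 1e-6
num = (forward(w + eps, xs)[1] - forward(w - eps, xs)[1]) / (2 * eps)
print(abs(grad_w - num) < 1e-6)  # True
```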
[00:26:49] Then, if you have this kind of mapping — as in image captioning — there was a question about how you could incorporate the previous y's. You still need an input to your f_W, because there are essentially two of these weight matrices: one expecting the input vector x and the other expecting the previous time step's hidden state. You could imagine putting a lot of different values in there: you could just put zeros, or you could put the previous output.

[00:27:26] Okay, so I explained at a high level how you do the backpropagation. But there are some very practical issues you'll run into with this conceptual framework: running out of GPU memory, which is the cause of basically all the issues when you're trying to train a neural network, and, I guess, NaN losses during training. When you're computing, say, a loss at each time step and you have an extremely long input sequence, it's really easy to see the problem: you need to keep the activations and the gradients for each time step in memory and then sum them all together, and this gets extremely large as your input sequence grows. So what can you do practically to resolve this issue?

[00:28:12] By the way, this is called backpropagation through time: the same weight matrix is applied at multiple time steps, and you sum the gradient from each time step together. What you can do is truncated backpropagation through time. You fix a time window and pretend that this window is all the model has been trained on so far. We start with our h0; from the input at time step one and the previous h value we calculate the hidden state h1, then use that to calculate our output and its loss, and we run this for each of our examples. You can imagine how in this setting it's relatively easy to treat the beginning of the sequence as if it were all we were seeing during training.

[00:29:02] Moving to the next block, you now start your h0 as the output of the previous block's final step. So we're initializing the hidden state with whatever that output was, but the gradients no longer carry over: we're basically batching the computational graph so that we only look at the loss in a neighborhood of time steps at a time, with a fixed window size that you set. That's how you get around this relatively common issue, especially with really long input sequences. You're batching it out, and you can just keep doing this for the entire input sequence.
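The chunking scheme just described can be sketched with the same kind of scalar stand-in (the window size, learning rate, and input sequence are invented for illustration). The key move is that the hidden state's value crosses the chunk boundary while its gradient does not:

```python
import numpy as np

def run_chunk(w, h0, xs):
    """Forward one window of h_t = w*h_{t-1} + x_t with loss = sum of h_t,
    then backprop only inside the window (gradients stop at h0)."""
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    loss = sum(hs[1:])
    grad_w, grad_h = 0.0, 0.0
    for t in range(len(xs), 0, -1):
        grad_h += 1.0
        grad_w += grad_h * hs[t - 1]
        grad_h *= w
    return hs[-1], loss, grad_w

w, lr = 0.5, 1e-3
seq = list(np.sin(0.1 * np.arange(100)))  # one long input sequence
window = 10                               # fixed truncation window

h = 0.0
for start in range(0, len(seq), window):
    # carry the value of h into the next chunk, but not its gradient
    h, loss, grad_w = run_chunk(w, h, seq[start:start + window])
    w -= lr * grad_w                      # update after each window
```

In autograd frameworks this same pattern shows up as detaching the carried hidden state from the graph at each window boundary.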
Um so you can still calculate the gradients at each [00:29:53] still calculate the gradients at each time step but you will no longer have [00:29:55] time step but you will no longer have this um uh loss that's uh dependent on [00:30:00] this um uh loss that's uh dependent on the time step uh itself uh the output of [00:30:03] the time step uh itself uh the output of the time step itself rather you'll be [00:30:04] the time step itself rather you'll be relying on upstream gradients. So you [00:30:06] relying on upstream gradients. So you can imagine we're looking at the far [00:30:08] can imagine we're looking at the far right of the diagram here and we have [00:30:10] right of the diagram here and we have our loss that we calculate based on the [00:30:12] our loss that we calculate based on the output at the final time step. um we can [00:30:14] output at the final time step. um we can calculate what is the gradient uh with [00:30:16] calculate what is the gradient uh with respect to our current uh hidden state [00:30:18] respect to our current uh hidden state at the end. And then we have our whh [00:30:22] at the end. And then we have our whh matrix to help us understand how did the [00:30:25] matrix to help us understand how did the uh how did the previous hidden state [00:30:27] uh how did the previous hidden state contribute to the uh final hidden state [00:30:30] contribute to the uh final hidden state and we can use that to uh calculate the [00:30:33] and we can use that to uh calculate the gradient and understanding based on the [00:30:36] gradient and understanding based on the previous hidden state and the weight [00:30:37] previous hidden state and the weight matrix. how can we change this [00:30:39] matrix. 
how can we change this transformation matrix whh such that um [00:30:42] transformation matrix whh such that um we would be changing our loss and uh [00:30:45] we would be changing our loss and uh then you can just you basically just [00:30:47] then you can just you basically just applying the gradient rule to whh over [00:30:50] applying the gradient rule to whh over and over again here and you're only [00:30:51] and over again here and you're only looking at how the hidden state changed [00:30:52] looking at how the hidden state changed the next hidden state and how that [00:30:54] the next hidden state and how that contributed to the loss. So you look at [00:30:57] contributed to the loss. So you look at the final example here. This tells you [00:30:59] the final example here. This tells you how changing the hidden state depends on [00:31:01] how changing the hidden state depends on loss and then you know how the previous [00:31:03] loss and then you know how the previous hidden states how they change how that [00:31:05] hidden states how they change how that affected uh the current hidden state [00:31:07] affected uh the current hidden state which is given by this whh matrix. So [00:31:10] which is given by this whh matrix. So using the W's at each time like using [00:31:12] using the W's at each time like using different W's at each time step would [00:31:14] different W's at each time step would essentially mean that you're um no [00:31:16] essentially mean that you're um no longer modeling it as a recurrence [00:31:18] longer modeling it as a recurrence relation. So basically you have uh you [00:31:20] relation. So basically you have uh you can think of it as one layer for each [00:31:22] can think of it as one layer for each different possible time step. Um so you [00:31:26] different possible time step. 
[00:31:28] So you would probably see worse performance, because you're no longer modeling it as a sequence recursively. Imagine you train a neural network where you have a series of inputs and each one goes through its own separate weight, independently. That would make sense for a problem that isn't a sequence modeling problem, where you just have a set of things you want to classify. You would also need to know the length of the sequence ahead of time. So I think it could work if it's not a sequence, but for sequences of variable length I think it would not work very well; it's sort of like you're training one neural network for each time step, which is not the right way to formulate it.
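The chain-rule walk described above can be made concrete with a small sketch. This is not the course's code; it is a minimal NumPy illustration (all sizes and names are made up) of backpropagation through time for a vanilla RNN whose loss depends only on the final hidden state, showing how gradient contributions to the shared W_hh accumulate at every time step.

```python
import numpy as np

# Illustrative sketch of BPTT for a vanilla RNN; loss is on the final hidden state only.
rng = np.random.default_rng(0)
D, H, T = 3, 4, 5                      # input dim, hidden dim, number of time steps
Wxh = rng.normal(0, 0.1, (H, D))       # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))       # hidden-to-hidden weights (shared across time)
xs = rng.normal(0, 1.0, (T, D))        # a toy input sequence
hs = [np.zeros(H)]                     # h_0 = 0
for t in range(T):                     # forward: h_t = tanh(Wxh x_t + Whh h_{t-1})
    hs.append(np.tanh(Wxh @ xs[t] + Whh @ hs[-1]))

target = np.ones(H)
loss = 0.5 * np.sum((hs[-1] - target) ** 2)   # toy loss depends only on the final h_T

dh = hs[-1] - target                   # dLoss/dh_T: the "upstream gradient"
dWhh = np.zeros_like(Whh)
dWxh = np.zeros_like(Wxh)
for t in reversed(range(T)):           # walk backwards, reusing the same Whh each step
    draw = dh * (1.0 - hs[t + 1] ** 2) # backprop through tanh
    dWhh += np.outer(draw, hs[t])      # the shared matrix collects a term at every step
    dWxh += np.outer(draw, xs[t])
    dh = Whh.T @ draw                  # pass the gradient to the previous hidden state
```

A numerical finite-difference check on any single entry of dWhh agrees with the accumulated value, which is a quick way to convince yourself that applying the chain rule to the same W_hh over and over is right.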
[00:32:16] So how does this work with chunking? We understand that up to this point, at the point right here with the red dot, we can calculate the gradient of the loss with respect to our final hidden state. If we can do that, then we can calculate the gradient of our loss with respect to the second-to-last hidden state, because we know the final hidden state depends on the previous hidden state times this weight matrix W. We can keep going backwards until here, and at that point all we need to save is this final step here.
[00:32:54] So what is the gradient of the loss with respect to this initial hidden state of our truncated batch? (Maybe I'm overusing the word final.) When we're calculating backwards, we just use that value to calculate all the previous time steps. So that's the overall process: you're only looking at how the hidden state transforms to form the new hidden state, and that's the only value getting updated here. Oh, and also how the input changes the hidden state, so you're really looking at two values: how the input affects the next hidden state, and how the previous hidden state does. So the learning still occurs for all the batches.
[00:33:47] You have your loss with respect to each of your parameters in W here, and when you're calculating it for the previous time step, you basically keep this one value: if you change the initial hidden state here, how does that change the loss? You can calculate that, and then you can see how all the variables feeding into it, namely the original hidden state and the current input, affect it. But when you're actually moving to the next chunk over, you only need to look at how this hidden state here affects the hidden state in the next chunk. So you're looking at this division boundary, and the one variable you need to carry over is the gradient of the loss with respect to the hidden state that comes right after the chunk.
[00:34:29] And then you can use that to calculate the gradient of the current hidden state, which depends on the input x and on the previous hidden state. There are different ways you can formulate it, but you can imagine we just apply the update to all the weights here and zero out the memory; the only thing we're tracking is the gradient right here. So you can do a gradient-apply step, where you apply all the gradients to the weights depending on the learning rate, your optimizer, and so on, and then you move on to calculating the next batch. The reason this isn't a perfect calculation is that you're computing these chunks independently rather than all at once, so you end up with three different updates rather than a single one. But you're still calculating the gradient for each step here.
[00:35:15] You keep one thing in memory, which is: how does this hidden state, the first one in the batch, need to change to affect the loss? And we throw out all the other ones. You have the weights in memory, you apply the gradient: you do your learning-rate multiply and apply it to the weights. You'll also see a similar thing if you do distributed learning: if you have a gradient calculated on each GPU separately, they will all be applied to the same set of weights, even though they were calculated independently. I think we have a lecture on distributed learning coming up. So it's a similar idea, where you're not tracking everything in the same memory at the same time, and you're applying updates to the weights one at a time. Yeah, it would be better if you could fit it all in memory.
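The chunked procedure just described can be sketched roughly as follows. This is illustrative NumPy with made-up sizes, and a toy per-step squared-error loss stands in for the real one: the hidden state is carried forward across chunk boundaries, but gradients are computed within each chunk and the weights get one update per chunk rather than a single exact update for the whole sequence.

```python
import numpy as np

# Illustrative sketch of truncated backprop through time (all names made up).
rng = np.random.default_rng(1)
D, H = 2, 3
Wxh = rng.normal(0, 0.1, (H, D))
Whh = rng.normal(0, 0.1, (H, H))
xs = rng.normal(0, 1.0, (12, D))       # a length-12 toy sequence
targets = rng.normal(0, 1.0, (12, H))  # toy per-step regression targets
lr, chunk = 0.05, 4

h = np.zeros(H)
for start in range(0, len(xs), chunk):
    x_c, y_c = xs[start:start + chunk], targets[start:start + chunk]
    hs = [h]                                   # forward pass through this chunk only
    for x in x_c:
        hs.append(np.tanh(Wxh @ x + Whh @ hs[-1]))
    dWxh, dWhh = np.zeros_like(Wxh), np.zeros_like(Whh)
    dh = np.zeros(H)
    for t in reversed(range(len(x_c))):        # backward pass stops at the chunk start
        dh = dh + (hs[t + 1] - y_c[t])         # toy per-step loss: 0.5 * ||h_t - y_t||^2
        draw = dh * (1.0 - hs[t + 1] ** 2)
        dWxh += np.outer(draw, x_c[t])
        dWhh += np.outer(draw, hs[t])
        dh = Whh.T @ draw                      # gradient flowing to earlier steps
    Wxh -= lr * dWxh                           # one (approximate) update per chunk
    Whh -= lr * dWhh
    h = hs[-1]                                 # carry the hidden state, not the gradients
```

The key line is the last one: the hidden state crosses the chunk boundary, but the gradient computation does not, which is exactly the information loss mentioned above.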
Yeah. Yeah. It would be better if you fit it [00:36:05] Yeah. It would be better if you fit it all in memory. I mean this is mainly for [00:36:07] all in memory. I mean this is mainly for for this one it's essentially the same [00:36:09] for this one it's essentially the same but in this setting u maybe it's more [00:36:11] but in this setting u maybe it's more clear how you're explicitly losing [00:36:12] clear how you're explicitly losing information. [00:36:14] information. So, um, here you're only looking at some [00:36:17] So, um, here you're only looking at some of the outputs at a time. Um, so you [00:36:21] of the outputs at a time. Um, so you it's really clear how we're not looking [00:36:23] it's really clear how we're not looking at the entire set of the losses when [00:36:26] at the entire set of the losses when we're calculating because there's losses [00:36:28] we're calculating because there's losses at each time step. So, you lose [00:36:30] at each time step. So, you lose information here, but in this case, you [00:36:32] information here, but in this case, you wouldn't lose information. Uh, I think [00:36:35] wouldn't lose information. Uh, I think one more practical example where we [00:36:37] one more practical example where we can't fit the whole RNN on the slide is [00:36:39] can't fit the whole RNN on the slide is this idea of a character level language [00:36:41] this idea of a character level language model. And it's really funny because [00:36:43] model. And it's really funny because these were shown to be quite effective [00:36:46] these were shown to be quite effective uh 10 years ago. Um and you can it's [00:36:49] uh 10 years ago. 
[00:36:50] It's really funny because you can see how the current wave of language models is sort of a buildup of this really simple approach of just predicting characters with RNNs. Usually when you do a model like this, you will input your characters with what people call a one-hot encoding, where you have a one in your vector and zeros in every other location, so it's sort of like an index; you can encode the character as an index. Then we can use these as inputs and calculate our hidden layers based on the previous hidden layer as well as the current input. Then we have our output layer, where now we can look at the output for the corresponding correct value, which is taken to be the character at the next time step. So we want the output, for example, to be E, and we map it over here.
[00:37:45] You can imagine this is something like softmax, and we have the logits; these are the scores. 2.2 is lower than 4.1, so this is maybe not so great of an output at this time step, and so on and so forth. So you can really view this as a time-step-wise classification problem, and that's exactly what these language models are doing in general: time-step-wise classification based on softmax. At test time, the basic idea is that we also need to sample characters one at a time and feed each one back into the model, so it sees what it generated at the previous time step, and so on, repeating until we generate the words.
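The per-time-step classification view can be sketched with toy numbers (not the slide's exact model): a one-hot character encoding, a softmax over the vocabulary, and a cross-entropy loss on the correct next character.

```python
import numpy as np

# Toy sketch of one time step of character-level classification (made-up values).
vocab = ['h', 'e', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}

def one_hot(c):
    """Vector with a one at the character's index and zeros everywhere else."""
    v = np.zeros(len(vocab))
    v[char_to_ix[c]] = 1.0
    return v

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Suppose the model emitted these logits (scores), and the correct
# next character is 'e'.
logits = np.array([1.0, 2.2, -3.0, 4.1])
probs = softmax(logits)
loss = -np.log(probs[char_to_ix['e']])  # cross-entropy on the target character
# 'o' gets the highest score here, so the loss on the correct class 'e' is
# large, and a training step would push the 'e' logit up.
```

This is one time step; summing the same loss over every position in the sequence gives the full training objective.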
[00:38:35] So you can actually create RNNs to do this basic language modeling task by operating at a character level, and it works quite well. One thing to note about this input layer is that usually we don't actually input one-hot encodings into the model; instead we'll have something called an embedding layer, which is essentially just a giant matrix of dimensions V by d, where V is the number of different inputs your model can see. You can imagine this as a matrix multiply where we grab the first row, or in this case the second row, of our embedding matrix based on what our input sample is here. And we just use this as a matrix multiply. This is incorrect, actually: this one should be higher probability.
[00:39:26] It's funny; we've had these slides for quite a few years, and I guess no one noticed it. Good question. Anyway, so we have E here as our target character, and in this case you're correct that the model is actually getting it wrong, so we will want to penalize it heavily this time. Yeah, it was a good question. One of the nice things about this implementation is that it's really simple, like 112 lines of Python code, and you can train these models on a variety of different tasks. This is like the pre-LLM era of what you could do: you can train it on sonnets by William Shakespeare. And as I mentioned, there's a blog post by a former instructor of this course, Andrej Karpathy, back in 2015, about how these RNNs are sort of unreasonably effective at what they do in generating text. Yeah. Could you explain why you use an embedding layer?
[00:40:14] Oh, yeah. So the basic idea for an embedding layer is that generally it's better to have vectors as input to our models, and you can learn what these embedding layers are, too. We tend to favor spread-out weights in general when we're trying to learn these, so you can initialize your embedding layer to very small near-zero values with something like the Kaiming initialization we talked about, and then you're just looking at one row of it at a time as your input vector, rather than the input being a bare number. To represent a bare number you would basically need a one with a bunch of zeros, and optimization-wise the embedding works better. Okay, so yeah, you can do it in 112 lines of Python code, which is pretty neat. You can train it on sonnets by William Shakespeare and it'll actually output reasonable text.
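The embedding-layer answer above can be sketched in a few lines (illustrative sizes, not the course's code): multiplying a one-hot vector by a learned V-by-d matrix is exactly the same as selecting one row of it, which is why implementations do a row lookup instead of a matrix multiply.

```python
import numpy as np

# Sketch: an embedding layer is a learned V x d matrix; feeding a one-hot
# vector through it is equivalent to selecting one row. Sizes are made up.
rng = np.random.default_rng(0)
V, d = 4, 8                                  # vocabulary size, embedding dimension
E = rng.normal(0, np.sqrt(2.0 / V), (V, d))  # small random init (Kaiming-style scale)

idx = 2                                      # the input character's index
one_hot = np.zeros(V)
one_hot[idx] = 1.0

via_matmul = one_hot @ E                     # matrix-multiply view of the input layer
via_lookup = E[idx]                          # row-lookup view: identical, much cheaper
assert np.allclose(via_matmul, via_lookup)
```

During training, only the looked-up row receives a gradient at each step, so the lookup view is also what makes updates cheap.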
[00:41:01] We'll go through some examples. One of the cool things is that as you train the model more, it becomes more and more coherent. At the beginning it's basically just gibberish, because it hasn't learned proper values for W. As you train it more, by stage three it kind of looks like English, at least some of the words right there. And as you train even more, it actually starts working really well, which I guess was a bit of foreshadowing for what was to come in the era of AI, which is pretty cool. You can see it learns things about the style: how you should have someone's name, and something that seems fairly plausible.
[00:41:42] As you have it generate more and more, it starts making less and less sense, but it's pretty cool to see. You can also train it on code. I think in this example they trained it on Linux, just the source code for Linux: they trained one of these character-level RNNs, and you can see it generating C code that looks pretty good. I don't know if this would compile, but it looks reasonable at a glance. And this idea has really taken off over the past few years.
[00:42:11] I mean, I'm sure you all know, especially since a lot of you work in computer science or coding, or you're students in this area, but there are all of these different programming tools now built on language models that were essentially trained on a similar task. They've consumed a bunch of training data that's just existing code, and instead of trying to predict the next character, they're trying to predict the next token, which is a group of characters; how they define tokens depends on the model, and there are a lot of details we could get into there. But at a high level, it's a really similar thing: they're just predicting groups of characters autoregressively, one after the next. And it's really seen a blow-up in recent years with all these existing tools. Yeah. What is the input to the model? Is it like a trigger? Oh, what, like for this? Yeah.
[00:42:56] You could have the input be... yeah, maybe you start with a random character; that could be one way to do it, but you would need some initial input. Usually language models have a start token, a predetermined symbol that is always what you see at the start of your sequence, so you could do a similar thing with RNNs. I don't know what they did in this exact scenario; maybe they just used a character, but it's hard to know. So the question is: how does labeling work with language models? The neat thing about these pure language models is that all they're doing is predicting the next token. You don't need to label anything; you just need to give the model a lot of text. That's why these models are so good: they scrape the internet for essentially all available text and then train a model on all of it.
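The no-labels point can be seen in a tiny sketch: for next-character prediction, the training targets are just the input text shifted by one position, so raw text supplies its own supervision.

```python
# Sketch: the "labels" for next-character prediction come for free from raw text.
text = "hello"
inputs = list(text[:-1])    # characters the model sees: ['h', 'e', 'l', 'l']
targets = list(text[1:])    # what it should predict next: ['e', 'l', 'l', 'o']
pairs = list(zip(inputs, targets))
# Each (input, target) pair is one time-step classification example;
# no human labeling is involved anywhere.
```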
[00:43:41] That's why they're so good: it's just generating the next token, and you don't need to label it. So the next question is: if we're always taking the maximum-probability output at each time step, won't we always just be generating the same thing over and over again? And the answer is yes, actually. If you just took the maximum probability at each time step (I guess this example is not so good, but imagine the probabilities are correct here), you would always get the same output given the same input. In practice, people don't do this; it's called greedy decoding, always picking the maximum probability. Instead, they sample based on a distribution, the one given by the probabilities output by your softmax.
[00:44:24] So you won't always pick the max probability; you would pick, say, this output with probability 0.84, or this other output variable with probability 0.13, and then you would do that for each step of the sequence. And there are a bunch of different ways you can do it, too. You can search ahead, which is called beam search, where you try different continuations and see which one has the highest overall probability for the sequence. So this is a whole active area of research: how do you sample from these models? But the simple answer is you don't always pick the highest probability. Yes. The question is: in the case where we have many-to-one outputs, are we outputting something at each time step, or do we have something to look at here?
[00:45:01] So I think in practice, to save compute, you wouldn't want to output something that's never used, but you could feasibly output it at each time step, and depending on your problem it might be interesting to look at it and understand whether the output is converging over the course of training. So it might be useful to look at, but generally people wouldn't do it, just to save compute. It could help you understand the way your model works, whether there are certain triggers or things that help it predict the correct answer. Cool, good questions. Okay, so we'll keep on chugging along. We talked about these RNNs and how good they are at generating characters, and we related them to some of these modern coding tools, which are really neat.
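As a quick aside, the greedy-versus-sampled decoding discussed earlier can be sketched in a few lines of NumPy. The probabilities below are made up to echo the .84/.13 example, and `greedy_decode`/`sample_decode` are illustrative names, not anything from the lecture's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_decode(probs):
    """Greedy decoding: always take the argmax. Deterministic, so the
    same input always yields the same next token."""
    return int(np.argmax(probs))

def sample_decode(probs):
    """Sample the next token from the softmax distribution instead,
    e.g. pick index 0 with probability .84, index 1 with .13, etc."""
    return int(rng.choice(len(probs), p=probs))

# Illustrative softmax output over a 4-token vocabulary.
probs = np.array([0.84, 0.13, 0.02, 0.01])

print(greedy_decode(probs))                      # always 0
print([sample_decode(probs) for _ in range(5)])  # mostly 0, sometimes others
```

Beam search would extend `sample_decode` by keeping several partial sequences alive and comparing their total (log) probabilities rather than committing to one token at a time.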
[00:45:43] One of the cool things about RNNs is that you can look at the activation values, and they'll actually sometimes tell you interesting things about what the model is tracking. In our little toy example, we looked at the output activations, and you would see the current value and the previous value; that was what the RNN states, or cells, were tracking. What you can also do is give the model a sequence. In the models I'll show in these slides, it's using a tanh activation, so values run from minus one to one: minus one is visualized as red here, and values very close to one are blue, so we get the whole spectrum. You can then look at, for each character coming in, the activation of a given cell at that time step, and that's how they're color-coding these plots here.
[00:46:30] This one's not really showing anything; it's random. A lot of them won't be interpretable, but some of them track pretty cool things. For example, this one's a quote detector: it turns on basically as soon as the quote starts, and it turns off when the quote ends. So this is something in the RNN tracking the fact that we need an end quote at some point; exactly when to put it is something the model is trying to figure out, but it's tracking it. Another cool one is the line-length tracking cell: it starts at a very high value and drops to a very low value as you near where the model thinks there will be a newline character. So this is also a neat way to look at these values.
[00:47:16] And these are, again, just single activations in a layer of this model that we're looking at and mapping to each character, so it's highly interpretable. There's also this sort of if-statement cell, where anything within an if statement is being tracked, which is pretty cool, and even things like detecting quotes or comments, because the model needs to know to output the end-of-comment character; it's something it needs to track, so you get this nice interpretable cell as well. And finally, this code-depth cell: as you nest deeper in your code, it activates more and more at each step into the indentation of your code hierarchy. So this is pretty neat: you can actually look at the activations and directly map them onto the inputs without needing to do any fancy tricks.
[00:48:05] Which is actually pretty incredible, if you think about how interpretable some of these hidden states are in the RNN. It's somewhat similar to what we were doing when we assigned the state manually, but the RNN is internally doing a very similar process. Cool. So I'll now talk about some of the trade-offs: why you might want to use an RNN, and when it's helpful. The nice thing is that RNNs can process any length of input. A lot of modern language models that rely on transformers have something called a context length, or maximum context window. RNNs don't have this: they can take a sequence of essentially infinite length, as long as you can keep running the model on it, so there's no context-length limit.
[00:48:51] The computation for time step t can, in theory, use information from many steps back, if it's captured in the hidden state. So if your model effectively captures all of the dynamics of your input sequence in the hidden state, in theory it can use values from extremely long ago, though in practice there are some issues with this, which we'll go into. Also, the model size does not increase for a longer input. We had an example earlier asking what if you just had a different layer for each input time step; you don't have that issue here, which is nice. And we're applying the same weights at each time step, so the update rule for how we calculate the outputs is the same every single time.
[00:49:32] So there's some nice symmetry here, and when you think conceptually about the problem, you're always doing the same thing at every single time step, which is nice conceptually and also helps with implementation. So what are the main disadvantages? You need to compute the previous hidden state to compute the next one, every single time. This can be slow: each hidden state is conditioned on all the previous ones, so this recurrent computation can end up taking a lot of time.
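The sequential recurrence being described can be sketched as follows (shapes and names are illustrative). The key point is the loop-carried dependence on `h`: step t cannot start until step t-1 finishes, yet the same small weight matrices serve a sequence of any length:

```python
import numpy as np

rng = np.random.default_rng(2)
H, D, T = 16, 4, 100                  # hidden size, input size, sequence length
Whh = rng.normal(0, 0.1, (H, H))
Wxh = rng.normal(0, 0.1, (H, D))
xs = rng.normal(size=(T, D))

# Vanilla RNN recurrence: h_t = tanh(Whh @ h_{t-1} + Wxh @ x_t).
# Each iteration needs the previous h, which is why training is hard
# to parallelize across time steps.
h = np.zeros(H)
for x in xs:
    h = np.tanh(Whh @ h + Wxh @ x)

print(h.shape)  # (16,) -- fixed-size state no matter how long the sequence
```

Changing `T` to 10 or 10,000 changes nothing about the model's parameters, which is the "model size does not grow with input length" advantage; the cost is that the loop above cannot be collapsed into one parallel batch the way a transformer's attention can.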
[00:50:08] Although this is not an issue during inference time (a transformer has the same situation, where you output the next token or character one step at a time), at training time it's actually difficult to batch all of these together, because in order to calculate the loss you need to calculate the previous hidden states. So this can pose challenges for scaling up to a lot of data. And in practice, it's actually difficult to access information many time steps back, because we have a fixed-size hidden state and we're trying to cram all the information into it, so you'll eventually lose some information as your sequence grows longer and longer. Cool. I'll talk about some applications more specific to computer vision where RNNs have seen success.
[00:50:58] One of them is image captioning, which we talked about. The basic idea here is that there's this start token, or start character, which begins the sequence, and you terminate when you get this end character or end token; in this case it looks like it's word-level tokens. The most basic way to do it: you have a CNN, or some visual encoder, that encodes the image, and we use that as input to our recurrent neural network, along with the previous text that was generated. So we have two stages here. More concretely, how would you combine the CNN and the RNN? You can imagine you have this test image; it comes in, and your model runs downwards here, starting at the first layers at the top and then moving downwards.
[00:51:47] You can imagine this is something that was trained on, say, ImageNet. We're not going to use the class labels; we're going to use the second-to-last layer. This is the common strategy we saw for transfer learning, and for getting good visual representations of images in general. So we take the second-to-last layer and use it as input to our hidden state. Now our hidden state is also a function of this W value here, so we don't just have a plain hidden state; we're also tracking the visual components. But I won't spend too much time on this, because it won't be in any of the assignments.
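A rough sketch of this two-stage setup, with all sizes, weight names, and the greedy decoding loop assumed for illustration (this is not the exact architecture from the slides, which injects the image feature slightly differently):

```python
import numpy as np

rng = np.random.default_rng(3)

F, H, V = 512, 256, 1000              # CNN feature dim, hidden dim, vocab size
START, END = 0, 1                     # special token ids (hypothetical)
Wih = rng.normal(0, 0.01, (H, F))     # projects the image feature into the state
Wxh = rng.normal(0, 0.01, (H, V))
Whh = rng.normal(0, 0.01, (H, H))
Why = rng.normal(0, 0.01, (V, H))

def caption(image_feature, max_len=20):
    """Greedy caption generation: seed the hidden state with the CNN's
    second-to-last-layer feature, then unroll the RNN until END."""
    h = np.tanh(Wih @ image_feature)  # image information enters the state
    token, out = START, []
    for _ in range(max_len):
        x = np.zeros(V); x[token] = 1.0
        h = np.tanh(Whh @ h + Wxh @ x)
        token = int(np.argmax(Why @ h))
        if token == END:              # sampling END tells us when to finish
            break
        out.append(token)
    return out

print(caption(rng.normal(size=F)))    # a (meaningless) token id sequence
```

With random weights the output is of course gibberish; the structure is the point: the CNN runs once, and its feature conditions every step of the RNN's unrolling.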
[00:52:34] But just to give you a flavor, here's how RNNs were used historically with CNNs: we take a CNN pre-trained on ImageNet and fold that information into the hidden state as well. We use the sampling process, either greedy sampling or some other version, to produce tokens at each time step, and we end whenever we sample the end token; that's how we know when to finish. These models actually worked very well for the time; they had a lot of great successes. You can see here a lot of nice examples where the model outputs very reasonable captions based on the input image, but the models would struggle in a lot of scenarios, too. A lot of these failures have to do with the distribution of where these images are commonly seen in the training data.
[00:53:21] For example, someone holding something with their hands cupped like this looks very much like how they might hold a mouse, but obviously we can tell this is a phone, because it's a flat object they're holding and their hand is facing up, not downwards. So this sort of thing is interesting to see. Also, I guess the model thinks the woman is holding a cat when she's just wearing some fur clothing, and it sees a beach, so it assumes there's a surfboard. This type of hallucination, I would say, is still extremely common with vision-language models today: the model thinks there are objects present that are commonly found in a given kind of scene but aren't in the particular scene you're looking at. Also things like a bird supposedly perched in the tree, or a man described as throwing a ball when he's actually catching one.
[00:54:11] These are all based on bias in the dataset: essentially, the model learns during training that a certain object is most probably present, or a certain action is most probably being performed, when in the actual image that's not the case. In the dataset there's a high co-occurrence of these actions or objects with the particular scene, so the model learns to associate them, but it doesn't learn to disentangle them. In this scene, we know they're not throwing, because the glove is here and the ball is going into the glove, not leaving the other hand; but you need to explain it like that, and the way we train these models, we're training them just to output the caption, so there's no explanation involved. That's part of the reason you see this co-occurrence issue. Okay.
[00:54:54] So visual question answering is another really common task where RNNs were used, and there are two formulations that were commonly used. One: say you have a captioning model and you want to see how well it can answer questions. One thing you could do is give it the question, have it output text, and look at the probabilities of each of the answer sequences. You have a probability for each character or token, and you can multiply them together to get the probability of the overall answer. That's one way you could use one of these RNN-style models to do question answering. A more common way people did it is to give the question as input to the model.
[00:55:42] Multiple candidate answers are also given as separate inputs to your model, and it then outputs essentially a probability per answer. So in this case it would be a four-way classifier, where you have four classes (answer one, answer two, answer three, answer four) and you're just outputting the probabilities. There are a lot of different ways you can formulate it, but this is a very common task in computer vision where you need to use language and where sequence modeling helps. There's also visual dialogue. At the time, these were all considered very separate tasks;
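The first VQA formulation mentioned above, scoring a candidate answer by multiplying its per-token probabilities, can be sketched like this. The distributions here are made-up stand-ins for an RNN's softmax outputs, and the product is computed in log space for numerical stability:

```python
import numpy as np

def answer_log_prob(step_probs, answer_tokens):
    """Score a candidate answer: sum of log-probabilities of its tokens,
    equivalent to multiplying the per-token probabilities together."""
    return sum(np.log(step_probs[t][tok])
               for t, tok in enumerate(answer_tokens))

# Two hypothetical softmax outputs over a 3-token vocabulary.
step_probs = [np.array([0.7, 0.2, 0.1]),
              np.array([0.1, 0.8, 0.1])]

scores = {ans: answer_log_prob(step_probs, ans)
          for ans in [(0, 1), (2, 2)]}
best = max(scores, key=scores.get)
print(best)   # (0, 1): 0.7 * 0.8 beats 0.1 * 0.1
```

The four-way-classifier formulation replaces this per-token scoring with a single softmax over the candidate answers.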
[00:56:13] these days you have sort of one model that can do almost all of them. But how can you have a chat about an image? We've really seen an explosion in the capabilities of these kinds of models in the last two years. Maybe one other task that RNNs were commonly used for is visual navigation: you have images coming in, and you want to output a sequence of directions to move through some 2D floor plan to reach the target destination. This is another application to be aware of where these sequence models were used. Okay. One thing I want to note that I didn't explicitly mention before: just as we can have multi-layer CNNs, or multiple dense or fully connected layers, you can also have multi-layer RNNs.
[00:57:07] And in practice, most of the RNNs I showed were multi-layer RNNs. The main difference is that you treat each layer separately: the hidden state of, say, layer 1 depends on the hidden state of layer 1 at the previous time step. So in the depth dimension, each of these layers only looks at the hidden states from that same layer at previous time steps. And in the time dimension, the first layer takes the actual input x, but the second layer takes as its input the output y from the previous layer. So you can stack these up, and it forms a grid where each layer operates with respect to the previous hidden states only within that layer.
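The grid structure just described can be sketched as follows (layer count and shapes are illustrative). Along time, layer l only sees its own previous hidden state; along depth, layer l's input at each time step is layer l-1's output at that same step, with layer 0 receiving the actual input x_t:

```python
import numpy as np

rng = np.random.default_rng(4)
L, H, D, T = 3, 8, 4, 5               # layers, hidden size, input size, steps
Wxh = [rng.normal(0, 0.1, (H, D if l == 0 else H)) for l in range(L)]
Whh = [rng.normal(0, 0.1, (H, H)) for l in range(L)]
xs = rng.normal(size=(T, D))

h = [np.zeros(H) for _ in range(L)]   # one hidden state per layer
for x in xs:
    inp = x
    for l in range(L):
        # time direction: layer l's OWN previous state h[l];
        # depth direction: input comes from the layer below.
        h[l] = np.tanh(Whh[l] @ h[l] + Wxh[l] @ inp)
        inp = h[l]                    # passed upward to the next layer

print(len(h), h[-1].shape)            # 3 (8,)
```

Note that the top-right cell of the grid (last layer, last time step) depends on every cell below and to the left of it, which is the training-cost point made above.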
[00:58:04] But passing inputs and outputs, that happens between layers. And so you can see that to calculate this top-right value, we need to calculate all of the different hidden states in this entire computational graph beforehand. So you can get a feel for how, as you start training, this becomes a very involved and not very efficient process. Okay. I'll talk about one of the key variants of RNNs, actually proposed a while ago, in the 1990s, which saw a lot of success for quite some time until the transformer revolution: LSTMs. You won't need to know the details of how LSTMs operate, but what I hope you take away is that RNNs have some key disadvantages that LSTMs seek to alleviate, and a lot of the more modern state-space models also try to alleviate some of these same issues that RNNs face.
[00:59:00] So we talked about how, by default, tanh is a really commonly used activation function, and we also talked about how you have this Whh matrix that converts your previous hidden state to the new one, and it's summed with this Wxh matrix that converts your input vector xt at the current time step into your hidden state dimension. Then you sum these together. You can also formulate this as: we have our weights here and weights here, and you're stacking the vectors like this. And so sometimes, for shorthand, people will just combine both of these W's together to form one big W.
[00:59:42] But you should note that these are like two blocks that are diagonally positioned together; if you formulate it like this, there are a lot of zeros in this W, because Whh is not interacting with xt at all. But this is a shorthand way to notate it that makes thinking about it and writing down the math easier. So you will see all three variants here. This one is maybe the most explicit about where the actual non-zero values in the weight matrices lie. And so one way to think of it is: you stack these vectors together, which is shown here, we multiply by this W, and then we pass it through tanh. This gives us our output ht, which we pass to the next RNN step. You can imagine these are stacked.
[01:00:22] And then you may also have either the output directly, yt, or we have this layer where it's a weight matrix times ht, with an activation function around it too. Yeah, question? (Student asks how the weights are shared.) Yeah, so the weights are shared within a layer for a multi-layer RNN. All of these hidden state updates will use the same weights, and then each layer, which you stack vertically in this diagram, will have a separate set of weights. Okay. So this is the way that it works. And then when you do backpropagation, we talked about how, if you don't have a loss for each time step's y, you only need to calculate your loss based on the loss of your output ht.
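The update just described, h_t = tanh(Whh · h_{t-1} + Wxh · x_t), and the combined-W shorthand from the slides can be sketched in a few lines of numpy. This is a minimal illustration with arbitrarily chosen sizes, not code from the course; it also verifies numerically that the stacked form is equivalent:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 4, 3                        # hidden size, input size (arbitrary)
Whh = rng.standard_normal((H, H))  # hidden-to-hidden weights
Wxh = rng.standard_normal((H, D))  # input-to-hidden weights

def rnn_step(h_prev, x):
    """One vanilla RNN update: h_t = tanh(Whh @ h_{t-1} + Wxh @ x_t)."""
    return np.tanh(Whh @ h_prev + Wxh @ x)

# Shorthand form: combine the two matrices into one big W and stack the
# two vectors into one; the result is identical.
W = np.hstack([Whh, Wxh])          # shape (H, H + D)

h_prev = rng.standard_normal(H)
x = rng.standard_normal(D)
h1 = rnn_step(h_prev, x)
h2 = np.tanh(W @ np.concatenate([h_prev, x]))
assert np.allclose(h1, h2)         # both formulations agree
```

Writing it as one big W is exactly the notational shorthand mentioned in the lecture: the block structure of W makes clear which entries touch the hidden state and which touch the input.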
[01:01:21] And so when you do this backpropagation, you're multiplying by W and you're also taking the derivative of tanh, and both of these can actually have issues. So, mathematically: if we change each component of our hidden state at time t minus 1, how will that affect ht? This is what this gradient is calculating. We need the derivative of tanh, because this is our activation function, and then we have Whh, which is the multiply here for converting the previous hidden state to the next one. So this is how we calculate the gradient, and here we can run into issues. If we're calculating the loss at each time step, then for the total loss we just sum it for each of the weights.
[01:02:11] So the total loss is just the sum of the loss at each time step with respect to this reused W matrix. And so you end up getting this product of terms: to calculate the loss at the final step, LT, with respect to ht, you need to calculate each of the intermediate hidden states and how each affects W, in order to calculate this final loss using the chain rule. We mentioned an example here, and just to point out why this is an issue: if we look at these individual terms, and hone in on how changing the current hidden state changes the next one, which is the majority of the calculations contained in this product term here,
[01:03:04] we get that it's the same thing we mentioned earlier, where you have this derivative of tanh multiplied by your Whh. And so why is this an issue? Well, first of all, this is the derivative of tanh plotted here. Its maximum value is one, so you're almost always getting less than one, and you can have vanishing gradients from this term alone. But even if we assume there's no nonlinearity, or we pick some activation function that doesn't have this issue, if we look at this weight matrix that we're multiplying at each time step, it can have a large singular value. A singular value tells you the maximum amount a unit vector coming in could be stretched by the matrix.
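The argument above can be made concrete with a small numeric sketch (random weights chosen for illustration, not from the lecture): accumulating the per-step Jacobian diag(1 − tanh²) · Whh over many time steps makes the gradient norm collapse whenever the largest singular value of Whh is below one.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8
Whh = rng.standard_normal((H, H))
# Rescale so the largest singular value is 0.9 (< 1): each backprop step
# can then only shrink the gradient.
Whh *= 0.9 / np.linalg.svd(Whh, compute_uv=False)[0]

h = rng.standard_normal(H)
J = np.eye(H)                       # accumulated Jacobian d h_t / d h_0
norms = []
for t in range(50):
    pre = Whh @ h                   # pre-activation (no input term, for simplicity)
    h = np.tanh(pre)
    # One step's Jacobian: diag(1 - tanh(pre)^2) @ Whh, chained onto J.
    J = np.diag(1.0 - np.tanh(pre) ** 2) @ Whh @ J
    norms.append(np.linalg.norm(J))

# The gradient norm decays geometrically with the number of time steps.
assert norms[-1] < 0.1 * norms[0]
```

The same code with the rescale factor set above 1 would show the opposite failure mode, exploding gradients.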
[01:03:59] So if it's very large, you can have these gradients explode, or if it's very small, you can have this vanishing gradients issue. And if you have exploding gradients, we have a fix, which is scaling the gradient: you can just divide it, or clip it somehow, so that too big a gradient is not too much of an issue. But this really small gradient, the vanishing gradient issue, is actually the main reason why people don't just use really long RNNs in practice: because of tanh, and because under many scenarios your weight matrix has this property where it's either expanding your activations or reducing them. So I think these are the main reasons that motivated a change in RNN architectures, and a lot of the reasons why people don't use RNNs. This is one of the main issues.
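The exploding-gradient fix mentioned here, rescaling (clipping) the gradient when its norm gets too big, is commonly implemented roughly as below. This is a minimal sketch; deep-learning frameworks ship equivalent norm-based clipping utilities, and the threshold of 5.0 is just an example value:

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so its L2 norm never exceeds max_norm.

    Small gradients pass through untouched, which is why clipping helps
    with exploding gradients but does nothing for vanishing ones.
    """
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])           # norm 50: an "exploding" gradient
clipped = clip_gradient(g, max_norm=5.0)
# Norm is now exactly 5.0 and the direction of g is preserved.
```

Note that clipping changes only the magnitude, never the direction, of the update, which is why it is considered a fairly benign fix.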
[01:04:50] So how do you resolve this? The way people did it was the creation of the LSTM, and the high-level idea, which I won't go into in too much detail because it's actually quite complicated, is that you have four different gates tracking different values. Instead of just having one hidden state, you have multiple values you precompute to determine how to change your hidden state, and also what information to pass through a different pathway. So you have the regular hidden state pathway, and you have a different pathway where it's easier to pass information. This is the basic idea. At a high level, there's what they call the gate gate: what are you actually writing to the hidden state of the cell? The input gate, which decides whether or not you write information to the cell.
[01:05:37] The forget gate: how much to forget from previous time steps. As well as the output gate, which is how much you actually output to your hidden state. So you can see this is really involved, a lot of design choices here, and they put it all together into this, I would say, fairly complicated diagram. But the basic idea is: this part is the same, where we're doing this weight multiply, but now we have four different values we're computing instead of just the ht. We have the input gate and the gate gate to determine how much to write here, and we have our output that's passed to the next hidden state. You can think of this top section here as a highway, where the goal is to not have any activation functions. So no tanh, and we avoid the issues we had where tanh made the gradients vanish. All we're applying is this forget gate.
[01:06:28] So as long as we're not basically forgetting all the information at each time step, we're able to pass information more easily. This is the high-level explanation, and, more importantly, in practice people saw that this worked very well. Again, you won't be implementing this for the course at all, but I think this is still a really commonly used baseline in some deep learning papers. So it's good to know about, but you can think of it through this lens: people are trying to construct these things to make up for the issues that RNNs had, which are vanishing gradients and also the lack of information being captured. You need to cram everything into this hidden state, right? So if you have really long-term dependencies, those are lost. So they created a separate pathway to pass this more long-term information over, through the top here.
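For reference, one LSTM step with the four gates described here (input i, forget f, output o, and the "gate gate" g) can be sketched as follows. This is the standard textbook formulation with arbitrarily chosen sizes, not course code; note how the cell state c is the "highway" pathway with no tanh applied along it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
H, D = 4, 3                                     # hidden / input sizes (arbitrary)
W = rng.standard_normal((4 * H, H + D)) * 0.1   # one big matrix -> four gates

def lstm_step(h_prev, c_prev, x):
    """One LSTM update. The cell state c is only scaled by the forget
    gate and added to -- no activation function squashes the pathway
    itself, which is what eases gradient flow across time steps."""
    z = W @ np.concatenate([h_prev, x])
    i = sigmoid(z[0*H:1*H])   # input gate: write to the cell?
    f = sigmoid(z[1*H:2*H])   # forget gate: how much old cell info to keep
    o = sigmoid(z[2*H:3*H])   # output gate: how much of the cell to reveal
    g = np.tanh(z[3*H:4*H])   # gate gate: candidate values to write
    c = f * c_prev + i * g    # highway update: no tanh on the c pathway
    h = o * np.tanh(c)        # hidden state actually output
    return h, c

h, c = np.zeros(H), np.zeros(H)
x = rng.standard_normal(D)
h, c = lstm_step(h, c, x)
```

As long as the forget gate f stays near 1, information in c survives many steps, which is the intuition the lecture gives for why LSTMs ease the vanishing-gradient problem without fully solving it.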
[01:07:14] So do LSTMs solve the vanishing gradient problem completely? It definitely helps. It makes it easier for the RNN to preserve information over many time steps by using this top pathway. In contrast, it's much harder for vanilla RNNs to learn a recurrent weight matrix that preserves info in the hidden state across every single time step, if we're always doing the same operation and we're not able to just pass information directly without an activation function. So it doesn't guarantee it, but it makes it significantly easier, helps improve learning of long-term dependencies, and works very well empirically. So people generally don't train plain RNNs so much; they'll more often train LSTMs if they're going to go with this recurrent modeling route.
[01:08:09] But I think in general these have also, I would say, significantly fallen out of fashion. Still, this gives you a sense of the way people have tried to design RNNs to account for the issues they face. So, one other thing that would be kind of cool to tie in to something you learned earlier in this course: this idea of directly adding outputs, and skipping some activation functions or other layers, is actually highly related to the idea we discussed with ResNets, where you have these skip connections where the value is just copied over and added later in the layer block.
[01:08:44] So in ResNets you have multiple of these convolution layers stacked together, and then you add a skip connection where the value just gets added here. And you can think of this in a similar light for LSTMs: it's skipping over some of these layers, and it helps, except in this case, instead of a very large depth of the model, it's very long sequences of time steps. So it's a parallel, but a little different, because one is the number of layers and the other is the number of time steps. Okay. I think the final slide for today's lecture is just a little tie-in for how these RNNs have made a bit of a resurgence in the last year or two, which is kind of funny, because I think if we had taught the course maybe a year or two ago, I would have been much more willing to cut RNNs entirely.
[01:09:33] But there are actually a lot of nice advantages they have. The main one is unlimited context length. One of the main issues with transformers is that they have a limited context length, and as people really push the boundaries of what these models are capable of, this context length is becoming more and more of an issue. There have been various workarounds in the transformer space; people do things like RoPE and some other techniques to try to extend the context length, but it's a pretty significant limitation of the model. The other thing is that during inference for RNNs (and during training too), the compute scales linearly with the sequence length: as you add more and more steps to your sequence, you just need to recompute the same operation over and over again.
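The scaling point can be seen from a rough cost count (an illustration of the asymptotics, not a benchmark): a recurrent model does a fixed amount of work per token, so total work grows linearly in sequence length T, while self-attention compares every token to every other token, so total work grows quadratically.

```python
def recurrent_cost(T, per_step=1):
    # One fixed-size hidden-state update per token -> O(T) total work.
    return T * per_step

def attention_cost(T, per_pair=1):
    # Every token attends to every token -> O(T^2) total work.
    return T * T * per_pair

# Growing the sequence 10x grows recurrent cost 10x, attention cost 100x.
assert recurrent_cost(10_000) == 10 * recurrent_cost(1_000)
assert attention_cost(10_000) == 100 * attention_cost(1_000)
```

This gap is exactly what the linear-time sequence models mentioned next (RWKV, Mamba) are built around.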
[01:10:16] So there's no operation that looks across the entire input sequence the way you have for transformers. These are really big advantages, and there have been a couple of papers. To shout out a few: there's this RWKV model, you can check out the arXiv link here, and also Mamba. Both mainly highlight this idea of achieving linear-time sequence modeling: as you scale up your input sequence, the compute also scales linearly, as opposed to quadratically with transformers, so it's better for long-context problems, and sometimes in terms of compute it works better. So people will try to get the best of both worlds, and there's been a lot of research in this area: how can you get the performance of transformers with the scaling of RNNs? Okay, so that's all for today in class.
[01:11:03] We basically talked about how there are a lot of different ways you can design architectures with RNNs. Vanilla RNNs are simple, but they don't work that well, and there have been more complex variants that people have proposed that introduce ways to selectively pass information. This backward flow of gradients in RNNs can either explode or vanish depending on the activation function you use and the properties of your weight matrix. You often need backpropagation through time to actually compute the gradient as well. And finally, these better architectures are a hot topic of research right now, as are new paradigms for reasoning over sequences in general. So yeah, I think that's it for today. Next time we'll talk about attention and transformers.
================================================================================ LECTURE 008 ================================================================================ Stanford CS231N | Spring 2025 | Lecture 8: Attention and Transformers Source: https://www.youtube.com/watch?v=RQowiOF_FvQ --- Transcript [00:00:04] All right, welcome back everyone to lecture 8. Today we're going to talk about attention and transformers, and I think this is a really fun one. As a quick recap, last time we were talking about recurrent neural networks. Recurrent neural networks were this new kind of neural network architecture meant for processing sequences, and in particular we saw how neural networks, by processing sequences, let us attack a whole set of different kinds of problems than we could with convolutional networks before. In particular, usually we had been thinking about these one-to-one problems, where you input one thing, like an image, and then output one thing, like a classification for what's in that image.
[00:00:40] But once you have the ability to move beyond images and towards sequences of data, it lets us tackle a lot of new kinds of problems: one-to-many problems like image captioning, where maybe we want to input an image and output a textual description of that image, which is going to be a sequence of words; many-to-one, where we input a sequence of frames and output a classification for those frames; and a bunch of other problems along this vein. So now we're seeing that moving into these more sophisticated neural network architectures is both more interesting architecturally and also lets us tackle new problems than we could with more traditional feed-forward neural networks. So today we're going to build on that and talk about two new things in today's lecture.
[00:01:19] The first is going to be attention, which is a brand new neural network primitive that fundamentally operates on sets of vectors. The second thing we're going to talk about is the transformer. The transformer is a neural network architecture that has self-attention at its core. And spoiler alert: transformers are basically the architecture that we use for almost all problems in deep learning today. Any of the largest applications that you're seeing out there in the wild today, whether it's classifying images, generating images, generating text, classifying text, or working with audio: basically any kind of large neural network today that is state-of-the-art, trained on a lot of data, and deployed by a big company, almost all of them are going to be transformers.
[00:02:02] So it's really exciting that we get to get you up to speed on the latest and greatest architectures that people are using now. But even though transformers are the state-of-the-art architecture that everyone is using for everything today, they have a relatively long history. It's kind of interesting watching these fields develop, because looking back on it, the moment that transformers came out feels like it ought to have been a big moment, a big sea change, this new architecture. But it actually didn't feel that way at the time, because even though there was one moment where the transformer architecture was born, these ideas around self-attention, around using attention in various ways, had already been around in the field for several years.
[00:02:42] In particular, these ideas around attention and self-attention actually developed out of recurrent neural networks, so we're going to start there to motivate them; this will roughly mirror the historical development of these ideas. For that reason, in order to introduce transformers, we're going to roll back and recap a little bit of the idea of recurrent neural networks that we saw in the last lecture. As a motivating problem, let's think about the sequence-to-sequence problem of translation. We want to input one sequence, which is a sequence of words in English, and output another sequence, which is a sequence of words in a different language, Italian.
[00:03:21] And we can't assume that there's any word-for-word correspondence between those sequences: the number of words in the English sentence might be different from the number of words in the Italian sentence, and the order of those words might be totally different. So this is a perfect application for the kind of sequence processing architectures that we saw with recurrent neural networks. Indeed, the idea of handling these sequence-to-sequence problems with recurrent neural networks goes all the way back to 2014, even a bit earlier than that, and people had been processing sequences with recurrent neural networks for more than a decade at that point.
[00:03:56] The basic architecture for processing sequence-to-sequence problems with recurrent neural networks is that typically you start with an encoder. Your encoder is a recurrent neural network. The recurrent neural network, recall, is a function that gets applied recursively on two inputs: x_t, your input at the current time step, and h_{t-1}, your hidden state at the previous time step. Your recurrent neural network unit will then spit out the next hidden state at the next time step. Then we can apply that same recurrent unit over time to process a sequence of potentially variable length. In this case, we're using a recurrent neural network encoder that inputs the input sequence in English.
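The recurrence just described can be sketched in a few lines of NumPy. The tanh nonlinearity and the names Wx, Wh, b are the usual vanilla-RNN conventions, chosen here for illustration rather than taken from the lecture's slides:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One tick of a vanilla RNN: combine the current input x_t
    with the previous hidden state h_prev to get the next state."""
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

# Toy sizes: 3-dim inputs, 5-dim hidden state (illustrative values).
rng = np.random.default_rng(0)
Wx = rng.normal(size=(3, 5))
Wh = rng.normal(size=(5, 5))
b = np.zeros(5)

h = np.zeros(5)                      # initial hidden state
for x_t in rng.normal(size=(4, 3)):  # a length-4 input sequence
    h = rnn_step(x_t, h, Wx, Wh, b)  # same unit applied at every step
```

Because the same unit is reapplied at each step, the loop works unchanged for any sequence length.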
[00:04:37] For the input sequence (you've got to use relatively short sentences to fit on slides and still have all the boxes show up), we're using a kind of short, silly sentence: "we see the sky." Each word in that sentence gets processed via one tick of the recurrent neural network. The idea of this encoder recurrent neural network is that it should process all of the words in the input sequence and somehow summarize the content of that input sentence so that we can translate it into our output target language. More concretely, after processing all the words in the input sequence, we're going to summarize the entire content of that input sequence into a single vector called the context vector.
[00:05:21] There are a couple of different ways that people would typically do this in recurrent neural networks, and I don't think the details are too interesting, so you can just think of that context vector as basically the last hidden state of the encoder recurrent neural network. The idea is that, because of the recurrent structure of our network, the last hidden state incorporates information from the entire input sequence. So we can think of that last hidden state as summarizing, or encoding, all of the information in the entire input sequence. That gives us one vector that summarizes the whole input sequence, to do whatever we want with.
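Under that simplification (context vector = last encoder hidden state), the encoder is just a loop over the input. A minimal sketch with toy dimensions of my own choosing:

```python
import numpy as np

def encode(xs, Wx, Wh, b):
    """Run a vanilla RNN over the input sequence and return the last
    hidden state, used here as the context vector c."""
    h = np.zeros(Wh.shape[0])
    for x_t in xs:
        h = np.tanh(x_t @ Wx + h @ Wh + b)
    return h  # one fixed-length summary, regardless of sequence length

rng = np.random.default_rng(0)
Wx, Wh, b = rng.normal(size=(3, 5)), rng.normal(size=(5, 5)), np.zeros(5)

short = rng.normal(size=(4, 3))   # four tokens, like "we see the sky"
long = rng.normal(size=(500, 3))  # a 500-token paragraph
# Both inputs collapse into a context vector of the same fixed size.
c_short, c_long = encode(short, Wx, Wh, b), encode(long, Wx, Wh, b)
```

Note that the output size is set by the hidden dimension alone; that fixed size is exactly the bottleneck the lecture turns to shortly.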
[00:05:56] In this case, what we want to do with it is translate that input sequence into an output sequence in a different language. To do that, we're going to use a second recurrent neural network called the decoder, which usually has the same architecture but potentially a different set of learned weight matrices. So the decoder is a different recurrent neural network with different learnable weights, but the same basic idea. This recurrent unit is going to take three inputs at every time step: y_{t-1}, the token of the output sequence at the previous time step; s_{t-1}, the previous decoder hidden state; and c, the context vector summarizing the entire input sequence.
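One simple way to wire up those three inputs is to give each its own weight matrix and sum the contributions before the nonlinearity. This is a sketch of one common choice, not the exact formulation on the slides; the names Wy, Ws, Wc are hypothetical:

```python
import numpy as np

def decoder_step(y_prev, s_prev, c, Wy, Ws, Wc, b):
    """One decoder tick: the previous output token embedding y_prev,
    the previous decoder state s_prev, and the context vector c all
    feed into the next decoder state. (One possible wiring.)"""
    return np.tanh(y_prev @ Wy + s_prev @ Ws + c @ Wc + b)

rng = np.random.default_rng(0)
d_tok, d_hid = 3, 5                 # toy embedding / hidden sizes
Wy = rng.normal(size=(d_tok, d_hid))
Ws = rng.normal(size=(d_hid, d_hid))
Wc = rng.normal(size=(d_hid, d_hid))
b = np.zeros(d_hid)

y0 = rng.normal(size=d_tok)         # embedding of the start token
s0 = np.zeros(d_hid)                # initial decoder state
c = rng.normal(size=d_hid)          # context vector from the encoder
s1 = decoder_step(y0, s0, c, Wy, Ws, Wc, b)
```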
[00:06:41] Then we unroll that output sequence just as we saw in the last lecture, and produce words one at a time in the output sequence. I don't speak Italian, so I'm not going to try to pronounce these, but there are some Italian words on the screen that you can see, and I'm assuming that it indeed translates to "we see the sky." Hopefully that's correct. The idea is that we're going to tick this recurrent neural network one tick at a time, and it's going to output words one at a time. This is basically a summary of what we saw last lecture, so it should not be too surprising. But there's a potential problem here: there's a communication bottleneck between the input sequence and the output sequence.
[00:07:25] The only way the input sequence communicates with the output sequence is via that context vector c. And c is a fixed-length vector, because its size is fixed when we set the size of our recurrent neural network. Maybe that's fine: c might be a fixed-length vector of, say, 128 floats or 1024 floats, but its size is not going to change as our input and output sequences grow or shrink. And that's a potential problem. For short sequences like "we see the sky," it seems pretty plausible that we can summarize everything we need to know about the sequence in a fixed vector of 1024 floats. But what if we're not trying to translate four words? What if we're trying to translate a whole paragraph, or a whole book, or an entire corpus of data?
[00:08:10] In that case, we're going to run into a bottleneck: at some point, as we scale up the input sequence, it's just not sensible to ask the network to summarize the entire input sequence into a single fixed-length vector. So that's going to be a problem. The solution is: let's not bottleneck the network through one fixed-length vector. Instead, let's change the architecture of our recurrent neural network. Intuitively, we don't want to force a fixed-length-vector bottleneck between the input and the output. Instead, as we process the output sequence, we're going to give the model the ability to look back at the input sequence: every time it produces an output vector, the network gets the opportunity to look back at the entire input sequence.
[00:08:56] If we do this, there's no bottleneck, it will scale to much longer sequences, and hopefully the model architecture will work much better. That's the motivating idea that led to attention and transformers and all this great stuff that we see today. One way of telling the story is that it all came from trying to solve this bottleneck problem in recurrent neural networks. So let's see how we can actually implement this intuition and endow our recurrent neural network with the ability to look back at the input sequence on every time step. We start with the same setup: our encoder neural network remains the same, no changes there. We still need to set some initial hidden state for the output sequence, so we set some initial decoder state s_0 in some way.
[00:09:40] But now, once we have that decoder hidden state, we're going to look back at the input sequence. The way we do that is by computing alignment scores: a scalar value for each step in the input sequence that says how much that initial decoder state s_0 matches each token of the input sequence. In this case there were four tokens in the input sequence, so we want to compute four alignment scores, each of which is just a single number giving the similarity between a token of the input sequence and this initial decoder state s_0.
[00:10:24] Now, there are a lot of ways we could implement alignment scores, but a simple one is to use a simple linear layer that we're calling f_att. That linear layer concatenates the decoder hidden state s with one of the encoder hidden states h into a single vector, then applies a linear transform that squashes it down into a scalar. That's just a linear operator that can be put into a computational graph and learned jointly via gradient descent, just like all the other parameters of the network. So at this point we've got a scalar alignment score for each step in the input sequence. Now we want to apply a softmax function, because these scalar alignment scores are totally unbounded.
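A sketch of that scoring function: concatenate the two states, then one linear map down to a scalar. The weight vector w and bias b are hypothetical stand-ins for the learned parameters of f_att, and the state values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
H = rng.normal(size=(4, d))  # encoder hidden states h1..h4 (toy values)
s0 = rng.normal(size=d)      # initial decoder state

# Hypothetical learned parameters of the alignment layer.
w = rng.normal(size=2 * d)
b = 0.0

def f_att(s, h):
    """Alignment score: concatenate [s, h], then a linear map to a scalar."""
    return np.concatenate([s, h]) @ w + b

e = np.array([f_att(s0, h) for h in H])  # one alignment score per input token
```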
[00:11:04] They're arbitrary real values from minus infinity to infinity, and we want to put some structure on them to prevent things from blowing up. One way to do this is to apply a softmax function. We've got four scalar values telling us the alignment of that decoder hidden state with each of the encoder hidden states; now we apply a softmax over those four values to give us a distribution over them. Remember, the softmax function that we saw a few lectures ago takes a vector of arbitrary scores and converts it into a probability distribution, which means each entry of the output will be between 0 and 1, and they will sum to one.
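A minimal NumPy softmax makes those two properties concrete; the max-subtraction is a standard numerical-stability trick, not something the lecture dwells on:

```python
import numpy as np

def softmax(e):
    e = e - e.max()     # subtract the max for numerical stability
    p = np.exp(e)
    return p / p.sum()  # normalize so the entries sum to one

scores = np.array([2.0, -1.0, 0.5, 3.0])  # unbounded alignment scores
a = softmax(scores)
# Every entry of a lies in [0, 1], and the entries sum to 1.
```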
[00:11:43] So whenever we run a vector through a softmax, we can think of what we get out as a discrete probability distribution over those input scores. At this point, after we take those alignment scores and run them through a softmax, what we've essentially done is predict a distribution over the input tokens given that decoder hidden state. Now we want to take that distribution over the input tokens and use it to compute a vector summarizing the information in the encoder. The way we do that is to take our attention scores, which recall are these numbers a_{1,1}, a_{1,2}, a_{1,3}, a_{1,4}, all between 0 and 1 and summing to one, and take a linear combination of the encoder hidden states h_1, h_2, h_3, h_4.
[00:12:35] We take a linear combination of those encoder hidden states weighted by our attention scores. This gives us a new context vector that we're calling c_1, shown here in purple, which summarizes the information in the encoder sequence in a way that's modulated by those attention weights. At this point, c_1 is some linear combination of the encoder states h_1 to h_4, and things look basically the same as in the non-attention case: we have our context vector, we concatenate it with the first token of the output sequence y_0, and we pass that to our recurrent unit to get both the next hidden state of the decoder recurrent neural network and the first output token from the decoder.
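The weighted sum itself is one line of NumPy. Both the encoder states and the attention weights below are made-up illustrative values, with the weights concentrated on the first two states:

```python
import numpy as np

# Four toy encoder hidden states h1..h4 as rows (5-dim each);
# values are random placeholders, not from the lecture.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 5))

# Attention weights out of the softmax: nonnegative, summing to one.
# These particular numbers are made up for illustration.
a = np.array([0.45, 0.40, 0.10, 0.05])

c1 = a @ H  # context vector: linear combination of encoder states
```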
[00:13:27] So basically the structure of that decoder RNN did not change; all we did is compute the context vector in a different way, using this attention linear combination mechanism. But now, crucially, the intuition is that this context vector attends to, or looks at, different parts of the input sequence, modulated by whatever the output RNN wants to look at at this moment in time. For example, the input sequence has these two words, 'we see'. In trying to produce the one word in Italian that corresponds to 'we see', the network probably wants to go back and look at those two words in the input sequence in order to know what output word to produce.
[00:14:14] So we might intuitively expect that when trying to produce the word 'vediamo', the network will want to look back at the words 'we see' and put higher attention weights on those, and it doesn't really care about 'the sky' because those words are not necessary for producing the 'vediamo' output. That's the intuition: we're giving the network the ability to look back at the relevant parts of the input sequence for the word it's trying to predict at this moment in time. The other thing to keep in mind is that this is all differentiable. We don't need to supervise the network; we don't need to tell it which words in the input sequence were required for each word in the output. Instead, this is just a big computational graph composed of differentiable operations.
[00:14:58] All of this can be learned end to end via gradient descent. At the end of the day, we still have this cross-entropy softmax loss where the network is trying to predict the tokens of the output sequence, and in the process of trying to predict the right tokens, it's going to learn for itself how to attend to different parts of the input sequence. That's really critical, right? If we had to go in and supervise the network with the alignment between the two, it would be very difficult to get training data for this kind of thing. The question is: how do we initialize the decoder? We've got to be careful; we're using the word 'initialize' in a slightly overloaded way here. One sense is that the decoder is itself a neural network that has weights.
[00:15:37] When we start training that network, we need to initialize those weights in some way, so we will typically initialize the weights of the decoder randomly and then optimize them via gradient descent, just as we do with any other neural network weights. But there's a second notion of initialization: when the network is processing a sequence, whatever the current values of the weights are, we need some way to set the initial hidden state at the time we start processing an output sequence. In that case, we need some rule for setting that initial hidden state of the decoder. There are a couple of different mechanisms for this. One thing you'll sometimes do is initialize it as the last hidden state of the encoder.
[00:16:20] You might have a linear transform, some learned projection from the last encoder state to the first decoder state. Or sometimes people will even initialize the first hidden state of the decoder to be all zeros. Any of those will work, as long as you train the network to expect that kind of input. The question is: negations and XORs, would this cause a problem? Maybe. This is a hard problem, and you need a lot of data and a lot of flops to hope the network can disentangle it. But basically, the recurrent unit takes three things as input.
[00:16:52] In the decoder, it takes the previous decoder hidden state, the current context vector, and the current token in the output sequence; from those we produce the next hidden state, and from the next hidden state we go and predict the output token. That's actually the same setup as in the non-attention case. There's an implicit connection from s0 to s1 that we're not drawing; there should have been another arrow from s0 to s1, and I think I just dropped it, so sorry about that. We're basically letting the network decide for itself to look back at any part of the input sequence that it thinks might be relevant for the task at hand.
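A minimal sketch of one decoder tick with those three inputs, assuming a plain tanh recurrence; the weight names and sizes are illustrative, and a real model might use an LSTM or GRU cell instead.

```python
import numpy as np

def decoder_step(s_prev, c_t, y_prev, Wh, Wc, Wy, Wout):
    """One tick of an attention-equipped decoder RNN.

    Inputs: previous decoder hidden state, current context vector,
    and the embedding of the previous output token. Returns the next
    hidden state and logits over the output vocabulary.
    """
    s_next = np.tanh(Wh @ s_prev + Wc @ c_t + Wy @ y_prev)
    logits = Wout @ s_next  # scores used to predict the output token
    return s_next, logits

rng = np.random.default_rng(1)
D, C, E, V = 4, 4, 3, 5  # hidden, context, embedding, vocab sizes
Wh = 0.1 * rng.standard_normal((D, D))
Wc = 0.1 * rng.standard_normal((D, C))
Wy = 0.1 * rng.standard_normal((D, E))
Wout = 0.1 * rng.standard_normal((V, D))

s0 = np.zeros(D)  # one of the initialization choices mentioned above
s1, logits = decoder_step(s0, rng.standard_normal(C),
                          rng.standard_normal(E), Wh, Wc, Wy, Wout)
```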
[00:17:30] The reason we think this mechanism is plausible and might be helpful for the network is that we know that in a language task there often is some kind of correspondence between words in the output and words in the input, and we want to let the network look back and pick out the relevant bits of the input for producing this bit of the output. But again, we're not directly supervising it; we're not telling it how to use these attention scores. The intuition is just that we think that's a plausible thing it might choose to do, given this mechanism. Okay, so that's sort of one tick of the output. And now basically we do it again: we do this whole process again every time we tick the decoder RNN, right? Remember, the problem we were trying to solve is that previously the decoder was bottlenecking through a single vector.
[00:18:13] Now, instead of bottlenecking through a single vector, we're going to repeat this whole process and compute a new context vector for the second time step of the decoder, letting it go back and look at the whole input sequence yet again. So now, given our s1, which is the first computed hidden state in the decoder, we take s1, go back, and use our attention mechanism to compute similarity scores between s1 and all of the hidden states in the encoder. We compute those similarity scores using the exact same linear projection that we used at the first time step.
[00:18:52] We compute these alignment scores again, cram them through a softmax to get a new distribution over the input sequence for the second decoder time step, and now compute a new linear combination of the encoder hidden states, weighted by this new distribution computed at the second time step. This gives us a new context vector c2, which is a different summarization of the input sequence, now computed as a new linear combination of the input encoder hidden states. And then the whole thing iterates, right? We have a new context vector; we use it to run another tick of our decoder RNN unit, which now does include that mysterious missing arrow that wasn't there on the previous time step.
[00:19:37] So then, given our new context vector, given the next token of the output sequence, and given the s1 hidden state of the decoder, we compute a new decoder state s2, and from that compute another token of the output sequence. And again, remember that in this case it's producing 'il', which maybe is 'the' according to the slide; I hope that's true. In this case there's maybe a one-to-one correspondence between the word the network is trying to produce for this sequence and one of the words in the input. So we might expect that the network should put relatively high attention weight on just one of the words in the input sequence and relatively low attention weight on all the other words. But again, we don't supervise this.
[00:20:18] The network is deciding for itself how to make use of this mechanism, all driven by gradient descent on our training task. And we're just going to repeat that whole process for every tick of the decoder RNN. So now this basically solves our problem, right? We are no longer bottlenecking the input sequence through a single fixed-length vector. Instead, we have this new mechanism where at every time step of the decoder, the network looks back at the entire input sequence, re-summarizes it to generate a new context vector on the fly for this one time step of the decoder, and then uses that to produce the outputs. This is a pretty cool mechanism, and it's called attention because the network is attending to, or looking at, different parts of the input sequence at every moment in its output.
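Putting the pieces together, the repeated per-step attention can be sketched as a loop. This is a simplified toy under stated assumptions (the token input to the recurrence is omitted, and all sizes and weights are made up), not the exact model from the slides.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(2)
T_enc, T_dec, D = 4, 3, 5
H = rng.standard_normal((T_enc, D))        # encoder states h1..h4
W_att = 0.1 * rng.standard_normal((D, D))  # shared alignment projection
W_rec = 0.1 * rng.standard_normal((D, 2 * D))

s = np.zeros(D)   # initial decoder state s0
attn_rows = []    # one attention distribution per decoder tick

for t in range(T_dec):
    # Fresh alignment scores at every step: compare the current decoder
    # state against every encoder hidden state via the same projection.
    scores = H @ (W_att @ s)
    a = softmax(scores)        # distribution over the input tokens
    c = a @ H                  # new context vector for this step
    # Simplified recurrence: next state from previous state + context.
    s = np.tanh(W_rec @ np.concatenate([s, c]))
    attn_rows.append(a)

attn_matrix = np.stack(attn_rows)  # shape (T_dec, T_enc)
```

Each row of `attn_matrix` is a fresh summary of the input sequence, so no single fixed-length vector has to carry everything.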
[00:21:04] So we talked about these attention weights, and we said that the network was learning for itself how to set them based on its training data and its training task. Another really cool thing about attention is that it also gives us a way to introspect and see what the network is looking at as it tries to solve this problem. We never told it what the alignment was between the input sequence and the output sequence, but by looking at the attention weights the network predicts when trying to solve this task, we get a sense of what it was looking at while trying to solve the problem. That gives us a way to interpret the processing of the neural network in some way.
[00:21:45] One thing we can do is go and look at, in the process of processing a particular sequence, the attention weights the network predicted when trying to do this task, and we can visualize these in a two-dimensional grid. Here we're looking at an example of English-to-French translation. Across the top we have our input sequence: 'The agreement on the European Economic Area was signed in August 1992.' Running down the rows is the output sequence, which is in French, which I will not attempt to pronounce. Remember, the way this attention mechanism worked is that each time the network produced one of these words in the output sequence, it predicted a probability distribution over the entire input sequence.
[00:22:32] We visualize that in the first row. If you look at the first row of this matrix, we're visualizing the predicted probability distribution over the entire input English sentence. We see that when trying to predict the first word of the French sentence, it puts a lot of probability mass on the English word 'the' and basically no probability mass on any of the other words. Then, when predicting the second word of the output sequence, it goes back and predicts a new distribution over the entire input sequence, and that's the second row in this matrix. You can see that for 'accord', it puts a lot of probability mass on 'agreement' and no probability mass anywhere else.
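That kind of grid can be inspected programmatically too. The matrix below is invented to mimic the pattern just described (it is not data from the paper); each row is the predicted distribution over the input words for one output word.

```python
import numpy as np

# Rows: output (French) words; columns: input (English) words.
# Hypothetical attention weights, each row summing to one.
attn = np.array([
    [0.90, 0.05, 0.03, 0.02],  # output word 1 attends to input word 1
    [0.04, 0.88, 0.05, 0.03],  # output word 2 attends to input word 2
    [0.03, 0.05, 0.07, 0.85],  # reordering: attends to a later word
    [0.05, 0.04, 0.86, 0.05],
])

# For each output word, the input word with the most attention mass.
alignment = attn.argmax(axis=1)
# Runs where `alignment` increases by one give diagonal structure
# (one-to-one, in-order correspondence); inversions, like the last two
# rows here, reveal word-order differences between the languages.
```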
[00:23:11] So that gives us some sense that the network actually did figure out the alignment between the input words and the output words when doing this translation task. And some interesting patterns pop up when we see diagonal structures in this attention matrix: that means there was a one-to-one, in-order correspondence between the input sequence and the output sequence. In particular, we see that 'The agreement on the', the first four words of the input sequence, correspond to this diagonal structure in the attention matrix. That means the network has decided for itself that these first four words of the input sequence align, or match up, with the first four words of the output sequence. And the same thing holds for the last several words.
[00:23:55] Again, we see this diagonal structure at the end of the sequence, which means that 'in August 1992' corresponds to these last couple of words in the French sequence; again, there's a one-to-one correspondence between words in the output and words in the input. But we see some other interesting stuff in the middle. In the middle we have 'European Economic Area', but in the French we see words that look kind of like those, in a slightly different order. Good question: how does it figure out the grammar? That's the mystery of deep learning. Basically, we didn't tell the network anything about grammar. We supervised it with a lot of input-output pairs: here's an input sequence in English, here's an output sequence in French.
[00:24:38] Here's a mechanism for processing this; learn via gradient descent to set the weights of this architecture so as to produce this output from this input. We never told it anything about grammar. But because we as human designers have the intuition that there ought to be some correspondence between some of the words, we bake in a mechanism that we think might be helpful for solving this problem, and the network figures out for itself, in the process of doing the end-to-end task, how to make use of that mechanism to solve the problem we set for it. It's pretty amazing that it works. But in this case, it kind of figured out some of the grammar for itself.
[00:25:19] We see this non-diagonal, sort of backward-diagonal structure in the attention matrix here, which means the network figured out for itself the different word order between English and French. Or, in the middle, you see a little 2x2 grid, and that corresponds to a situation where there might not have been a one-to-one correspondence between the English words and the French words; there might have been two French words that corresponded to two English words, and they didn't disentangle perfectly. The network just figures all of this out for itself over the process of training on a lot of data and putting a lot of compute through it, and that's pretty cool.
[00:25:55] Okay, so that's the basic idea, and this actually was the initial usage of attention in machine learning; it came from these machine translation problems. This was from a paper back in 2015, "Neural Machine Translation by Jointly Learning to Align and Translate." That paper just won the runner-up Test of Time award at ICLR 2025, so that's pretty cool; this has been a really impactful paper over time. But it turns out that there's actually a more general idea here, a more general operator hiding here. We approached this problem from the perspective of trying to fix our recurrent neural networks.
[00:26:35] But it turns out that the mechanism we used to fix the recurrent neural networks is actually something general, interesting, and really powerful in its own right. So now we want to pull out this idea of attention and divorce it from the recurrent neural networks. It turns out that attention will be a very useful and powerful computational primitive for neural networks in its own right, even if we then cut away the recurrent neural network part and are just left with attention as the core primitive in our architecture; that's where we're going. So now what we want to do is take this idea of attention as we saw it in recurrent neural networks, try to generalize it, and carve out an independent operator that can be used on its own. So let's think about what this attention mechanism was doing.
[00:27:20] Basically, this attention mechanism had a bunch of query vectors. Well, maybe it makes sense to talk about these in the other order. There are data vectors, which are data that we want to summarize; these are the hidden states of the encoder RNN. We have this input sequence, and we've summarized it into a sequence of vectors, and that sequence of vectors is the data we think is relevant for the problem we're trying to solve. Now, in the process of making use of that data, we want to produce a bunch of outputs, and for each output we have a query vector: a vector that we're using to produce some piece of output. In this case, the query vectors are the hidden states of the decoder RNN.
[00:28:05] And we have this property that for each query vector, we want to go back, look at the data vectors, and summarize the information in the data vectors into a context vector. Okay, from the perspective of attention this gets a little bit weird. The outputs of the attention operator are the context vectors we just talked about for the RNN. So if we're thinking about just what the attention operator does: its outputs were the context vectors that we feed into the RNN. Then what is the attention operator doing? It is taking a query vector, going back to the input data vectors, and summarizing the data vectors in some new way to produce an output vector. That's what the attention operator is doing.
[00:28:49] Does that kind of make sense as a generalization of the attention mechanism we just saw? Yeah. I'll repeat it again, because it's tricky; there's a lot of stuff flying around here, a lot of boxes, and we're changing the words we use to define the boxes, so I get it, there's a lot happening. What the attention operator is doing: there's a bunch of data vectors, which are the encoder hidden states. Then we have a bunch of query vectors, which are the things we're trying to produce output for. Now, in the process of processing a query vector, we go back to the data vectors, summarize them in a new, custom way for each query vector, and that produces an output vector, which is the context to be fed into the next tick of the RNN. Right?
[00:29:34] So our query vectors are these guys in green. For each query vector, we go back to the data vectors, summarize them, and produce a new output vector, which is one of the contexts that we then feed into the rest of the network. This is kind of tricky, because we're trying to go into this architecture and carefully cut the attention part out from the RNN. So we're going to walk through this again from the perspective of just the attention operator. From that perspective, we start with just one query vector at first, which is one of the states in our RNN. We also have a bunch of data vectors, which are the encoder hidden states in the RNN.
[00:30:15] Now, the computation we want to perform: first, compute similarities between that query vector and all of the data vectors. This is the exact same thing we just saw, just written in a different way. We use this f_att function to compute similarity scores between each data vector and our one query vector. Then, once we have those similarities, we squash them through a softmax to get attention weights, and this will be a distribution over the data vectors that has been computed on the fly for this one query vector. Then we want to produce an output vector, and this output vector is a linear combination of our data vectors, where the linear combination weights are the attention scores we just computed. So this is the output of the attention layer.
[00:30:59] And then, in the context of the larger RNN that we saw, the output of the attention layer, or the attention operator, will become an input to the next tick of the decoder RNN. But we're trying to deprecate the RNN, so we don't want to talk about that; we just want to focus on the computation happening inside the attention layer. So this is basically the operator we saw in the RNN, right? We did this process over and over: take a query vector, use it to compute similarity scores, get attention weights, get an output vector. Then we got a new query vector. Where did that query vector come from? The attention operator doesn't care. Get a new query vector, go back, summarize the data vectors, get a new output vector. That's the core of the attention operator.
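The single-query procedure just described (similarity scores, softmax, weighted sum) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the course's reference code; here a plain dot product stands in for the general f_att similarity function, and all names and numbers are made up for the example.

```python
import numpy as np

def softmax(s):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(s - s.max())
    return e / e.sum()

def attend_one_query(q, X):
    """One query vector q (shape (d,)) attends over data vectors X (shape (n, d))."""
    scores = X @ q             # similarity of q with each data vector, shape (n,)
    weights = softmax(scores)  # attention weights: a distribution over the n data vectors
    return weights @ X         # output: linear combination of the data vectors, shape (d,)

q = np.array([1.0, 0.0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = attend_one_query(q, X)   # one output vector for this one query
```

Note that a new query just means calling `attend_one_query` again with the same data vectors; the operator doesn't care where the query came from.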
[00:31:40] So now let's try to generalize this and make it an even more powerful computational primitive. Yeah, so in principle this f_att doesn't have to be anything in particular; it could be any function of two vectors that outputs a scalar. But in practice we're actually going to make it simpler in a couple of slides. In principle, yes, you could slot in any function you wanted there. Okay, so the first generalization we're going to do is actually the opposite of what you just suggested: make that similarity function simpler. We said that in principle it can be any function that takes two vectors and gives a similarity score. What's the simplest possible function that inputs two vectors and gives us a scalar similarity score? It's a dot product. So we want to make things simpler and also generalize at the same time.
[00:32:23] And it turns out that a dot product is a good enough similarity score to be used for this purpose. So the first thing we're going to do is use only dot products to compute similarity. But it turns out there's a slight problem with dot products, and this one is kind of subtle, because there's a weird interaction between the dot product and the softmax. It has to do with what happens when the dimension of those vectors scales up or down. The motivating example: if you scale up the dimension of the vector, say a constant vector of all ones of dimension 10 versus a constant vector of all ones of dimension 100, then as we go to the higher-dimensional vector, when we compute the sum inside that softmax, we're going to be dividing by a larger number.
[00:33:10] So we'll end up with more squashed probability scores as we go to higher-dimensional vectors. That can lead to vanishing gradients, as we saw in the previous lecture, and prevent this whole thing from learning. So, as kind of a slight hack to prevent that, and to make this architecture more generalizably scalable up and down to vectors of different dimensions, what we're going to do is not use the pure dot product, but scale the dot product down by the square root of the dimension of the vectors we're looking at. This is just a way to prevent vanishing gradients and give nicer gradient flow through the softmax for a wider range of vector dimensions.
[00:33:49] And this turns out to be very important, because as we make these networks bigger and bigger over time, we want higher-dimensional vectors, because that gives us more compute, more capacity. So we always want to think about how our architectures will scale as the parts of those architectures get bigger and bigger. So this scaled dot product is actually really important for preventing vanishing gradients here. Yeah, the question was whether we're limited to data and query vectors of the same size; we'll actually fix that. So our first generalization was to use scaled dot-product similarity as our similarity measure. Now, if we go back and look at the shapes of these things, we have one query vector of dimension d_q, and we have data vectors of shape n_x by d_q as well; because it's a dot product, they need to match.
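The squashing effect can be seen numerically with a small sketch (made-up numbers, purely for illustration): with random vectors, the raw dot product grows in magnitude with the dimension d, so the softmax saturates toward putting nearly all its weight on one data vector, while dividing by sqrt(d) keeps the scores, and hence the softmax, in a comparable range across dimensions.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())   # shift by the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
for d in (10, 100, 1000):
    q = rng.standard_normal(d)
    X = rng.standard_normal((5, d))   # five random data vectors
    raw = X @ q                       # raw dot products: std grows like sqrt(d)
    scaled = raw / np.sqrt(d)         # scaled dot products: std stays around 1
    print(d, softmax(raw).max(), softmax(scaled).max())
```

As d grows, the largest raw-softmax probability drifts toward 1 (an almost one-hot distribution, where gradients through the softmax vanish), while the scaled version stays spread out.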
[00:34:30] But there's actually a next generalization we're going to do: have multiple query vectors. Maybe we don't want to process just one query vector at a time; we want the ability to process a whole set of query vectors all at once. This kind of happened in the RNN: we did end up with a bunch of query vectors. And it's useful for the attention operator to be able to process not one query vector at a time, but a whole set of query vectors in parallel, performing the exact same computation for each of them. So in this case, we've now generalized: Q is a matrix of shape n_q by d_q, so we have n_q query vectors, each of dimension d_q. Our data vectors are a matrix of shape n_x by d_q.
[00:35:18] And now the computation changes a little bit, because when we compute these alignment scores, these similarities, we basically want to compute all pairs of similarities between all of the input data vectors and all of the input query vectors. And each one of those similarities is a dot product; well, a scaled dot product. So what's a very efficient, easy, and natural way to compute dot products between two sets of input vectors? That turns out to be exactly a matrix multiply. Because remember, when you do a matrix multiply, each entry in the output matrix is the inner product of one of the rows of your first matrix and one of the columns of your second matrix.
[00:36:04] So by computing a matrix multiply between our query vectors Q and our data vectors X (you need to get a transpose in there to make the rows and columns match up in the right way), we can compute all the similarities between all the data vectors and all the query vectors in one simple matrix multiply. Now, we still need to compute the attention weights. Remember, for each query vector we want to compute a distribution over the data vectors. Our similarity scores are no longer a single vector of scores; they're a matrix of scores giving all the similarities. But we still want to compute a distribution over the data vectors for each query vector independently.
[00:36:45] So now we need to compute the softmax over just one of the axes of that matrix of similarity scores. This is basically the exact same computation we just saw; we're just doing it in parallel for a set of query vectors all at once. Now we need to compute the output vectors. Remember, the output vectors are going to be a weighted combination of the data vectors, where those weights are the values in the softmax. And it turns out that this is also something matrix multiply does. Another way to think about a matrix multiply of two matrices is that it takes a linear combination of, oh man, am I going to get the rows and the columns the right way?
[00:37:28] But I think you get it: a matrix multiply takes linear combinations of the rows of one of your input matrices, weighted by the values in the other input matrix. So that's another interpretation of matrix multiplication. If you work through the indices and draw some little pictures to prove to yourself what's going on: what we now want is to compute many linear combinations of the data vectors, where each linear combination is given by the probabilities in one of the rows of the attention matrix. We can compute all of these at once with another matrix multiply, between the attention matrix A and the data vectors X. And again, you need to get the transposes in the right order to make this work out.
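The two matrix multiplies described here, one computing all pairwise similarities and one taking the weighted combinations, can be sketched as follows. This is an illustrative sketch with made-up shapes, not the course's reference implementation:

```python
import numpy as np

def attention(Q, X):
    """Scaled dot-product attention for a set of queries.

    Q: (n_q, d) query vectors; X: (n_x, d) data vectors.
    Returns (n_q, d): one output vector per query.
    """
    d = Q.shape[1]
    S = Q @ X.T / np.sqrt(d)                 # (n_q, n_x): all pairwise scaled similarities
    S = S - S.max(axis=1, keepdims=True)     # stabilize the softmax
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)     # softmax over the data axis, per query
    return A @ X                             # weighted combinations of the data vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # four queries of dimension 8
X = rng.standard_normal((6, 8))   # six data vectors
Y = attention(Q, X)               # shape (4, 8)
```

Each row of A is a distribution over the six data vectors, and each row of Y is the corresponding weighted average, so the whole set of queries is processed with just two matrix multiplies.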
[00:38:08] But basically, this is the exact same operation that we just saw; we're now doing it for a set of query vectors all at once, and it turns out we can do it all with just a couple of matrix multiplies. The next way we'll generalize this: notice that in this equation, the data vectors X actually enter in two different places in the computation. The first place we use the data vectors X is to compute similarities with the query vectors. In that computation we're saying: hey, data vector, how much do you line up with each query vector, as measured by an inner product? But then we're also using the data vectors again to compute the output vectors.
[00:38:50] The output vectors are now a linear combination of the data vectors, weighted by our attention weights. And it maybe seems a little bit weird to reuse the data vectors in those two different contexts. So now what we want to do is separate those two usages of the data vectors, and let the network figure out for itself two different ways to use the data vectors in those two contexts. To do that, we'll introduce this idea of keys and values. We had a set of data vectors, and now, for each data vector, we're going to project it into two vectors: one is a key vector, one is a value vector.
[00:39:32] The idea is that the key vectors are going to be compared with the query vectors to compute the alignment scores, and the value vectors are what we're going to compute linear combinations of in order to compute the output from the layer. The way we implement this is to add two learnable weight matrices, the key matrix and the value matrix, which are linear projections that project the data vectors into key vectors and value vectors. Remember, we have N data vectors, each of dimension D_X. The key matrix is a linear transformation that projects from D_X into D_Q, because we're going to compare the key vectors with the query vectors, so they need to have the same dimension as the query vectors.
[00:40:17] So applying the matrix multiply K = X W_K projects each data vector into a key vector of dimension D_Q. Then we'll separately have another weight matrix that projects from D_X to D_V, the dimension of the value vectors, which in principle could be different from the query vector dimension. We separately project each data vector into a value vector, again with a matrix multiply. And the intuition here is that it's kind of like a search engine: you want to separate what you're looking for from the answer you want in response to that query, right?
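The key and value projections described here can be sketched by extending the same NumPy setup. Again a minimal sketch: the dimensions (dx, dq, dv) and names are illustrative, and dv is deliberately chosen different from dq to show they are independent:

```python
import numpy as np

def softmax(s, axis=-1):
    # numerically stable softmax along one axis
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n, m = 5, 2                     # n data vectors, m query vectors
dx, dq, dv = 4, 3, 6            # value dimension dv may differ from query dimension dq
X = rng.normal(size=(n, dx))    # data vectors
Q = rng.normal(size=(m, dq))    # query vectors

Wk = rng.normal(size=(dx, dq))  # learnable key matrix: projects dx -> dq
Wv = rng.normal(size=(dx, dv))  # learnable value matrix: projects dx -> dv

K = X @ Wk                      # (n, dq) keys: compared against the queries
V = X @ Wv                      # (n, dv) values: linearly combined for the output
A = softmax(Q @ K.T, axis=1)    # (m, n) attention weights, one distribution per query
Y = A @ V                       # (m, dv) output vectors
```

Note that the keys must share the query dimension dq (for the inner products), while the values only need to agree with the output dimension.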
[00:40:56] So you go to Google, or these days ChatGPT, and you type in something like "what is the best school in the world." That's your query, and it needs to be matched against the keys in the back end. But the value, the data you want to get back, is actually different from the query you typed in. Your query has to go match against all the different strings on the internet, and then the value you want back is "Stanford," which is a different value from the query you put in. So that's another intuition for separating the queries, the keys, and the values in this way. The query is what I'm looking for.
[00:41:36] The key: in the back end we have some record of all the data in the data vectors, but when we query, we want to match against potentially just part of the data vector, and the thing we want to get back from the data vector is the value. So we're separating the usage of the data vectors into those two different notions of keys and values. Then we can visualize this in a different way. Now we're finally throwing away the RNN and looking at attention just as an operator on its own. So we can step through this operation again. We've got our query vectors coming in, and we've got our data vectors coming in. From the data vectors, we project each data vector into a key and a value. Then we compare each key with each query to get our similarity scores. Right?
[00:42:21] So this is a matrix of scalars giving the similarities between each key and each query. Once we have this matrix of similarity scores, we want to compute, for each query, a distribution over the data vectors. That means we need to run softmax over this matrix of alignment scores; we compute the softmax over each row. Then what we want to do is reweight the value vectors by the attention scores in the softmax. Oh, actually, sorry: we want each column to be a distribution, because for each query we want a distribution over the keys, which means we want softmax over the columns, the way the picture is aligned here.
[00:43:10] So then, for query one, we've predicted this distribution over all of the keys from this computation. Then we take a linear combination of the value vectors, weighted by these attention weights, to produce our first output vector y1. And the same thing happens over here: our second query got compared with all the keys, we computed a distribution over those alignment scores to get a distribution over the keys for the second query, and then we use those weights to linearly combine the values to produce our second output vector. So this is now the attention operator standing on its own, divorced from the recurrent neural network. The question is: how do you divide the data vector into keys and values?
[00:43:52] The beautiful part is that we don't have to say how. We just give the neural network the capacity to split it by itself, by giving it this mechanism to project separately into keys and values, but we're not going to tell it how to do it. The key matrix and the value matrix are just going to be learnable parameters of the model that will be learned via gradient descent along with everything else. Just as we did not tell it how to align the English and the French sentences, and all of that was learned via gradient descent, the model will learn for itself how to separately project into keys and values in a way that's helpful for the problem it's trying to solve.
[00:44:29] You might think of the keys and values as some kind of filter. The data vector might have a lot of stuff in it, but for the task at hand we might want to filter the data vector in various ways: only try to match our queries against part of it, and only care about retrieving information from a different part of it. So you could think of those as filtering the information in the data vector in two different ways. Okay, so this is basically our attention operator. And there's no RNN here; this is just a neural network layer that could stand on its own, right? It receives two inputs, the query vectors and the data vectors. It has two sets of learnable parameters, the key matrix and the value matrix. It inputs two sequences of vectors and outputs a sequence of vectors.
[00:45:10] So this is a neural network layer in its own right that you could start to plug into your neural network architectures in various places. This is sometimes called a cross-attention layer, because it has two sets of inputs coming in: we have both data vectors and query vectors, potentially coming from two different sources. And this is sometimes useful: I have a set of queries, and for each query I want to go and summarize information from my data, which is potentially a different number of vectors, or totally different in kind, from my query vectors. So it's called a cross-attention layer because we're cross-attending between two different sets of things. But there's another version of this that happens maybe even more commonly: a self-attention layer.
[00:45:53] Here, we only have one set of things, one sequence of inputs: one sequence of vectors that we're processing. So now we no longer have this separation between data vectors and query vectors; we just have one set of input vectors that we would like to process. In a self-attention layer, we input a set of vectors X and output a set of vectors Y, the same number as the input vectors. The mechanism is basically the same attention mechanism that we just saw, and we're still going to use this notion of filtering, but rather than projecting our data vectors into keys and values as we previously did...
[00:46:35] ...now what we're going to do is take each one of our input vectors and project it into three different things: a query, a key, and a value. The equations change just a little bit, but the picture over here doesn't actually change very much. Each of our input vectors is separately projected to a query, a key, and a value, and then we have the exact same computation: we've got queries, we've got keys, we've got values. From the perspective of everything happening up here, it's all the same; it just so happened that we computed the queries, keys, and values from different linear projections of those same input vectors, but all the computation is otherwise shared. Yeah?
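Putting the three projections together, a self-attention layer might be sketched as below. This is a minimal sketch under stated assumptions: it takes the input and output dimensions equal, and it omits refinements used in practice (such as the 1/sqrt(d) score scaling and multiple heads):

```python
import numpy as np

def softmax(s, axis=-1):
    # numerically stable softmax along one axis
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One self-attention layer: queries, keys, and values all come from the same X."""
    Q = X @ Wq                    # (n, dq) queries
    K = X @ Wk                    # (n, dq) keys
    V = X @ Wv                    # (n, dv) values
    A = softmax(Q @ K.T, axis=1)  # (n, n): row i is a distribution over all inputs
    return A @ V                  # (n, dv) outputs, same count as inputs

rng = np.random.default_rng(2)
n, dx, dq, dv = 5, 4, 4, 4        # equal dimensions, as is typical in practice
X = rng.normal(size=(n, dx))
Wq, Wk, Wv = (rng.normal(size=(dx, d)) for d in (dq, dq, dv))
Y = self_attention(X, Wq, Wk, Wv)
```

The only change from the cross-attention sketch is that there is no separate Q input: a third learnable matrix Wq produces the queries from the same input vectors.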
[00:47:20] The question is: what are D-in and D-out, and how are they sized? These are architectural hyperparameters of the layer, right? Just as a learnable linear layer in a model projects from a D-in to a D-out, and those are architectural hyperparameters of that layer, the same is true for a self-attention layer: D-in and D-out are architectural hyperparameters. And in principle they could be different; there's enough flexibility in this architecture that D-in and D-out could differ, although I don't think I've almost ever seen that. In practice they're almost always the same, so I've been a little bit extra general in the notation here. Okay. So I don't know that we necessarily need to walk through this.
[00:48:00] Oh, actually, there is one important thing. I said that we separately project the inputs into queries, keys, and values. That happens via three matrix multiplies with our three learnable weight matrices: one for keys, one for values, one for queries. We separately project the input vectors X into keys, queries, and values. But in practice we can typically compute them with just one matrix multiply, because it's typically more efficient on hardware to do fewer large matrix multiplies than to do more smaller matrix multiplies.
[00:48:36] So a pretty common trick in practice is to fuse, to concatenate, these three matrices along one dimension and compute all of the keys, queries, and values for all the input vectors at once with one big matrix multiply. If you've read about transformers before, they sometimes distinguish between encoder and decoder transformers, or encoder-decoder attention. In that case, this would be the decoder-only attention, which corresponds to the decoder of the RNN example at the beginning of class. But this mechanism, this so-called decoder-only attention, is actually the most commonly used flavor of attention nowadays. So we are quite divorcing ourselves from the RNN now, right?
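The fusion trick mentioned here can be checked numerically: concatenating the three weight matrices along the output dimension turns three small matmuls into one big one, and splitting the result recovers the separate projections. A sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
n, dx, dq, dv = 5, 4, 3, 6
X = rng.normal(size=(n, dx))
Wq = rng.normal(size=(dx, dq))  # query projection
Wk = rng.normal(size=(dx, dq))  # key projection
Wv = rng.normal(size=(dx, dv))  # value projection

# three separate (small) matrix multiplies
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# one fused (large) matrix multiply: concatenate weights along the output dimension
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)            # (dx, dq + dq + dv)
Qf, Kf, Vf = np.split(X @ W_qkv, [dq, 2 * dq], axis=1)  # split the result back apart
```

The fused result splits back into exactly the same Q, K, and V; the benefit is purely hardware efficiency, since one large matmul keeps the accelerator busier than three small ones.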
[00:49:20] So this flavor of it doesn't really make sense to use in the RNN that we saw at the beginning of class, right? We've basically been doing a little bit of sleight of hand here: we introduced this architecture for the purpose of the RNN, in the very concrete case of sequence-to-sequence machine translation, but we've now generalized it into a totally different operator that can be used all on its own. And in this particular generalization into self-attention, it actually no longer can be used in that decoder in the RNN. But it's a very useful primitive that gets used in a lot of other places. The question is: what's the benefit or difference between self-attention versus cross-attention? They get used in different contexts.
[00:49:55] In some situations you naturally have two different kinds of data that you want to compare, which we saw, for example, in the machine translation setting: we have an input sentence and an output sentence, and we believe there's some natural structure in the problem, two different sets of things we want to compare. That also might happen in, say, image captioning: we have an input image and we want to produce an output sentence, so there are two different kinds of things we want to compare, pieces of the image and tokens in the words we're generating. So for some problems there's just this natural structure where you have two different kinds of things floating around. But for other problems there aren't two kinds of things, there's just one. Say you're doing image classification: there's only an image; we just want to process the
[00:50:34] So in that case we just want to compare parts of the image with itself, and that's where you use a self-attention layer. They just get used for different kinds of problems. But crucially, we want to reuse basically the same machinery and the same computational primitives across those different kinds of problems, and that's really beneficial. There are a couple of interesting things about attention that I want to get through. One is: let's consider what happens if you permute the inputs. We had a set of input vectors; what happens if you shuffle them and process them in a different order? Actually, a lot of interesting stuff happens. The keys, the queries, and the values will all end up the same, right? Because they are computed as linear projections of the input.
[00:51:14] So we'll end up getting the same keys, queries, and values; they'll just be in a different order, shuffled in the same way that the inputs were. And now, because our similarity scores were just dot products, we'll also end up with the same similarity scores, again shuffled in accordance with the way we shuffled the inputs. Same thing with the softmax: softmax doesn't care about the order of its inputs, so it's now operating on the same vector, just shuffled. So each column of our attention weights will end up the same as before, just shuffled. And then the same with the linear combinations: our outputs y will still be the same outputs as before, just shuffled. So that means there's a really interesting structure here called permutation equivariance.
[00:51:54] Remember, we saw this a couple of lectures ago with convolution. Now we see a different equivariance property of these self-attention layers: if we shuffle the inputs, then we get the same outputs, just shuffled in the same way that the inputs were shuffled. This means that self-attention doesn't actually care about the order of the inputs. If we change the order of the inputs, we get the same outputs, just shuffled in the same way; the computation of the layer does not depend on the order in which we present the inputs. So that means we can think of self-attention as not really operating on sequences of vectors; they just happen to be packed into an ordered sequence, a matrix.
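The shuffling argument above can be checked directly. Here is a minimal NumPy sketch of a single-head self-attention layer (the function name, random weights, and shapes are mine, not the lecture's notation) demonstrating that permuting the input rows permutes the output rows in exactly the same way:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention: Q, K, V are linear projections of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    e = q @ k.T / np.sqrt(k.shape[1])            # dot-product similarity scores
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a = a / a.sum(axis=1, keepdims=True)         # softmax over each row
    return a @ v                                 # weighted sums of the values

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

perm = rng.permutation(n)
y = self_attention(x, wq, wk, wv)
y_perm = self_attention(x[perm], wq, wk, wv)

# Permutation equivariance: shuffled inputs give the same outputs, shuffled.
assert np.allclose(y_perm, y[perm])
```

With the weight matrices fixed, the check holds for any permutation, which is the property being described here.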
[00:52:35] But we really think of it instead as operating on an unordered set of vectors, because the outputs that we get don't actually depend on what order we've packed those vectors into our input matrix. So we really think about this as a different kind of neural network primitive that fundamentally operates on sets of vectors rather than sequences of vectors. But this is sometimes a problem: sometimes it is useful to tell the neural network what the order of the entries is. So as a quick fix to that, we'll sometimes concatenate an additional piece of data onto each of the input vectors, called a positional embedding. That is basically some piece of data that tells the neural network this one's at index one, this one's at index two, this one's at index three, and so on. And there are a bunch of different mechanisms for that.
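As one illustration of such a mechanism (the lecture only says several exist), here is a sketch of sinusoidal positional embeddings, a common choice, concatenated onto the inputs as described above; many implementations add them to the inputs instead:

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Sinusoids of different frequencies encode each position index."""
    pos = np.arange(n)[:, None]                  # (n, 1) token indices
    freq = 10000.0 ** (-np.arange(0, d, 2) / d)  # (d/2,) frequencies
    emb = np.zeros((n, d))
    emb[:, 0::2] = np.sin(pos * freq)
    emb[:, 1::2] = np.cos(pos * freq)
    return emb

n, d = 6, 8
x = np.random.randn(n, d)
# Concatenate position info onto each input vector, breaking the
# permutation symmetry of the self-attention layer.
x_with_pos = np.concatenate([x, sinusoidal_positions(n, d)], axis=1)
assert x_with_pos.shape == (n, 2 * d)
```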
[00:53:16] The question is, is it going to train to the same result? I'm not really talking about training here; I'm talking about fixing the weight matrices and just considering the computation of the layer. Then, if I shuffle the inputs, I receive the same outputs, but shuffled in the same way that the inputs were shuffled. So the question of which vectors I compute at the output does not depend on the order of the vectors in the input, but the order in which I get those vectors at the output does depend on the order they were presented in the input. There are another couple of tricks we can do with self-attention, but I'll go through these a little bit faster.
[00:53:53] Sometimes, you know, in a full self-attention layer, we allowed every piece of the input to look at every other piece of the input. But for some problems, we might want to impose some structure on this computation and say that certain pieces of the input are only allowed to look at certain other pieces, rather than everything being allowed to look at everything. We can implement this via a notion called masked self-attention. What we're going to do is, after we compute these alignment scores E, go in and overwrite the alignment scores with negative infinity in places where we want to block the attention. And if you have a negative infinity in your alignment scores, then after you do a softmax, it's going to end up as a zero, if you walk through the softmax computation.
[00:54:32] So that means that whenever there's a negative infinity in the alignment scores, we end up with a zero in the scores after the softmax, which means that that output y will not depend on the value vector computed at that index. So this is a mechanism that lets us control which inputs are allowed to interact with each other in the course of the computation. And we might want to do this for language modeling, because now we've generalized this operator to the point where we don't need an RNN at all; we can use it for the same problem that we used to use an RNN for. So now we can use it to process a sequence of words like "attention is very" and then output "is very cool."
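The negative-infinity trick can be sketched in a few lines. This is a minimal illustration (variable names are mine) of a causal mask applied to a matrix of alignment scores, showing that the softmax turns the blocked entries into exact zeros:

```python
import numpy as np

n = 4
e = np.random.randn(n, n)                   # alignment scores E
mask = np.triu(np.ones((n, n)), k=1)        # 1 above the diagonal = "future"
e_masked = np.where(mask == 1, -np.inf, e)  # block attention to the future

a = np.exp(e_masked - e_masked.max(axis=1, keepdims=True))
a = a / a.sum(axis=1, keepdims=True)        # softmax maps -inf to exactly 0

# Row i only attends to positions 0..i: the upper triangle is all zeros,
# so output i cannot depend on value vectors at later indices.
assert np.allclose(np.triu(a, k=1), 0.0)
# Each row is still a valid distribution over the allowed positions.
assert np.allclose(a.sum(axis=1), 1.0)
```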
[00:55:10] So in this case we're doing the same language modeling task that we saw last lecture with RNNs, but we can now do it natively with this self-attention block. In this case we want to make the first output depend only on the first word, and the second output only allowed to depend on the first two words; we don't want to let the network look ahead in the sequence and cheat. So here is where we would use masking. Another thing that we'll sometimes do with self-attention is called multi-headed self-attention, where you run H separate, independent copies of self-attention in parallel. Why do you want to do this? Because it's more computation, it's more flops, it's more parameters, and in deep learning we always want more and bigger.
[00:55:47] This is another way you can make this layer bigger and more powerful. So what we're going to do is take our inputs X and route them to H independent copies of separate self-attention layers. Those will each produce their own outputs Y, which we then stack up along the output, and then we have another linear projection at the output to fuse the output data from each of the independent self-attention layers. This is called multi-headed self-attention, and it's basically the format we always see in practice: whenever you see self-attention used these days, it's almost always this multi-headed version. And in practice, it turns out you can compute this all with matrix multiplies as well, so you don't have to run a for loop.
[00:56:35] You can compute each of these H copies of self-attention all in parallel if you're clever and use batched matrix multiplies in all the right places. In fact, this whole self-attention operator seems like a lot of stuff going on, but it's really basically just four matrix multiplies. We have one matrix multiply where we take our inputs and project them to queries, keys, and values. We have another matrix multiply where we compute the query-key similarities: for each Q, we compute the similarity against all the K's, and that's one big batched matrix multiply.
[00:57:08] Then, in the multi-headed case, we have another one, the value weighting, where we take linear combinations of all the values weighted by the softmax entries, and that can be done in another big batched matrix multiply. And then finally we have an output projection to mix information across the different heads of our self-attention. So even though there are a lot of equations and a lot of vectors flying around, this whole self-attention operator is basically just four big batched matrix multiplies. And that's great, because matrix multiplies are a really scalable, powerful primitive that we can distribute and optimize, and we can make this thing highly parallel, highly scalable, highly efficient. Yeah. The question is whether the x1, x2, x3 are exactly the same.
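The four matrix multiplies described above can be sketched concretely. This is an illustrative NumPy implementation (names, shapes, and the fused QKV projection are my choices, not the lecture's notation): one matmul for the QKV projection, one batched matmul for the query-key scores, one batched matmul for the value weighting, and one for the output projection across heads.

```python
import numpy as np

def multi_head_self_attention(x, w_qkv, w_out, h):
    """Multi-head self-attention as four (batched) matrix multiplies."""
    n, d = x.shape
    dh = d // h                               # per-head dimension
    # (1) project inputs to queries, keys, values for all heads at once
    q, k, v = np.split(x @ w_qkv, 3, axis=1)  # each (n, d)
    # reshape to (h, n, dh) so every head runs in parallel
    q, k, v = (t.reshape(n, h, dh).transpose(1, 0, 2) for t in (q, k, v))
    # (2) batched matmul: query-key similarity scores per head, (h, n, n)
    e = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    a = np.exp(e - e.max(axis=2, keepdims=True))
    a = a / a.sum(axis=2, keepdims=True)      # softmax over the keys
    # (3) batched matmul: weight the values by the attention scores
    y = a @ v                                 # (h, n, dh)
    # (4) output projection mixes information across the heads
    y = y.transpose(1, 0, 2).reshape(n, d)
    return y @ w_out

rng = np.random.default_rng(0)
n, d, h = 4, 8, 2
x = rng.standard_normal((n, d))
out = multi_head_self_attention(x, rng.standard_normal((d, 3 * d)),
                                rng.standard_normal((d, d)), h)
assert out.shape == (n, d)
```

NumPy's `@` broadcasts over the leading head dimension, which is exactly the "batched matrix multiply" the lecture refers to.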
[00:57:51] Yeah, they are, but we're just going to have separate copies of the self-attention layer. Critically, they all have different weights, and those weights are initialized randomly, to different values, so the heads end up learning to process the inputs in slightly different ways. This is just a way to give extra capacity to the layer. Oh yeah, the only thing different between the different heads is the weights. The architecture is exactly the same and the computation is exactly the same, but they have different weights, and those weights are initialized to different things. Other than that, it's all exactly the same. Okay, there's some stuff there that we can skip.
[00:58:26] But now we've gotten to a really interesting place, where we have three different ways to process sequences that we've seen in this class. The first is recurrent neural networks. We saw that recurrent neural networks basically operate on 1D ordered sequences. They're really cool, they're really powerful, and people liked them for a long time, but they're fundamentally not very parallelizable, because of this recurrent structure where each hidden state depends on the previous hidden state. They're just a fundamentally sequential algorithm; there's no way to parallelize them across the sequence. And that makes them very difficult to scale, very difficult to make very big. Another primitive that we've seen is convolution, and convolution basically operates on multi-dimensional grids.
[00:59:05] We've seen it on two-dimensional grids in the case of images; you can also run it on 1D grids, 3D grids, 4D grids. Convolution is basically something that mixes information locally in n-dimensional grids. This is great: it's very parallelizable, because with this notion of sliding a kernel around a grid, each position where we might place the kernel can in principle be computed in parallel. So this is a very parallelizable primitive. But it has a hard time building up large receptive fields. If we want to summarize an entire very long input sequence or an entire very large image with convolution, we either need very large convolutional kernels, or we need to stack up many, many convolutional layers. So that still introduces some fundamental sequentiality in the way we need to process large pieces of data.
[00:59:52] And now self-attention is a separate kind of primitive that operates on sets of vectors. It naturally generalizes to long sequences; there are no bottlenecks the way there are in recurrent neural networks. There's also no need to stack up many, many layers of them to let all the vectors look at each other: in one layer of self-attention, every vector looks at every other vector, so with just one layer you can do a lot of computation. It's also highly parallelizable; as we saw, the whole operation is just four big matrix multiplies, and matrix multiplies are a great primitive that we can distribute, run on GPUs, and run in very scalable distributed ways. The only downside of attention is that it's expensive: it ends up costing O(n²) compute for a sequence of length n, and O(n²) memory as well, though later techniques reduce the memory to O(n).
[01:00:38] And if your n ends up being something like 100,000, or a million, or 10 million, then n² becomes very expensive, but you can solve that by buying more GPUs. That's basically the solution that people have come up with here. So attention has become this super awesome primitive that is super powerful for processing very arbitrary pieces of data, and you might be wondering which of these three you should use. Attention is all you need: it turns out that of the three, you can get a long way using only attention. Yeah, the question is: it's parallelizable, but what's the advantage of that? The advantage is that, in the history of computing, it gets hard to make processors faster, right?
[01:01:18] We've sort of run up against a fundamental limit in hardware: it's become very difficult to make individual processors faster. But what we can do very easily is get a lot of processors, right? So the way we've been able to marshal more computation over the last two decades is by finding algorithms that do not require running on one really fast processor, but instead can make use of 10 processors, or a hundred, or a thousand, or a million processors. I want to blanket the entire Stanford campus with processors and have all of them working together in concert to process this big thing. If we can find algorithms that do that, that's how we can scale up and get really big, powerful computations.
[01:01:58] So the benefit of parallelizability is that if you have algorithms that can trivially make use of more and more processors in parallel, then we can scale up those algorithms without having to wait for individual processors to become faster, which they may never do. Yeah, is there a trade-off with the n squared? I think the n squared is actually a good thing. It seems bad: you're taught in computer science that higher exponents on that n are bad. But in the case of neural networks, more compute can actually be a good thing, because more compute means the network is doing more computation; it has more ability to think, more ability to process. So the more compute the network does on the input sequence, maybe the better the answer it can arrive at.
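To make the quadratic cost concrete, here is a rough sketch (sizes are arbitrary, and this ignores heads and activations) showing that the attention-score matrix alone has n × n entries, so doubling the sequence length quadruples the memory just for the scores:

```python
import numpy as np

d = 64
for n in (1_000, 2_000, 4_000):
    q = np.zeros((n, d), dtype=np.float32)
    k = np.zeros((n, d), dtype=np.float32)
    scores = q @ k.T                  # (n, n) similarity matrix
    print(n, scores.nbytes)           # bytes grow as n squared

# Doubling n quadruples the score-matrix memory.
assert np.zeros((2_000, 2_000), dtype=np.float32).nbytes \
       == 4 * np.zeros((1_000, 1_000), dtype=np.float32).nbytes
```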
[01:02:38] So it means that it's more expensive, but that's not necessarily a bad thing. So basically the transformer is a neural network architecture that puts self-attention at the core of everything. Our input is going to be a set of vectors X. Then we run all those vectors through self-attention, which, as we just said, is this amazing primitive that lets all the vectors talk to each other. After that we wrap the self-attention in a residual connection, for all the same reasons that we wanted to use residual connections in ResNets just a couple of lectures ago. Then we take the output of that residual connection and pass it through a layer normalization, because, as we saw with ResNets and CNNs, adding normalization inside your architectures makes them train more stably.
[01:03:18] But now there's something interesting, because what self-attention basically does is compare all the vectors with each other. That's a very useful primitive, a very powerful thing to do. But we also want to give this network the ability to process vectors independently, one by one. So there's a second primitive inside the transformer, which is the multi-layer perceptron, MLP, also called the FFN. Basically this is a little two-layer neural network that is run independently on each one of our vectors. This works in concert with the self-attention: self-attention lets all the vectors talk to each other and compare with each other, and the FFN or MLP lets us perform computation on each vector independently.
[01:04:00] We'll also wrap the MLP in a residual connection, put in a layer normalization, and put a box around the whole thing and call it a transformer block. A transformer is just a sequence of transformer blocks. And these things have gotten much, much bigger over time. The architectures haven't changed too much since 2017, when this was introduced. The original transformer was something like 12 blocks and 200 million parameters; now people are training transformers with hundreds of blocks and trillions of parameters. So this same architecture has scaled across many orders of magnitude in compute, size, and parameters over the past eight years. They can be used for language modeling, as we've already seen, and they can also be used for images.
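The block just described can be sketched end to end. This is a hedged toy version in NumPy (post-norm, as in the original 2017 arrangement); `attn_fn` stands in for any self-attention function mapping (n, d) to (n, d), and all names and shapes are my own illustration, not the lecture's code.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each vector to zero mean, unit variance (no learned scale/shift here)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, W1, W2):
    # little two-layer network, applied to each vector independently
    return np.maximum(x @ W1, 0.0) @ W2

def transformer_block(X, attn_fn, W1, W2):
    """One post-norm transformer block:
    attention -> residual -> layer norm -> MLP -> residual -> layer norm."""
    X = layer_norm(X + attn_fn(X))       # vectors talk to each other
    X = layer_norm(X + mlp(X, W1, W2))   # per-vector processing
    return X
```

A full transformer would just stack this block many times.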
[01:04:46] And here the application is fairly straightforward. Given an image, we basically divide the image up into patches and project each of those patches separately into a vector. Those vectors then get passed as inputs to our transformer, and the output gives us one output from the transformer for every patch in the input. Now if you want to do a classification problem, you do a pooling operation on all the vectors coming out of the transformer and have a linear layer that predicts your class scores. So this same transformer architecture can be applied both to language and to images, and to a lot of other things as well. I mentioned there have been a couple of minor tweaks to transformers since they were first introduced, but we're running out of time, so I'll just leave those as extra reading.
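The image recipe just described, patchify, project, transform, pool, linear head, fits in a few lines. A hedged NumPy sketch with illustrative shapes; `block_fn` is a stand-in for the stack of transformer blocks:

```python
import numpy as np

def vit_classify(img, Wproj, Wcls, block_fn, patch=4):
    """Toy ViT-style classifier sketch.

    img: (H, W, C). Split into non-overlapping patch x patch tiles,
    flatten and project each tile to a vector, run the transformer,
    average-pool the outputs, then apply a linear classification head."""
    H, W, C = img.shape
    tiles = img.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    tokens = tiles @ Wproj          # one vector per patch
    tokens = block_fn(tokens)       # stand-in for the transformer blocks
    pooled = tokens.mean(axis=0)    # pool over all patch outputs
    return pooled @ Wcls            # class scores
```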
[01:05:30] So the summary of where we get to at the end of this lecture is basically the two things I promised at the beginning. One is that we introduced attention, which is this new primitive that lets us operate on sets of vectors. It's highly parallelizable; it's basically just a couple of matrix multiplies, so it's highly scalable, highly flexible, and it can be applied in a lot of different situations. The other is the transformer, which is a neural network architecture that uses self-attention as its main computational primitive. And the transformer is basically the neural network architecture that every application in deep learning is using these days. So that's super powerful, super interesting, super exciting. Transformers have been with us for about eight years now, and I don't see them really dying anytime soon. So that's pretty exciting.
[01:06:14] So that's basically it for today's lecture. Next time we'll come back and talk about some new tasks, detection, segmentation, visualization, and see how we can use these architectures to do new cool things.

================================================================================ LECTURE 009 ================================================================================ Stanford CS231N | Spring 2025 | Lecture 9: Object Detection, Image Segmentation, Visualizing Source: https://www.youtube.com/watch?v=PTypu6GqEd4 --- Transcript

[00:00:05] Okay, today we'll be talking about different core computer vision algorithms and tasks: detection and segmentation. We will also be covering topics around visualization and understanding. I will cover the most important ones. All right.
[00:00:32] So the previous lecture, last time, what we discussed was the topic of transitioning from sequence-to-sequence models, RNNs, to transformers. We saw that transformers were defined by having some sort of encoder: a number of layers which had multi-headed self-attention and layer norm, as well as some MLP layers. This was ultimately called something that we now refer to as an encoder, encoding the sequence. Then, if we need to decode an image or a language sequence as the output, a similar type of architecture is used for the decoder, taking the encoded tokens as input and then generating, I'm hoping that you can see my cursor too, the desired output. Justin talked quite extensively about the differences of modeling sequences
[00:01:56] with recurrent neural networks, RNNs, and their variations, which we talked about last week, I think on Tuesday, and then using convolution as another approach. But ultimately, we said, self-attention is what we work with in many of the applications these days. They work much better than the other two. They are more expensive, they do add computation and memory requirements, but that comes with much better modeling of the sequence and better results on any of the tasks. So up until here it was mostly about self-attention. We also talked a little bit about cross-attention and related topics. And then we got to the topic of vision transformers, which is one of the core models being used in modern computer vision applications.
[00:03:09] We did go through this in the last minutes of the previous lecture, and I want to revisit the topic; after that I'll stop and hear any questions or comments you may have regarding the assignments and everything I've talked about so far. We talked about the fact that what we do with transformers when we want to process images is split the image into patches, basically creating a kind of sequence, right? So the image is split into S-by-S, or in this case maybe 3x3, patches, and each of those patches is then represented by what we call a token. Tokens are often a linear projection of the reshaped version of the image patch into a vector; it's basically a D-dimensional vector, as you can see in this slide. But because we have turned the image into patches, what becomes important? What are we losing here?
[00:04:33] We're basically losing the location, the position, the 2D position within the image, right? So that's why we often add something that we call a positional embedding. There are many different ways of doing this. You can create a sequence and just use sequence numbers 1, 2, 3, and so on, or you can do a 2D version with X and Y coordinates. Adding these two together creates the new token that goes into the transformer layers, with the same self-attention, layer norm, and MLP, everything we talked about last week. And then the output layer will generate the output vectors for us, which could be used for any application. One of the major applications in computer vision has been classification. We started with image classification, right?
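As a sketch of the 2D option just mentioned, one can add a row embedding and a column embedding to each patch token. This is a toy illustration with made-up names; the random vectors here stand in for parameters that would be learned, and real ViTs often simply learn one embedding per position instead.

```python
import numpy as np

def add_positional_embedding(tokens, grid_h, grid_w):
    """Add a 2D positional signal to patch tokens.

    tokens: (grid_h * grid_w, d), in row-major patch order.
    Each token gets its row embedding plus its column embedding."""
    n, d = tokens.shape
    row_emb = np.random.randn(grid_h, d) * 0.02  # learned in practice
    col_emb = np.random.randn(grid_w, d) * 0.02  # learned in practice
    rows = np.repeat(np.arange(grid_h), grid_w)  # row index of each patch
    cols = np.tile(np.arange(grid_w), grid_h)    # column index of each patch
    return tokens + row_emb[rows] + col_emb[cols]
```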
[00:05:45] So with image classification, what becomes important is to somehow be able to encode, or generate, something as the output that is representative of the class. So what we often do is add one token, a special extra input to the transformer, which is of the same dimensionality but is a learnable parameter, and in the output space whatever that slot represents is turned into the class probability vector: a C-dimensional vector of the class probabilities. That's what we often call the class token.
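A minimal sketch of the class-token mechanism, with made-up shapes: one extra learnable D-dimensional vector is stacked in front of the patch tokens, and after the transformer its output slot is projected to the C class probabilities.

```python
import numpy as np

def prepend_class_token(tokens, cls_token):
    """Stack the learnable class token in front of the patch tokens.
    tokens: (n, d); cls_token: (d,) learnable parameter."""
    return np.vstack([cls_token[None, :], tokens])

def class_probs_from_cls(outputs, Whead):
    """Read off the class-token output slot and project it to
    a C-dimensional probability vector via a linear head + softmax."""
    logits = outputs[0] @ Whead
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

During training, the self-attention layers mix information from all the patch tokens into this slot, so supervising it supervises the whole network.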
[00:06:30] So this is one of the most basic and standard ways of using ViTs, vision transformers, for image classification. But transformers are not only used for classification; they can be used for many other tasks, and we'll be covering some of those today as well. Last week we also talked about this other variant of the transformer: again we have tokens, and from the tokens we go through the transformer layers. If you remember, last time we talked about these multiple layers of transformers; as I said, positional embeddings are added, and here, because we see the entire image all together, we don't have to do masking like we did for language, because language is really a sequence for which we shouldn't be using future information. And then ultimately the transformer gives an output vector for each of the input patches.
[00:07:43] And the other option for training a transformer, instead of having a separate class token, is to just take the outputs, run them through a pooling layer, and then turn that into a probability vector over C different classes. So I talked about two versions of transformers, right? In one of them we use a class token, and in the other we take all of the output tokens and apply pooling and a projection into a vector that represents the class probabilities. How do we supervise this? It's the exact same thing we talked about earlier: backpropagation, defining a loss function, binary cross-entropy, the softmax loss, and so on. So this is ViT in a nutshell, and over the years this type of architecture has remained the same for many different applications.
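The supervision step just mentioned is the usual softmax cross-entropy. A numerically stable sketch (my own helper, not the lecture's code):

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Softmax loss for one example: negative log-probability of the
    true class, computed in log-space for numerical stability."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]
```

For uniform logits over C classes, the loss is log(C), a useful sanity check at initialization.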
[00:08:51] Many modern architectures right now use many of these components, very similar to what we presented here. But there are some optimizations that we had in the slides last week, and I'll just spend a couple of minutes on them. I want you to understand that there are many different tweaks and optimizations for better performance and also for making transformer training a little bit more stable. One of them is actually about the residual connections. This layer norm is basically outside the residual connection, which means that whatever we get here, we normalize it, right? So this means that we can't replicate any form of identity function anymore, which is exactly what ResNets really wanted to do, right? So the solution for that is to bring in the layer normalization.
[00:09:55] We often put it before the self-attention, and the second one right before the MLP layer. So the normalization is there, but we also preserve our identity function. There are also other ways of normalizing. There is RMSNorm, root mean square normalization, which is actually a very basic type of normalization: it doesn't use the mean value of each feature for normalization, but it makes the training a little bit more stable. Again, these are all empirically shown to be better options. Although there are some justifications for why they work well, mostly the reason for adopting them is just the fact that they make training more stable.
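The two fixes just described, moving the norm inside the residual branch and swapping LayerNorm for RMSNorm, can be sketched as follows (toy code with my own naming). Note that with the norm inside the branch, zeroing out the branches leaves the input completely untouched, which is exactly the identity property being discussed.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square only; unlike LayerNorm
    it does not subtract the per-feature mean."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gain * x / rms

def prenorm_block(X, attn_fn, mlp_fn, g1, g2):
    """Pre-norm block: the normalization sits inside each residual
    branch, so the skip path is a clean identity."""
    X = X + attn_fn(rms_norm(X, g1))
    X = X + mlp_fn(rms_norm(X, g2))
    return X
```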
The other option is is to instead of using a [00:10:56] option is is to instead of using a simple MLP, we use a uh sugloo MLP where [00:11:03] simple MLP, we use a uh sugloo MLP where we actually do some sort of this is what [00:11:05] we actually do some sort of this is what we call gated non nonlinearity. Instead [00:11:08] we call gated non nonlinearity. Instead of having two vectors of U weight [00:11:11] of having two vectors of U weight matrices of W1 and W2, we add a third [00:11:14] matrices of W1 and W2, we add a third one 1 2 and three. But here we create [00:11:18] one 1 2 and three. But here we create some sort of gated non nonlinearity [00:11:21] some sort of gated non nonlinearity which basically what what it uh does is [00:11:27] which basically what what it uh does is um [00:11:30] is is um [00:11:32] is is um getting more um trainable uh parameters [00:11:37] getting more um trainable uh parameters and not just necessarily trainable [00:11:39] and not just necessarily trainable parameters but creating a better [00:11:41] parameters but creating a better nonlinearity for a small architecture. [00:11:44] nonlinearity for a small architecture. Even if we select the hidden layer um [00:11:47] Even if we select the hidden layer um value equal to 8 di divided by 3, it [00:11:51] value equal to 8 di divided by 3, it keeps the the same size of the network [00:11:54] keeps the the same size of the network in terms of the number of parameters but [00:11:56] in terms of the number of parameters but it does um learn higher dimensional [00:12:00] it does um learn higher dimensional nonlinearities um in [00:12:04] nonlinearities um in uh in in that layer. The last piece is [00:12:08] uh in in that layer. The last piece is mixture of extra experts that is often [00:12:10] mixture of extra experts that is often used in even the very modern [00:12:12] used in even the very modern architectures these days. Instead of [00:12:15] architectures these days. 
[00:12:18] Instead of having one set of MLP layers, you can have multiple sets of MLP layers. Each of those will be an expert, and through a router the tokens will be routed to a few of those experts, so at any time we only have a few active experts. Again, what this does is increase the number of parameters, and it helps learn more robust models without increasing the compute too much. These are all parallel MLPs, so we can have multiple experts in parallel. As I said, they are used in all LLMs these days, large language models; all of the modern LLMs, up to the level that we know about, are using these types of tweaks. And this is the summary of all of the tweaks I just mentioned.
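A toy top-k routing sketch of the mixture-of-experts idea just described. The names and the routing rule are my own simplification; production MoE layers batch this per expert and add load-balancing terms.

```python
import numpy as np

def moe_layer(tokens, Wrouter, experts, k=2):
    """Toy mixture-of-experts: a linear router scores every expert for
    each token, only the top-k experts run on that token, and their
    outputs are mixed with renormalized router weights."""
    scores = tokens @ Wrouter                 # (n, num_experts)
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        top = np.argsort(scores[i])[-k:]      # indices of the k active experts
        w = np.exp(scores[i][top])
        w /= w.sum()                          # softmax over the chosen experts
        for weight, e in zip(w, top):
            out[i] += weight * experts[e](tok)
    return out
```

So the parameter count grows with the number of experts, while each token only pays the compute of its k active experts.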
[00:13:30] Is this similar to a bias? No, this is completely a trainable parameter by itself. You then train either a feed-forward network or just a linear projection to turn that output into the probability vector. And it's not just that; remember that you have so many self-attention layers here, right? Those self-attention layers are basically fusing the information, creating attention between all of the tokens and this class token. So when you supervise it from here, where the loss function comes in, this output will represent the class probabilities vector. The question is whether there are nice intuitions about what the different experts are doing. That's a great question, because they are trained in parallel and they are initialized differently.
[00:14:32] They often try to learn one aspect, or a related, sometimes very closely related, aspect; but it's just adding more compute and more parameters, giving the network room to learn different things if it does have to learn multiple concepts. For example, if you have to cover multiple probability distributions, then with these ops you often have the power to separate those modes of the data. [00:15:02] The question is whether the number of experts is a hyperparameter or not. Yes, definitely it's a hyperparameter. From what I know, it's often predefined; people don't necessarily over-fine-tune it, but yes, these are all hyperparameters. [00:15:22] And why does moving the layer norm help us learn an identity transformation?
[00:15:37] Look at this architecture: will you be able to create any form of identity? Right after that residual connection, the feature values are changed, because you have a normalization. You will never have the identity in the features, right? Because right after it you see the layer norm. That's why what we do is bring the norm inside the residual branch. [00:16:02] We have quite a few different tasks in computer vision, and these were the core, the most important, tasks for computer vision applications over the years.
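The identity argument can be checked numerically. A hedged sketch, assuming a plain layer norm with no learned scale or shift: with post-LN, `LN(x + f(x))` cannot return `x` even when the sublayer `f` outputs zero, while pre-LN, `x + f(LN(x))`, reduces exactly to the identity when `f` outputs zero.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain layer norm (no learned scale/shift, for illustration only)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

x = np.array([[1.0, 2.0, 3.0]])
f_out = np.zeros_like(x)         # a residual branch that has learned to do nothing

post_ln = layer_norm(x + f_out)  # post-LN: the norm still rescales the features
pre_ln = x + f_out               # pre-LN: x + f(LN(x)) with f = 0 is exactly x
```

`post_ln` differs from `x` (the norm recenters and rescales it), while `pre_ln` equals `x`, which is why moving the norm makes the identity easy to represent.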
[00:16:16] Although these days we're solving much harder tasks, and nobody cares about object detection anymore because now we can just do it with one line of code, over the past 10 to 15 years there have been a lot of advances, and I really want to cover some of those today, just so that if you have to design something new yourself, you know where to look and how to design your models. And then ultimately there is the topic of visualization and understanding, which is very important in many applications. For example, if you're working with medical data, often the visualization and understanding is more important than the classification itself, or the detection of a tumor, for example. You want to know where, why, and so on. [00:17:05] The way we started the class, and this slide, is probably very familiar to everybody.
[00:17:18] We talked about different tasks, and for the task of object classification we talked about this: we spent quite a lot of time over the first few lectures on how we can create a classifier that maps images from pixels into labels. But another similarly important task is semantic segmentation. In semantic segmentation, what we care about is assigning a label to every single pixel inside the image: turning each of the pixels into the label for that object, or anything else in the scene. [00:18:12] So basically, when we train a model that does this, at test time we want to take an image and generate that same map as the output. How do we do that? There are many different options.
[00:18:31] So let's say what I can do is just look at every single pixel and say what the label for that pixel should be. In the very basic form, as you can see here, that's actually close to impossible: it's hard to say what object a specific pixel represents, because there is no context if you only look at the pixel itself. That's why context is important: we look at the surrounding areas. [00:19:10] If I take these patches, the pixel in the center plus the surrounding area, now I can train a convolutional neural network, or any network, that generates the output label for us, right?
[00:19:26] It's the same architecture that we've talked about over the quarter, and you can select any of those we used for image classification, because now you're classifying the entire patch as an image. It could be a CNN, a ResNet, a ViT, whatever. [00:19:40] This is really time-consuming, though, because if you want to run one full network for every single pixel in an image, it will take forever to turn this into a segmentation map. The other option we can use is this: instead of running one network for every single pixel, what if we train a neural network that takes the image as input and outputs the entire pixel map, the segmentation map: not just one single label, but a matrix of labels, right? In that case, we will have our segmentation task solved.
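The naive patch-at-a-time approach can be sketched as below. `classify_patch` is a hypothetical placeholder for any full classification network (CNN, ResNet, ViT); the point is the H*W separate forward passes, which is exactly why this is too slow in practice.

```python
import numpy as np

def sliding_window_segmentation(image, classify_patch, patch=5):
    """Naive segmentation: run a full classifier on the patch around every pixel.
    That is one forward pass per pixel -- O(H*W) passes for an (H, W) image."""
    pad = patch // 2
    padded = np.pad(image, pad, mode="edge")   # replicate borders so every pixel has a patch
    H, W = image.shape
    labels = np.zeros((H, W), dtype=int)
    for i in range(H):
        for j in range(W):
            labels[i, j] = classify_patch(padded[i:i + patch, j:j + patch])
    return labels
```

A fully convolutional network replaces this whole double loop with a single forward pass over the image, which is the next idea in the lecture.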
[00:20:32] In order to do that, we need to have a layer at the input that is the same size as the image, and at the output you also need some sort of inflated layer. You can't go to fully connected layers and so on, because now we are generating an image, and because of that we need to keep the network inflated. That's what we often call fully convolutional networks, or FCNs. [00:21:10] This is definitely a great idea, but there is a caveat, a problem: these images are large, so these layers will become very large, and there will be very many parameters to optimize. Especially in the early years, when we didn't have powerful GPUs, this was a bottleneck, a challenge, for training these algorithms. And that's why the algorithms evolved,
[00:21:45] starting from full-size images and going down in resolution, making the spatial resolution of the convolutions smaller and smaller through downsampling operations. Somewhere in the middle we have a low spatial resolution but something thick in terms of the number of channels, and from there we go back up to the same size as the image to create the output pixels. [00:22:14] In order to do that, we know how to do the downsampling, right? Downsampling was easy; we've talked about it: the pooling operation, strided convolution, and several other operations that could be used here. But on the upsampling side, we don't really know how to do it yet, right?
[00:22:50] Because we don't have a reverse of pooling, or reverse convolutions, right? Because of that, we had to invent some new operations that reverse downsampling. But before I get to defining what upsampling is, maybe I can ask you a question: how do you think this network is trained? Now we have a network that starts from an image and ends with an image, and the tool we have for training a network is a loss function, right? [00:23:37] What do you think is the best way to define a loss function for this network? We talked about softmax loss, right? We also talked a little bit about some regression losses and the SVM loss. But assuming we want to use the softmax loss function, how could we train this network? What would the objective be?
[00:24:06] So you said the mean classification loss over each of the pixels, and that's correct. You can add up the loss for every single pixel, because every single pixel is doing a classification, right? So you will have a sum over all pixels of the image, the loss at each pixel is just a simple softmax loss, and then you can backprop. That's the entire loss function you need. [00:24:36] The question is whether we need what we call ground truth for training. That is indeed the ground truth for segmentation, and yes, for these types of algorithms, because they are fully supervised, we do need the ground-truth label maps. In the early years there was a lot of work sitting down and manually labeling the pixels to be able to train these algorithms. These days we don't need that, because we have tools.
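The loss just described, softmax cross-entropy summed (here averaged) over every pixel, might look like this in numpy; the `(H, W, C)` logit layout is an assumption for illustration.

```python
import numpy as np

def per_pixel_softmax_loss(logits, labels):
    """Mean softmax cross-entropy over all pixels.
    logits: (H, W, C) raw class scores per pixel; labels: (H, W) integer class ids."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=-1, keepdims=True)
    H, W, _ = logits.shape
    # pick out the predicted probability of the correct class at each pixel
    p_correct = probs[np.arange(H)[:, None], np.arange(W)[None, :], labels]
    return -np.log(p_correct).mean()
```

Each pixel contributes an ordinary classification loss, so backprop through this sum trains the whole image-to-map network at once.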
[00:25:06] But early on, in order to train these algorithms, we needed the ground truth. Okay, very briefly, let me tell you what we do with upsampling. Upsampling is actually not that hard. We can use an unpooling operation, and there are different ways of doing it. One is nearest neighbor: if I want to go from a 2x2 matrix, as in the example here, to 4x4, I just copy the data, taking for each output the nearest neighbor in the lower-resolution map. Another is bed of nails: in the upsampled version you select only one position, say the one in the corner, to copy the data into, replace everything else with zero, and through the following layers of convolution those values will start appearing.
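Both of these parameter-free unpooling operations are tiny array manipulations. A sketch, assuming a single-channel 2D map:

```python
import numpy as np

def nearest_neighbor_unpool(x, factor=2):
    """Repeat each value of an (H, W) map `factor` times along both axes."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def bed_of_nails_unpool(x, factor=2):
    """Place each value at the top-left corner of its block; zeros everywhere else."""
    out = np.zeros((x.shape[0] * factor, x.shape[1] * factor), dtype=x.dtype)
    out[::factor, ::factor] = x
    return out
```

For a 2x2 input `[[1, 2], [3, 4]]`, nearest neighbor fills each 2x2 block with one value, while bed of nails leaves a single nonzero "nail" per block.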
[00:26:10] If we used max pooling on the encoding side of our network, what we can do is save the locations of the max, the entries that were selected, and then, in the max-unpooling stage, copy the data right back to where the max was. So basically we save the locations in the encoding part, and in the decoding part, in the upsampling step, we reuse those saved coordinates. [00:26:47] The other option is to do a learned upsampling. In all of the operations I just showed, there is no parameter to be learned; it's just a fixed operation. But learned upsampling is also possible. Very simply, let's revisit the convolution. In the convolution layer, what we did was apply a convolution filter at a pixel, generate the output, and repeat this for all of the pixels, right?
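Max unpooling with saved switch locations, as just described, might be sketched like this, assuming a single-channel map and non-overlapping 2x2 windows:

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """size x size max pool on an (H, W) map; also return each window's argmax."""
    H, W = x.shape
    out = np.zeros((H // size, W // size), dtype=x.dtype)
    idx = np.zeros((H // size, W // size), dtype=int)
    for i in range(H // size):
        for j in range(W // size):
            win = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            k = win.argmax()                # flat position of the max inside the window
            out[i, j] = win.flat[k]
            idx[i, j] = k
    return out, idx

def max_unpool(x, idx, size=2):
    """Write each value back to its saved argmax position; zeros elsewhere."""
    H, W = x.shape
    out = np.zeros((H * size, W * size), dtype=x.dtype)
    for i in range(H):
        for j in range(W):
            di, dj = divmod(idx[i, j], size)  # recover (row, col) inside the window
            out[i * size + di, j * size + dj] = x[i, j]
    return out
```

The encoder calls `max_pool_with_indices` and passes `idx` across to the decoder, which reuses those coordinates in `max_unpool`.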
[00:27:21] And when we wanted to downsample, what we did was strided convolution, where instead of taking steps of one we take steps of two and generate the outputs step by step. If you don't remember this part, go back to the lecture where we talked about it, the third lecture I think. We can replicate the same idea for the upsampling process. So this value will represent this area in the upsampled image, and we define some weights here to map it to the output map. For the next one, same story, but there will be overlaps, and for the overlaps you often sum over the output values. [00:28:10] Let me give you an example with a simple 1D function. If the input is just two values, A and B, we learn a filter that maps them to the higher-resolution output, right?
[00:28:30] And for doing that, we just apply the filter to each of the values and write the outputs here; for the parts where there's an overlap, it's a summation of what comes from each of the two locations. [00:28:52] So we did talk about these fully convolutional networks and how they are used; they are some of the most basic and most widely used algorithms for segmentation. I also want to very quickly highlight one widely used network, U-Net; as you can see, it has the shape of a U. It's actually the same architecture as I showed here, just drawn as a U shape.
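The 1D example with inputs A and B can be written out directly: each input value stamps a scaled copy of the learned filter into the output, stepping by the stride, and overlapping contributions are summed. A minimal sketch, with stride 2 assumed:

```python
import numpy as np

def transposed_conv1d(x, filt, stride=2):
    """Learned 1D upsampling: input x[i] contributes x[i] * filt at offset i*stride;
    where stamped copies overlap, their values are added."""
    out = np.zeros(stride * (len(x) - 1) + len(filt))
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(filt)] += v * filt
    return out
```

For `x = [1, 2]` and filter `[1, 1, 1]` the output is `[1, 1, 3, 2, 2]`: the middle entry is the overlap, where the copies stamped by the two inputs are summed.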
[00:29:35] The reason I'm highlighting this is that still today, some of the medical applications that work on segmentation use it; U-Net or its variants generate the state-of-the-art results if you don't want to use a foundation model. What it does is exactly what we explained: a downsampling phase that increases the field of view and loses some spatial information, and then an upsampling phase that goes back to the image resolution. [00:30:15] The only difference in U-Net, because it's used for segmentation, is the understanding that we need to keep the spatial information on the decoder side, because when we downsample we lose resolution.
[00:30:37] Then in upsampling, if you don't have that information, it's going to be a little bit hard, and sometimes the boundaries come out faded. In order to avoid that, the feature maps on the encoder side are copied over as inputs to the decoder layers. That way you keep the structural information of the image and generate outputs that are much sharper. So this was the idea behind U-Net, and as I said, it's actually used quite often. [00:31:14] Summary of semantic segmentation: we talked today about fully convolutional networks, where you have the same kind of filter as we had for downsampling here. Actually, to save time I removed some of the slides from this part; I have them in the backup slides, and you should check them out.
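The skip connection described here is literally a copy-and-concatenate of the saved encoder map with the matching decoder map. A sketch, assuming channels-last `(H, W, C)` feature maps of equal spatial size:

```python
import numpy as np

def unet_skip_merge(decoder_feat, encoder_feat):
    """U-Net style skip: concatenate the saved encoder feature map onto the
    upsampled decoder feature map along the channel axis."""
    assert decoder_feat.shape[:2] == encoder_feat.shape[:2], "spatial sizes must match"
    return np.concatenate([encoder_feat, decoder_feat], axis=-1)
```

The next decoder convolution then sees both the coarse, upsampled features and the fine spatial detail from the encoder, which is what keeps the boundaries sharp.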
[00:31:54] This is a transposed convolution. We do have a 3x3 filter here, but instead of applying the regular convolution to the input data, we apply the transposed version of the convolution operation, and it actually generates a larger output. So it's the transposed convolution, the reverse of the regular convolution. But why "transposed"? I would refer you to the additional slides. [00:32:27] The question is whether that filter is trained. Yes, it's very much like other convolution layers; all of the filters are trained. [00:32:39] Okay, great. This was the topic of semantic segmentation. As we talked about, we only get labels for the pixels. But if there are two instances of the same object, we have no idea which one is which, right? Because this is just outputting the pixel labels.
[00:33:09] And this brings us to the topic of instance segmentation, where now we not only care about the pixel classes, but I also want to know that these pixels belong to one instance of the dog and this next one is actually a different dog, right? To do that, what we need is an understanding of multiple objects in the image, which brings us to the topic of object detection. [00:33:44] Object detection has been, after image classification, one of the core computer vision problems and tasks, and for many years many different algorithms were proposed just for the task of object detection.
[00:34:13] We are going to fly over some of them and highlight a couple of important ones, but again, there are so many works in the literature, even in the deep learning literature, that I'm not covering here. So, over the past 10 to 15 years, how can we solve the problem of object detection? If it's just a single object, it means that we need to do the classification, generating class scores, as well as getting the coordinates of a bounding box. So you need the coordinates of the box, x, y, h, and w, as the output, as well as what class it is, right? So this is exactly the task of object detection. How can we solve this? It's very simple, right? We can define a softmax loss function for the class scores, and we can define an L2 loss function, which is a simple distance metric, a regression loss, for the box coordinates.
[00:35:24] And having these two defined, we have a multitask loss. We are solving two tasks at the same time. And for doing that, we again add the loss values and generate a compound loss function, as you can see here. So this is simple; it's doable. If we have one single object, we can for sure solve this problem using this architecture that I talked about. But this is not that easy if we have multiple objects in the scene. For three objects, we have to generate 12 output numbers, and if there are more, it's going to be too many numbers to generate. So this algorithm is not really scalable. It's just extending classification into some sort of object detection, which is fine, but it's not really scalable.
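The compound loss above can be sketched directly: softmax cross-entropy for the class scores plus a weighted L2 term for the four box coordinates. The function name and the weight `lam` are my assumptions, the weighting is a common design choice rather than something fixed by the lecture.

```python
import numpy as np

def multitask_loss(class_scores, true_class, pred_box, true_box, lam=1.0):
    """Compound detection loss: softmax cross-entropy + L2 box regression.

    `lam` balances the two tasks (an assumed hyperparameter).
    """
    # Softmax cross-entropy on the class scores (numerically stabilized).
    z = class_scores - class_scores.max()
    log_probs = z - np.log(np.exp(z).sum())
    ce = -log_probs[true_class]
    # L2 (squared-distance) regression loss on the (x, y, w, h) coordinates.
    l2 = np.sum((pred_box - true_box) ** 2)
    return ce + lam * l2

scores = np.array([2.0, 0.5, -1.0])          # e.g. cat / dog / background
loss = multitask_loss(scores, true_class=0,
                      pred_box=np.array([10., 10., 50., 40.]),
                      true_box=np.array([12., 9., 50., 42.]))
print(loss)
```

Both terms are scalars, so the gradients from classification and regression simply add during backpropagation.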
[00:36:25] So when there are multiple objects, one solution is, instead of getting the entire image as the input, why not look at bounding boxes? For each bounding box, we can say we only have one label: whether it's a cat or a dog or the background, right? And if I have this way of classifying each of the bounding boxes, I can do a sliding window. I can slide bounding boxes over the image, from coordinate (0, 0) through all combinations of x, y, h, and w, and see if we can detect the object. So, step by step, I can find the bounding boxes that have the maximum probability of each of the objects. But there is a huge problem here, right? Again, there are so many different combinations of bounding boxes that we can use, and again this algorithm is not scalable.
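A quick count shows why exhaustive sliding windows blow up: a box is one choice of top, bottom, left, and right edge, so the number of candidate windows grows quartically with image size.

```python
# Every axis-aligned box in an H x W image is a pair of horizontal edges
# times a pair of vertical edges.
def num_boxes(H, W):
    return (H * (H + 1) // 2) * (W * (W + 1) // 2)

print(num_boxes(2, 2))      # 9 boxes even in a tiny 2x2 image
print(num_boxes(224, 224))  # ~635 million candidate windows at 224x224
```

Running a full CNN on hundreds of millions of windows per image is clearly infeasible, which motivates the region-proposal methods that follow.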
[00:37:39] What we had been doing in the literature in the early years, if you look at the years these articles were published, 2014 and before, there was a lot of research around finding regions that have a high probability of containing an object: region proposals. And if I have a way to find region proposals, it's actually going to be a relatively easy problem. I can do the same thing as I explained earlier, right? For an image, if I have region proposals, I can just take that patch out, run a CNN, a convolutional neural network, on that patch, and then classify it. And I can even refine the bounding boxes. So: classify, and then refine the bounding boxes, to have the object detected.
[00:38:51] So we can classify the boxes and also refine the bounding boxes, if I have to change the coordinates a little bit. And this is what is called the R-CNN algorithm. And although it works, and again this is one of the early algorithms, CVPR 2014, these are very slow, because for each of these boxes we are running a full convolutional neural network. But there is one catch. What we can do, instead of running a convolutional neural network on each of these boxes: because convolution operations preserve the spatial information, right, they either downsample or upsample, we always have a way to track where in the pixel space they are.
[00:39:55] So in that case, what we do is, instead of running the convolutional neural network on the patches, let's say we run one big convolution on the entire image, and now we have the regions in that feature map corresponding to the entire image. Let's look at those regions and now run a smaller CNN on top of those and generate the outputs, for each of the two outputs that I want: first the box offset, like should I move the bounding box a little bit, and second what the object category is. So this is the fast version of R-CNN. These are some basic algorithms where you can use convolutional neural networks for detecting objects, their bounding boxes, and so on. The question is whether the number of proposed regions is predefined. The short answer to that is yes. I will talk very briefly about the region proposal networks too.
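The key trick, sharing one backbone pass, can be sketched as projecting an image-space box onto the shared feature map and cropping it there. The function name and the stride of 16 are assumptions (16 is the total downsampling of a typical VGG-style backbone, not a number from the lecture).

```python
import numpy as np

def roi_crop(feature_map, box, stride=16):
    """Project an image-space box onto a shared feature map and crop it.

    `stride` is the backbone's total downsampling factor (an assumed,
    typical value). box = (x1, y1, x2, y2) in image pixels.
    """
    x1, y1, x2, y2 = box
    # Snap image coordinates to feature-map cells (floor start, ceil end).
    fx1, fy1 = x1 // stride, y1 // stride
    fx2, fy2 = -(-x2 // stride), -(-y2 // stride)
    return feature_map[fy1:fy2, fx1:fx2]

fmap = np.random.rand(14, 14, 512)   # e.g. a 224x224 image at stride 16
patch = roi_crop(fmap, (32, 48, 96, 112))
print(patch.shape)                   # small region to feed the detection head
```

The expensive convolutions run once per image; each proposal only costs one cheap crop plus the small head on top, which is the speedup behind Fast R-CNN.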
[00:41:00] So, easy algorithms, right? One puts the bounding boxes of the proposed regions on the images; the other puts them on the feature maps of the convnet; and both of those generate the output class label as well as an offset for improving the location of the detected object. But this requires us to do that region proposal first, and to have a region proposal network that tells us where in the image we should look. And there has been research on building region proposal networks, RPNs. Here, what we do is we start with a CNN, we start randomly in different locations in the image, and through layers of convolution we refine those regions toward where they have a higher probability of having an object in them, because we have the object labels and locations.
So we can optim we can uh supervise this [00:42:17] we can optim we can uh supervise this and then each of those also refine the [00:42:20] and then each of those also refine the box coordinates. So basically a neural u [00:42:24] box coordinates. So basically a neural u a region proposal network what it does [00:42:27] a region proposal network what it does is it [00:42:29] is it refineses the boxes each of those boxes [00:42:32] refineses the boxes each of those boxes that have a probability a high [00:42:34] that have a probability a high probability of an object in them and the [00:42:38] probability of an object in them and the box [00:42:40] box uh the output boxes the box corrections [00:42:43] uh the output boxes the box corrections again I'm I'm leaving all of the details [00:42:45] again I'm I'm leaving all of the details about the coordinate uh coordinates and [00:42:47] about the coordinate uh coordinates and all of these u dimensionalities [00:42:50] all of these u dimensionalities for you to pick uh afterwards because [00:42:52] for you to pick uh afterwards because you it will take too much time and we [00:42:54] you it will take too much time and we don't want to spend too much time on [00:42:55] don't want to spend too much time on this uh algorithm. But what's important [00:42:58] this uh algorithm. But what's important here is back to your question, we often [00:43:02] here is back to your question, we often take the top k the ones that have the [00:43:07] take the top k the ones that have the highest probability of having an object [00:43:09] highest probability of having an object in them as the proposals for this image. [00:43:12] in them as the proposals for this image. This is an simple image. So then then [00:43:16] This is an simple image. So then then most of the and only has one object. So [00:43:18] most of the and only has one object. 
[00:43:22] So most of the regions are centered around that single object, but in general that's not the case. In many setups, we can have region proposals used in different ways, and we can get different objects with higher probabilities. And after talking a little bit about R-CNN and Mask R-CNN, which again, it's important for you to go through the details, and if you can spend some time doing the calculations yourself, that would be very good, those types of algorithms, R-CNN and Mask R-CNN, are not being used much anymore these days, because they are computationally very heavy. Although it's important to understand how we got to this point, that's true for many reasons. One of those reasons is that we need two separate networks: one region proposal network, and then one classification and box refinement network.
[00:44:28] So it's at least two passes for detecting objects, for each image, right? And that's why there have been advances after these, using single-stage object detectors, and one of the most popular ones is called YOLO. If you work with any computer vision problem, you've probably heard about YOLO, even to date, although it's a convolution-heavy network, at least in its earlier versions. In many industrial applications, YOLO is being used as the base for object detection, because it's a fast object detector, and it's very good in terms of detecting objects. What YOLO does, I want to very briefly tell you a little bit about. It's basically "you only look once": with one single pass over the image, you generate all of the bounding boxes.
[00:45:46] How it does it: it divides the image into an S by S grid, and in this example it's 7 by 7. What happens is that for each single cell in that grid, it creates a fully convolutional network that outputs the probability of an object being in that location, plus refinements of the bounding boxes. So it generates B bounding boxes, with B as a new hyperparameter, that are refinements of the object present in that cell, and it also generates object class probabilities. In this case, for example, if B equals two, it generates just two bounding boxes with different probabilities. It does this for all of the cells at the same time.
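The size of that grid output follows directly from S, B, and the number of classes C: each cell predicts B boxes of (x, y, w, h, confidence) plus C class probabilities. A shape sketch, using the 7x7 grid from the slide with B = 2 and C = 20 (the Pascal VOC setting of the original YOLO paper):

```python
def yolo_output_shape(S=7, B=2, C=20):
    """YOLO's output grid: S*S cells, each predicting
    B boxes x (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)

print(yolo_output_shape())            # (7, 7, 30), as in the original paper
print(yolo_output_shape(13, 5, 80))   # a hypothetical larger configuration
```

One forward pass fills this whole tensor at once, which is exactly why the detector is single-stage.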
So basically it's the it's the same network that is [00:46:46] it's the it's the same network that is being uh generating something as the [00:46:49] being uh generating something as the output for each of these [00:46:51] output for each of these bounding boxes and it it it does [00:46:54] bounding boxes and it it it does generate a number of different options [00:46:57] generate a number of different options for the for the for the object and as I [00:47:01] for the for the for the object and as I said each of those boxes are associated [00:47:04] said each of those boxes are associated with a probability and pro in this [00:47:06] with a probability and pro in this example the probability is shown with [00:47:08] example the probability is shown with the weight of the edges [00:47:11] the weight of the edges uh in each of those boxes. [00:47:14] uh in each of those boxes. And um for these many different bounding [00:47:16] And um for these many different bounding boxes and object probabilities now we [00:47:19] boxes and object probabilities now we can do thresholding [00:47:21] can do thresholding and also [00:47:25] um [00:47:27] um there is an algorithm that they used in [00:47:28] there is an algorithm that they used in the paper again I don't want to go into [00:47:30] the paper again I don't want to go into the details uh non uh maximal [00:47:33] the details uh non uh maximal suppression and um some some algorithms [00:47:37] suppression and um some some algorithms uh with thresholding involved that that [00:47:40] uh with thresholding involved that that identifies the ones that have the [00:47:42] identifies the ones that have the highest [00:47:43] highest probabilities. So this is this is a [00:47:45] probabilities. So this is this is a simple implementation or or use of the [00:47:50] simple implementation or or use of the uh [00:47:54] object detection. Again this is this is [00:47:57] object detection. 
[00:47:59] Again, this is something very useful: if you have time, spend time with the repositories of YOLO. There are so many different newer versions of YOLO that are being used for many applications in medicine, robotics, and also in many industrial applications. So the question is how we get this second image, and what the intuition behind it is, right? As I said, for each of the grid cells we generate B bounding boxes; for this one we generated two, and for all the others we also generate two. Each of these boxes is associated with the probability of an object existing in it, and if I put all of them together for all of the patches, I have so many boxes, and each of those is associated with a probability, right? Let's move on. One of the more recent approaches for object detection is DETR, a detection transformer.
[00:49:11] This is purely based on transformers, the topic that we discussed last week and that I started today. The same type of self-attention and cross-attention modules can also generate object detections and bounding boxes for us. How does this work? This is actually not a very old paper, 2020, almost five years ago, although it's now kind of deprecated, nobody uses this for real applications, but it's a very good example of how to use transformers for object detection. What we do here is basically similar to what we explained earlier. We turn the image into patches, and then those patches are passed through CNNs, creating tokens. Then we add positional encoding to the patches, the same way that I explained, and those define our input tokens, which are inputs to the transformer encoder.
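The patches-to-tokens step above can be sketched with shapes only. Everything here is illustrative: a plain linear projection stands in for the small CNN the lecture mentions, and the positional encodings are random placeholders for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def image_to_tokens(img, patch=16, d=64):
    """Split an image into non-overlapping patches, project each to a
    d-dim token, and add a positional encoding.
    (A linear projection stands in for the lecture's per-patch CNN.)"""
    H, W, C = img.shape
    n = (H // patch) * (W // patch)
    patches = (img.reshape(H // patch, patch, W // patch, patch, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(n, patch * patch * C))
    W_proj = rng.normal(size=(patch * patch * C, d))  # projection weights
    pos = rng.normal(size=(n, d))                     # positional encodings
    return patches @ W_proj + pos

tokens = image_to_tokens(rng.normal(size=(64, 64, 3)))
print(tokens.shape)  # (16, 64): 16 patch tokens, each 64-dimensional
```

The resulting (num_patches, d) matrix is exactly what the transformer encoder consumes.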
[00:50:23] A transformer encoder, again, is a bunch of self-attention, layer normalization (or any normalization), and MLP layers, which generates the output tokens after multiple encoder layers. Then, in order to generate the bounding boxes, and this is the smart part of this algorithm, it takes the encoder output tokens as input to the transformer decoder. But we also define some queries, which are trainable parameters themselves. If I add, for example, five queries as input, or four, or ten, or twenty, I'm seeking up to that many objects to be detected in that image. And then again, this goes through a combination of self-attention layers at the beginning of the transformer decoder, as well as cross-attention with the encoder output.
[00:51:39] Through those cross-attention and self-attention layers, it generates the output values for each of these queries, which are passed through an FFN, a feed-forward network, to generate either class labels and bounding boxes, very similar to what we discussed earlier, or, in some cases, simply "no object to be detected." And at the end we have the bounding boxes, and the classes associated with the bounding boxes, as the output. So the question is: are we inputting every possible box to the transformer? No, the inputs here are some general parameters, queries, representing the request that I want an object to be output in place of this input query. Right? So there is no box or anything as the input. It's part of the output: the network generates the class label and the box coordinates.
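A minimal shape sketch of that decoder side, heavily simplified and with all sizes illustrative: N learned queries cross-attend (single head, one layer, no self-attention or residuals) over the encoder tokens, then two linear heads stand in for the FFNs, producing C + 1 class logits (the extra slot for "no object") and 4 box coordinates per query.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, T, C = 64, 10, 16, 20          # token dim, queries, encoder tokens, classes

queries = rng.normal(size=(N, d))    # learned object queries (trainable params)
enc_out = rng.normal(size=(T, d))    # encoder output tokens

# Single-head cross-attention: each query attends over the encoder tokens.
att = queries @ enc_out.T / np.sqrt(d)
att = np.exp(att - att.max(axis=1, keepdims=True))
att /= att.sum(axis=1, keepdims=True)
decoded = att @ enc_out              # one decoded vector per query

# Per-query heads: class logits (C classes + "no object") and a box.
W_cls = rng.normal(size=(d, C + 1))
W_box = rng.normal(size=(d, 4))
class_logits = decoded @ W_cls
boxes = decoded @ W_box

print(class_logits.shape, boxes.shape)  # (10, 21) (10, 4): one prediction per query
```

The point of the sketch is the output shape: each of the N queries yields exactly one (class, box) prediction, regardless of the image content.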
[00:52:50] So the question is whether the queries are formed in a way that actually represents what we want to look for, and where in the image. In this case, what we are looking for is defined by the class labels, which are predefined, and they are part of the output. So our supervision is based on the class labels. We have a class probability vector, the same way we defined it for the other algorithms. Right? So that's how the algorithm knows what types of classes to look for. And then, in terms of the outputs, again, these outputs are supervised, if you remember, based on the L2 loss against the ground truth boxes, right? So we're not telling it anything in the query part about what to look for, or where to look, for any of the objects.
The training process itself is backpropagation: [00:53:49] if there are any losses, any errors, it backpropagates them through the outputs. So basically we are not determining anything at the beginning. [00:54:05] The question was whether the query means "give me up to nine objects", and yes, that's basically what this means; through the self-attention and cross-attention it will try to generate output tokens that are turned into class and box coordinates through that FFN operation. [00:54:28] Your question is whether those queries are image patches or not. No, they are not image patches. They are just queries, trainable parameters, that you put in to generate the outputs; for each of them you get a value as the output, and that value is turned into class and box coordinates.
[00:54:57] Again, the question is: what are object queries? They are trainable, learnable parameters. You initialize them, the network finds the best values for them, and that's what you get as the output. [00:55:09] The question is whether there's any intuition about which FFN gets which box, right? The short answer is no. We are not including anything that explicitly stops the network from generating multiple copies of the same box, but remember there are so many self-attention and cross-attention layers in there, and they interact with each other in a way that makes each query match a different output. So it's not generating the exact same thing at the output. [00:55:47] And we also have control over how we supervise those FFNs as well.
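The "queries are learnable parameters" idea can be sketched minimally: random initialization stands in for learned values, and the transformer decoder is stubbed out, since only the shapes of the query → FFN-heads path matter here. All sizes and weight matrices below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

num_queries, d_model, num_classes = 9, 32, 5   # toy sizes (assumption)

# Object queries: learnable parameters. Here they are only randomly
# initialized; training would update them by backpropagation.
object_queries = rng.normal(size=(num_queries, d_model))

# Stand-in for the transformer decoder: in DETR each query attends over
# the image features and comes out as one output embedding per query.
decoder_out = object_queries                   # placeholder embeddings

# Small FFN heads (single linear layers here) shared across queries:
W_cls = rng.normal(size=(d_model, num_classes + 1))  # +1 = "no object"
W_box = rng.normal(size=(d_model, 4))                # (cx, cy, w, h)

class_logits = decoder_out @ W_cls
boxes = 1.0 / (1.0 + np.exp(-(decoder_out @ W_box)))  # sigmoid -> [0, 1]
```

Every query thus yields one class distribution and one box, regardless of what is in the image; supervision decides which queries end up meaning "no object".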
[00:55:57] So your question is whether image segmentations, pixel-level segmentations, are part of the training. This algorithm does not require pixel-level segmentations; it's only supervised based on class labels and bounding boxes. [00:56:11] But if you have the pixel-level segmentations, you can always turn them into bounding boxes to train this algorithm, right? It just doesn't require that. [00:56:22] So the question is whether it's possible to generalize to unseen objects. And by unseen you mean a new class label.
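Turning a pixel-level segmentation into a bounding box, as mentioned, is just a min/max over the mask's foreground coordinates. A small sketch with a made-up 6×6 mask:

```python
import numpy as np

# Toy binary mask (assumption: 1 marks object pixels).
mask = np.zeros((6, 6), dtype=int)
mask[2:5, 1:4] = 1

# The tightest box around the foreground pixels:
ys, xs = np.nonzero(mask)
box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
# box is (x_min, y_min, x_max, y_max), usable as detector supervision.
```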
[00:56:37] For these types of fully supervised algorithms, there is often no way, because you are creating a class probability vector; there's no way of adding something at the end for a new class without previously knowing that there are other classes, right? So in fully supervised networks there's often no new object. We can have a background or "no object" label, as you can see we have the label of no object. [00:57:03] But there are many algorithms and extensions of these types of algorithms that are used for zero-shot learning. Zero-shot means understanding something new without having an example of it in the training data. But that's beyond this topic. [00:57:18] What happens if you have more objects in the scene than what you put in as queries? Right, that's a great question.
[00:57:32] It often generates the ones on which it has the highest confidence, so the bounding boxes with the highest confidence, and in those cases you often want to add more queries just so you can get more objects, right? [00:57:46] Okay, I'll be here to answer questions if you have any after the class, but we have a bunch of other topics to cover and I want to make sure we go over them, so that at least you get familiar with the topics. [00:58:02] So with object detection done, back to the question that was asked earlier: how can we use these types of algorithms for instance segmentation? That's actually not too hard.
[00:58:19] We talked about this when we were discussing our R-CNN algorithms, where we run a CNN on the image, then we have a region proposal network that gives us the bounding boxes, and those bounding boxes are turned into class labels and bounding-box refinements. That's what we've talked about so far, with R-CNN and so on. [00:58:44] Now we can turn this into a Mask R-CNN that also generates the mask. It's basically the same architecture we talked about earlier; we can add one more output, make it more multitask, and generate the mask predictions. [00:59:05] So what we used to do before was: image, region proposals, then the CNN gives us the class label and the box coordinates. Now we add another convolution layer that generates the mask for that object at the pixel level.
[00:59:28] And that mask could be the same size as the input image, on the layer itself. If we use a fully convolutional neural network, that's what we often get as the output for each of the objects. [00:59:44] When we have that box, even a tiny box, we can always get the mask for it: the chair under different settings of the box itself, if you have different boxes; the bed; and the human, the baby, in the image. This is an extension of the R-CNN algorithm which we call Mask R-CNN. [01:00:10] With Mask R-CNN the results have actually been very good at detecting different known objects that we could train the algorithms for. And there are so many APIs and open-source versions of object detectors that you can explore; there are some links and resources here.
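The "one more output" idea behind the mask branch can be sketched as an extra convolution over each ROI's features producing a per-class, per-pixel mask logit. A real Mask R-CNN head stacks several 3×3 convolutions and upsamples; here a 1×1 convolution is written as a channel matmul, and all sizes are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W, num_classes = 8, 14, 14, 3          # toy sizes (assumption)
roi_feat = rng.normal(size=(C, H, W))        # ROI-aligned feature map

# A 1x1 convolution over channels, written as an einsum: at every
# spatial location, mix the C input channels into one logit per class.
w_mask = rng.normal(size=(num_classes, C))
mask_logits = np.einsum('kc,chw->khw', w_mask, roi_feat)

# Per-pixel sigmoid gives a soft mask for each class.
mask_prob = 1.0 / (1.0 + np.exp(-mask_logits))
```

The class and box heads from before stay unchanged; this branch just adds a third, spatially dense output to the same multitask network.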
[01:00:42] But this all basically rounds up and summarizes some of the tasks that we wanted to cover, and it's actually very important for you to understand these tasks. They have been core computer vision tasks. [01:00:52] Although these days computer vision is way more advanced and not bound to these tasks, if you have industrial applications, for example quality control, separating rotten tomatoes from good tomatoes in an industrial pipeline, then with computer vision you need to be able to detect objects and then classify them as good or bad, right? That's why it's still important to understand these steps and pipelines and how to do them in real time. But now there are larger-scale models that you're all familiar with. [01:01:32] This summarizes the first part, the computer vision tasks that I wanted to talk about.
[01:01:42] And the last piece that I want to spend ten minutes on is visualization and understanding. Again, this has been a big lecture by itself, and from roughly 2015-16 until the 2020s, and even before that, around 2013-14, the topic of visualizing neural networks was very hot. [01:02:09] It helped us gain understanding into what the networks are learning, and I'm going to summarize some of the most important techniques here that you may need to use in your applications. [01:02:25] But before that, let me go back to the linear classifier that we talked about. We spent quite a lot of time on linear classifiers.
[01:02:36] With the linear classifiers, what we did at the end was say: if I look at the linear function, at what the network is learning, I can see a template for each of the classes. For example, for the car class you can always see a front-facing car as a template, right? [01:02:54] We can do the same with neural networks. Here we visualized the weights of the linear function from a visual viewpoint; I can do the same by visualizing the filters in the neural networks. [01:03:12] For each of the filters, the network is learning something basic: simple shapes and orientations, as you can see here.
[01:03:30] Although, this visualization can only be done for layers that have few channels. For example, if we have three channels I can put them in an RGB image and just visualize it. But as you remember, in CNNs that was not the case: we sometimes had quite a few channels in the middle layers, so it's not easy to visualize those as something we can see. [01:03:58] But in early layers, where we have fewer channels, we can visualize them and see that the network is actually learning some patterns. It starts learning patterns, and then at later stages it gets more holistic, bigger patterns. [01:04:19] If we run something called guided backpropagation, we can also visualize those, but not as simply as this.
[01:04:34] I want to highlight a couple of ways of evaluating, understanding, and visualizing neural networks which are actually quite important. One is the concept of saliency. [01:04:48] In many applications it's very important for you to know which pixels matter. For example, in a medical application, when you do a classification of tumor versus none, you want to see which parts of the image actually are the tumor, because if you want to automate this, nobody cares only about knowing whether there is a tumor or not; everybody cares about where in the image the tumor is, right? [01:05:14] The simplest setup is: we train a feed-forward neural network that generates the value, or the class label, "dog".
[01:05:32] Actually, before that: I showed you that in this case, in order to train this network, what we did was always take the derivative of the loss, or of the class score, with respect to the weights, in order to update the weights. [01:05:52] Now what I need is, for each pixel, to see how much changing the pixel value would affect the dog score, right? What does this mean? What I just explained is the meaning of the derivative, of the gradient. [01:06:17] So if I take the gradient of the score with respect to the pixel values now, not the network weights anymore, I can visualize those gradients, and visualizing them shows the pixels that matter in order to classify "dog" in this image.
[01:06:41] Those are the pixels that matter: if I change the values of those pixels, the dog score will change, right? Again, this is the basic meaning and definition of gradients that we've talked about. [01:06:54] So this is one way; if you run this on different objects that the network was trained on, this is what you get. [01:07:07] That's one way of understanding saliency, and it's very effective in many cases. But sometimes it's not just about the pixel values all the way at the back; you want to see, for each of the classes, how the activations work. And this brings us to class activation maps, or the CAM algorithm.
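The saliency recipe just described, the gradient of the class score with respect to the pixels, is easiest to see with a linear scorer, where that gradient is exactly the weight vector. A toy sketch with a finite-difference sanity check (all data random, sizes assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

n_pixels = 16
w = rng.normal(size=n_pixels)    # "dog" row of a linear classifier
x = rng.normal(size=n_pixels)    # flattened toy image

score = w @ x                    # class score s(x) = w . x

# d s / d x_i = w_i, so the saliency of pixel i is |w_i|.
saliency = np.abs(w)

# Finite-difference check on one pixel: nudge it and re-score.
eps = 1e-6
x_perturbed = x.copy()
x_perturbed[3] += eps
numeric_grad = (w @ x_perturbed - score) / eps   # close to w[3]
```

For a deep network the same quantity is computed by backpropagation to the input instead of to the weights, but the interpretation is identical: large-magnitude entries mark pixels whose change most moves the class score.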
[01:07:38] Class activation mapping, CAM, or Grad-CAM, which I will talk about in two minutes, are among the most widely used algorithms for understanding CNNs, and they can be used for other architectures too. But for transformers we have a much better way of making sense of things, which we actually talked about in the last lecture. [01:08:00] So what happens is that for each of the convolution layers we often do pooling, and the pooling generates feature maps. The feature maps are then turned into scores using those weight values. [01:08:20] If we expand the math, we can simply write the class scores in a weighted-sum form.
[01:08:35] And this means you can trace class predictions all the way back to the feature maps and to specific locations in space, because convolution layers are always mapped to locations in image space too, right? We do convolution, and that spatial consistency across all of the operations can help us trace back all the way to the image space. [01:08:58] So anyway, we can look at the feature maps and see how the class activations for each of these classes are actually impacting those locations in the image. [01:09:13] And with that, if I do this multiplication of the weights we've learned against the feature values, we create the class activations.
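The weighted-sum form is easy to check numerically: with global average pooling, the class score is Σₖ w_ck · mean(Fₖ), and reusing the same weights at every spatial location gives the CAM, M_c = Σₖ w_ck Fₖ. Toy sizes and random values below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, H, W = 4, 7, 7
feats = rng.normal(size=(K, H, W))   # last conv layer's feature maps
w_c = rng.normal(size=K)             # FC weights for one class c

# Class score: global average pooling, then the linear layer.
score_c = w_c @ feats.mean(axis=(1, 2))

# CAM: the same class weights applied per spatial location.
cam = np.tensordot(w_c, feats, axes=1)   # shape (H, W)
# Averaging the map recovers the score exactly, which is why the map
# shows which locations drive the classification.
```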
[01:09:30] And this means that I now have a way to go back to the image space, because as long as I'm in the convolutional space, I can go all the way back to the image and create these maps. For example, for each of the classes, palace, dome, church, altar, and monastery, we can have different class activation maps. [01:09:54] These are the pixels, or areas of the convolution layer, that have been driving the scores for these specific classes. It's the same for others, like class activation maps for one single object in different images. [01:10:16] But there's a problem with this: we can only apply it to the last convolution layer, because of the way we did the calculations here.
[01:10:34] And in order to solve that problem, there is one variant of the algorithm called Grad-CAM, gradient-weighted class activation mapping. It's basically the same algorithm, except we calculate the weights using gradients: we take one of the layers that created some activation at the class level, and we compute gradients instead of just calculating the multiplication between W and the feature. [01:11:06] We go all the way back with the gradients and create a weight based on them, an aggregate of all of the weights and gradients up to that specific layer, and then we weigh the feature maps with that. We also use a ReLU to only pass the positive values. [01:11:34] And that can also be shown all the way back in the image space.
[01:11:43] So I talked about CAM, which could only be applied to the last convolution layer. But in most CNN architectures we don't have just one convolution layer at the end, right? We always have some other operations, fully connected layers and so on. [01:11:59] So in order to carry this class activation back to a convolution layer when there is something else in the middle, we often use the gradients and weigh the maps with the gradient aggregates, and then we can actually do the visualization: that creates these heat maps for each of the objects. [01:12:25] So this was about CNNs, but we talked about transformers in the last lecture, and they actually inherently come with attention maps.
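Grad-CAM's recipe, average the gradients of the class score over space to get one weight per feature map, then ReLU the weighted sum, can be sketched with numerical gradients. The `head` function below is a stand-in for everything after the chosen conv layer; with this GAP-plus-linear head the gradient weights reduce to the CAM weights, which is the known sense in which Grad-CAM generalizes CAM. Sizes and values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

K, H, W = 3, 5, 5
feats = rng.normal(size=(K, H, W))   # feature maps at the chosen layer
w = rng.normal(size=K)

def head(F):
    # Stand-in for everything after the chosen conv layer: global
    # average pooling followed by a linear class score.
    return w @ F.mean(axis=(1, 2))

# Numerical gradient of the score w.r.t. every feature-map activation.
eps = 1e-6
grads = np.zeros_like(feats)
base = head(feats)
for idx in np.ndindex(feats.shape):
    F2 = feats.copy()
    F2[idx] += eps
    grads[idx] = (head(F2) - base) / eps

alphas = grads.mean(axis=(1, 2))     # one importance weight per map
grad_cam = np.maximum(0, np.tensordot(alphas, feats, axes=1))  # ReLU
```

Because the gradients come from backpropagation rather than from the last layer's FC weights, the same procedure works at any convolution layer, even with fully connected layers in between.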
[01:12:38] Do you remember that language attention matrix that Justin showed, where for each of the output words there is an attention weight over the input? We can do the same thing for pixels: for each of the outputs we can create these maps in pixel space and visualize the features of the ViTs there. So basically with ViTs and transformers this is much easier; you already have a way to visualize the attention weights. But with CNNs we often use Grad-CAM or these types of algorithms. That said, I finished the topics I thought I wouldn't be able to complete today, and next session we'll have the lecture on video understanding. Thank you.
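As a rough sketch of why this is "much easier" for a ViT: the attention row from the class token to the patch tokens is already a relevance weight per patch, so visualizing it is just a reshape. The sizes below (a 14×14 patch grid plus one CLS token, one head of one layer) are illustrative assumptions, not something fixed by the lecture.

```python
import numpy as np

# Hypothetical attention matrix from one head of one ViT layer:
# 1 CLS token + 14*14 patch tokens -> (197, 197), rows sum to 1.
rng = np.random.default_rng(0)
logits = rng.standard_normal((197, 197))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# The CLS-token row, restricted to the patch tokens, gives one
# weight per image patch -- reshape it into the spatial grid.
cls_to_patches = attn[0, 1:]           # (196,)
attn_map = cls_to_patches.reshape(14, 14)
print(attn_map.shape)  # (14, 14)
```

Upsampling this grid to the input resolution gives the attention heat map over the image, with no extra gradient machinery needed.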
================================================================================ LECTURE 010 ================================================================================
Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 10: Video Understanding
Source: https://www.youtube.com/watch?v=wElqklprhPE
---
Transcript
[00:00:05] I think at the beginning of the course we announced that we would have a few guest lecturers, people who previously taught the course, come and give a single guest lecture about a topic that they're very familiar with. And I'm very happy to announce we have the first one of those lectures today. So I'll introduce Dr. Ruohan Gao. He is an assistant professor in the department of computer science at the University of Maryland, College Park, and he leads the Multisensory Machine Intelligence Lab there. He was previously an instructor for CS231N from 2022 to 2023, while he completed his postdoc with Fei-Fei Li, Jiajun Wu, and Silvio Savarese.
[00:00:48] So without further ado, I'll leave it to Ruohan to give the presentation today. [00:00:53] Okay, thanks. Hello everyone. It's really exciting to be back in CS231N, and I'm Ruohan, just like Zen introduced. So as you can tell, I'm very interested in multimodal work: not only vision, but also how we can make use of other sensory modalities like audio, tactile, or others, just like humans do, to perceive, understand, and interact with this multisensory world.
[00:01:22] But of course vision is the most important modality, right? That's why we have this course, Deep Learning for Computer Vision. I'm sure up to this point you are very familiar with image classification: given a 2D image like this, how to assign a class label to see whether it's a dog, a cat, a truck, or a plane. That's 2D image classification. And from the last lecture I'm sure you have also learned some other tasks you can do on images, beyond just assigning a single label to say it's a cat.
[00:01:54] You can also do semantic segmentation, to segment the picture into different components with semantic meaning, like where is grass, where is cat, where is tree. You can also put a bounding box on top of the objects you detect in the image, to see where the dog is and where the cat is. And you can do instance segmentation: not only do you want to know the categories, but within each category, if there are two dogs, you want a separate segmentation mask for each instance. That's instance segmentation.
[00:02:22] There are a lot of classification and recognition tasks you can do based on 2D images, but that's not the only thing we can use a computer vision system for, right? Our world is not just static like this. So if we look at this image, hopefully up to this point you have learned a lot of tools: you can train models to classify that this is a living room, you have tools to put a bounding box to see that this is a dog and this is a baby, and you can even produce a segmentation mask to segment out where the objects you detect are in the image. So today we're going to focus on video understanding. More formally, what is video? Basically, video is just a 2D image plus time; there's an extra time dimension.
[00:03:14] So now we are tackling things not only as a 3 × H × W image but in 4D: we have 3 × T × H × W, where T is the temporal dimension and H and W are the spatial dimensions. Now we are considering videos as a volume of video frames. So an example task is video classification, just like image classification. We are given a video like this, where some person is running; we want to take this video as input, train some deep learning model, and classify whether this person is swimming, running, jumping, or whatever action he's doing, just based on this temporal stream of video frames. From the previous lectures I'm sure you have already learned some loss functions, like the cross-entropy loss, and trained an image classifier.
[00:04:13] Similarly, you can use the same tools to train a video classifier: you just get some features and use the same loss functions. So now the problem in video understanding is how we can get features of videos, so that you can apply the loss functions you have learned from the previous lectures, right? Another difference between image classification and video understanding is that the task you want to do might be a little bit different. For image classification you usually care more about the scenes and the objects: you want to classify what the object category is. For videos, just like the example I'm showing here, you usually want to classify actions.
[00:04:58] It's often actions: what activities the person or some animals are doing in the videos. That's what we usually care about in video understanding. So the nature of the things to recognize can be a little bit different. [00:05:12] Another problem we want to be careful about in video understanding is that videos are usually very big, right? When we talk about images, it's just 3 × H × W, a single tensor of RGB numbers. But now when we consider videos, it's a sequence of frames, and it can be 30 frames per second.
[00:05:35] In movies we can sometimes have even higher spatial and temporal resolution. And if you consider the space needed to store videos: standard-definition video takes about 1.5 gigabytes per minute, and if we consider high definition, 1,920 × 1,080, it takes about 10 gigabytes per minute. So it takes a gigantic amount of space to store this kind of video data, and there's no way for us to fit this kind of data directly onto GPUs, right? Beyond the input itself, there are other things you have to store, like the weights and the activations in your convolutional neural networks, so your model will be very huge.
[00:06:36] And what solutions can we use to make videos smaller, to make them processable? One simple solution is that we just make videos smaller, right? Although the original high-definition videos are long, we can shrink things both temporally and spatially. For example, for a 3.2-second video like this, maybe we don't need all the frames in each second; let's just take five frames per second, because there is a lot of redundancy across video frames. If we take five frames per second and also use a smaller spatial resolution like 112 × 112, now we can make the video much smaller; for example, it's about 588 KB for this short clip. But we can definitely also use a larger resolution if we have the compute.
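The storage figures above check out with back-of-the-envelope arithmetic on uncompressed RGB frames (3 bytes per pixel). The assumption here is that "standard definition" means 640 × 480 at 30 fps; the lecture doesn't pin that down, but it reproduces the quoted numbers.

```python
def raw_bytes(width, height, fps, seconds, channels=3):
    """Uncompressed size of an RGB video, 1 byte per channel."""
    return width * height * channels * fps * seconds

# Standard definition, one minute: ~1.5 GB
sd = raw_bytes(640, 480, fps=30, seconds=60)
print(sd / 1024**3)   # ~1.54 GB

# Full HD 1920x1080, one minute: ~10 GB
hd = raw_bytes(1920, 1080, fps=30, seconds=60)
print(hd / 1024**3)   # ~10.4 GB

# Downsampled clip: 112x112 at 5 fps for 3.2 s = 16 frames
clip = raw_bytes(112, 112, fps=5, seconds=3.2)
print(clip / 1024)    # 588.0 KB
```

Real codecs compress far below these raw sizes, but the raw numbers are what matter once frames are decoded into tensors for training.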
[00:07:35] Just like with images, how do we train a model on long videos? In the previous slide I showed that we are training this video classifier on a 3.2-second clip, right? But videos can be very long, minutes or even hours. So one thing people do is train on clips: we train on chunks of video frames. We train models to classify short clips at some low FPS (frames per second), using a sliding window: we sample a lot of different clips, use them as training data, and train a classifier. Then during testing, at inference time, we just run the model on different clips: we sample a few clips, maybe 10 clips, average the prediction results, and that is our prediction for the long video.
[00:08:29] So then what is the simplest video classification model we can use? As I have mentioned, a video is basically just a sequence of image frames. So one simple thing is that we just treat them as images, right? That's the simplest tool we already have: we just run a single-frame convolutional neural network. We have learned that we can train an image classifier, and if we just run our image classifier on top of those video frames, treating them as images, we can indeed get decent predictions, especially on a video like this: you can see that there are not many changes across frames. The person is running; maybe there are some different body movements, but generally it looks pretty similar, right?
[00:09:17] Maybe you just run an image action classifier on every frame; maybe all of the frames will tell you it's running, and if you average the prediction results from each video frame, then you'll predict running for this particular video. This simple image classifier is actually usually a very, very strong baseline, especially for a video like this, because there are not too many changes across frames. So if you are trying to design a video classifier, you should always run this first, because it's a simple thing to try and maybe you can already get pretty decent results. So the question is whether we run on a single frame or on a chunk of frames.
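The per-frame baseline boils down to a few lines. In this sketch, `frame_probs` stands in for the softmax outputs of any pretrained image classifier run on each sampled frame; the classifier itself is not implemented here.

```python
import numpy as np

def classify_video_per_frame(frame_probs):
    """Average per-frame class probabilities, then take the argmax.

    frame_probs: (T, C) array of softmax outputs from an image
    classifier applied independently to T sampled frames.
    """
    video_probs = frame_probs.mean(axis=0)   # (C,)
    return video_probs.argmax(), video_probs

# Toy example: 4 frames, 3 classes; most frames vote for class 1.
probs = np.array([[0.2, 0.7, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3]])
label, avg = classify_video_per_frame(probs)
print(label)  # 1
```

Clip-based inference on long videos works the same way, except each row would be the prediction for a sampled clip rather than a single frame.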
[00:10:01] For this simple single-frame setting, basically you have a video of, say, 30 frames; you sample a few frames, run an image classifier on those sampled frames, treat them as images, and directly average the results. That's basically the per-frame baseline. So I think you asked a very important question: how to sample the frames. That's a key question, because we're given a giant video, we want to sample some frames, and we want to run a CNN on them. So how do we get those frames? That is actually an active area of research. One simple way is to do random sampling. If you have a one-hour video, I don't know where the interesting or important parts are, right?
[00:10:41] We just sample, say, one frame every minute, then run the image classifier and average the results. But obviously, while this gives some reasonable results, it may not be the smartest way to do the sampling. There are other methods that propose smarter sampling strategies: maybe you sample one frame, then use that decision to decide where else to sample. I actually have some examples later in the lecture slides.
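The two simple strategies mentioned here, evenly spaced and random sampling, can be sketched as index selection over the frame count. The function name and the 60-frames-from-one-hour example are illustrative assumptions; smarter content-aware samplers would replace this logic entirely.

```python
import numpy as np

def sample_frame_indices(num_frames, k, strategy="uniform", seed=0):
    """Pick k frame indices from a video with num_frames frames.

    "uniform": evenly spaced frames (e.g. roughly one per minute);
    "random":  k distinct frames chosen uniformly at random.
    """
    if strategy == "uniform":
        return np.linspace(0, num_frames - 1, k).astype(int)
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(num_frames, size=k, replace=False))

# A 1-hour video at 30 fps, sampling 60 frames (~one per minute):
idx = sample_frame_indices(60 * 60 * 30, k=60)
print(len(idx), idx[0], idx[-1])  # 60 0 107999
```

Each selected index is then decoded into a frame and fed to the per-frame classifier.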
[00:11:05] Okay. So this is a very simple kind of video classifier: we just adopt an image classifier, a single-frame CNN. Maybe we can take one step further: instead of directly running a single-frame CNN and averaging the prediction results, maybe we can do some fusion across the features from the single-frame CNN. This is often called late fusion. Basically the idea is that we still take some 2D CNN and we have an input of, say, T frames. For each frame we use the 2D CNN to extract a feature map of D × H′ × W′, and because we have T frames, we get T feature maps. Then the simple thing is that we flatten all the feature maps into vectors and concatenate them.
[00:12:10] Then we have a giant feature vector that basically contains all the features across all the frames, right? And then we can use tools we have learned, like fully connected networks: we train an MLP that maps this vector to some lower dimension, and then we train a classifier on top of it to map it to class scores C. This is called late fusion because, as you can see, we extract the feature maps and process each frame independently, and then at a very late stage we concatenate the feature vectors and run some fully connected layers to do the classification.
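The shape bookkeeping for late fusion by concatenation can be sketched as below. The sizes (T = 8 frames, D × H′ × W′ = 64 × 7 × 7 per-frame features, C = 10 classes) are illustrative, and a random matrix stands in for the MLP; the point is how large the first fully connected layer becomes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame CNN features: T frames, each D x H' x W'.
T, D, Hp, Wp, C = 8, 64, 7, 7, 10
feats = rng.standard_normal((T, D, Hp, Wp))

# Late fusion by concatenation: flatten every frame's feature map
# and stack them into one giant vector of length T*D*H'*W'.
fused = feats.reshape(-1)                  # (25088,)

# A single linear layer standing in for the MLP classifier --
# note how many parameters even this one layer needs.
W = rng.standard_normal((C, fused.size)) * 0.01
b = np.zeros(C)
scores = W @ fused + b                     # (C,) class scores
print(fused.size, W.size)                  # 25088 250880
```

A quarter of a million parameters for one small clip at one layer is exactly the inefficiency discussed next, and it grows linearly with T.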
[00:12:49] This is useful, but one drawback that you can probably already tell from my description is that this fully connected layer is going to introduce a lot of parameters. If we flatten the features and concatenate them across time, then depending on how large T is, you can have a giant feature vector, and mapping this giant feature vector into some lower dimension requires a very large fully connected layer, which introduces a lot of parameters. So it's not very efficient. Another way to do this is that, instead of concatenating them into the giant feature vector and then having a fully connected layer map them to scores,
can actually just do simple pooling. [00:13:40] With pooling, you don't increase the length of the feature vector: if you have some feature dimension for a single frame and you pool across time, over these T frames, you are doing a pooling for temporal aggregation. So instead of a vector of size D times T, you still have a feature vector of size D after pooling, and then you have a linear layer to map D to the dimension C that matches the class scores, and you train with a cross-entropy loss on top of it. That's also late fusion, but now we are using pooling. The good side here is that you don't have to have a very large fully connected layer, but the pooling can also get rid of information that may be important.
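The pooling variant can be sketched like this (again with made-up toy sizes); note how the classifier shrinks from T·D inputs to D inputs.

```python
import torch
import torch.nn as nn

# Late fusion with temporal pooling instead of concatenation (toy sizes):
# average the T per-frame feature vectors into a single D-dim vector, so
# the classifier is a small D -> C layer instead of a huge T*D -> C one.
T, D, C = 8, 64, 10
feats = torch.randn(2, T, D)            # per-frame features from a 2D CNN

pooled = feats.mean(dim=1)              # temporal mean pooling -> (B, D)
# pooled = feats.max(dim=1).values      # max pooling is the other common choice

classifier = nn.Linear(D, C)            # D -> C, trained with cross-entropy on top
scores = classifier(pooled)
print(scores.shape)                     # torch.Size([2, 10])

# Weight-count comparison with the concatenation version:
concat_params = (T * D) * C             # weights of a T*D -> C layer
pooled_params = D * C                   # weights of a D -> C layer
print(concat_params // pooled_params)   # 8: the concat head is T times larger
```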
[00:14:34] So that's kind of the downside of this operation. The reason I'm calling it late fusion, the important part is "late", right? And when it's late, maybe some information has already been lost while using these 2D convolutional networks to process the images. For example, as shown in these red circles here, what is very important for recognizing this video is actually the motion of this man's feet: they are moving up and down, up and down, and you can maybe tell that he's running, right?
[00:15:07] So if we just use a single 2D CNN to process the frames independently, as 2D images, and extract feature maps, then maybe up to a very late stage the feature maps don't actually contain the information about this movement of the man's feet anymore. So some information, the feet going up and down as shown in these red circles, should be a useful cue, but now it's not there in the feature maps. The intuition is that if you extract features from the early layers, which are very close to the original video frames, there is a larger chance they will contain this low-level kind of information, the movement from the video frames; and then you can concatenate them or pool them across time.
[00:15:59] That will let you analyze the motion across time. But because we are processing through a lot of convolution, pooling, convolution, pooling, by a very late stage the features contain more high-level information instead of this low-level motion information. So that's why it's most likely lost there. So that's the downside of late fusion. So instead of doing late fusion, we can actually do early fusion.
[00:16:24] So to do early fusion, if we want to make use of feature vectors closer to the actual video frames, we can take the input and directly reshape it to 3T × H × W: we directly aggregate the information temporally from the very beginning. Then the first 2D convolution directly maps the channel dimension from 3T to D. Basically, we use the 2D convolution to process this temporal information in the first layer, so all the information from the frames is processed at the very beginning of the convolutional neural network. The rest of the network is then a standard 2D CNN, and the only difference is that now we destroy and collapse all the temporal information into a
single layer, and the rest is just like image classification: you do the classification using a standard cross-entropy loss. (On the pooling question: for each frame we get a feature vector of dimension D, so each single frame gives you a feature of dimension D and you have T of these feature vectors. For pooling, we pool over the features: we can do mean pooling to average the features, or max pooling to take the max over the features, and after that we still get a feature of dimension D. So it pools over the features, not over the frames.) [00:17:48] Okay, so that's early fusion. The downside of early fusion is that although we explicitly try to handle the motion from the early layers, we are being too ambitious: we're trying to capture everything in a
single layer. We just concatenate all the frames and collapse all the temporal information in a single convolution layer, and maybe that's not going to achieve what we wanted to achieve. So then another solution is that instead of doing late fusion or early fusion, maybe we should do something in between. That's kind of like slow fusion, and that's exactly what a 3D convolutional network is doing. The intuition is that we want to use the 3D versions of convolution and pooling to slowly fuse information over the course of the network. Instead of doing it at a very late stage or at a very early stage, we gradually shrink the temporal dimension and spatial dimensions to get 3D feature maps. So that's the idea of a 3D convolutional network: we just use 3D convolution and 3D pooling operations.
[00:18:57] So what are 3D convolution and 3D pooling? You have learned 2D convolution: you take an image, say 32 × 32 × 3, and for each kernel you have a filter, maybe a 5 × 5 × 3 convolution kernel, that runs in a sliding-window fashion across space. For each computation it maps that window to a single value in the final activation map, and finally you obtain an activation map of 28 × 28 × 1 in this case. You convolve over all spatial locations, and the filter goes all the way over the channel dimension, mapping the depth from
three down to one in this case. So that's 2D convolution. [00:20:03] The difference is that for 3D convolution we now have one extra dimension. Here you can think of the input as C × T × H × W; the extra thing is this T dimension, the temporal dimension. But what I'm showing here, because we can only draw things in 3D, not in 4D, there's actually one dimension that is not shown: the channel dimension C is not shown here.
[00:20:33] So you can think that, for each grid point in this feature map, there are C features at that grid point. Then for this 3D convolution, say a 6 × 6 × 6 kernel, because it has one extra dimension, instead of sliding only over the spatial dimensions H and W of the images, we are now sliding over this cube of dimensions T × H × W. So it covers both the spatial dimensions and the temporal dimension, and it also goes all the way along the channel dimension.
[00:21:19] So then, the rest works just like 2D convolution; it just has this extra dimension. You get the 3D 6 × 6 × 6 convolution, and maybe another layer of 5 × 5 × 5, and finally, after processing with these 3D convolution operations, you flatten the feature vectors and then use fully connected layers to map them to the class scores. So that is basically the idea of 3D convolution.
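The two pieces above can be sketched as follows; the layer sizes are illustrative, not a prescribed architecture. The first lines reproduce the 32 × 32 × 3 → 28 × 28 × 1 shapes from the 2D recap, and the small 3D CNN slides (t, h, w) kernels over the clip and ends with a fully connected layer.

```python
import torch
import torch.nn as nn

# 2D recap: a 5x5x3 filter over a 32x32x3 image gives a 28x28x1 map.
img = torch.randn(1, 3, 32, 32)
fmap = nn.Conv2d(3, 1, kernel_size=5)(img)
print(fmap.shape)                            # torch.Size([1, 1, 28, 28])

# A tiny 3D CNN (illustrative sizes): Conv3d slides a (t, kh, kw) kernel
# over the T x H x W cube and, like Conv2d, spans the full channel
# dimension; MaxPool3d shrinks time and space together.
num_classes = 10
net = nn.Sequential(
    nn.Conv3d(3, 12, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
    nn.ReLU(),
    nn.MaxPool3d(2),                         # halves T, H, and W
    nn.Conv3d(12, 24, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),                            # (B, 24)
    nn.Linear(24, num_classes),              # fully connected -> class scores
)

clip = torch.randn(2, 3, 20, 64, 64)         # (B, C, T, H, W)
scores = net(clip)
print(scores.shape)                          # torch.Size([2, 10])
```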
[00:21:49] So let's walk through some toy examples to better understand and compare early fusion, late fusion, and 3D convolutional networks, just to give you a flavor of how they work. In practice the networks can definitely be much larger and more complicated, but here I'm just using a toy example to walk through the sizes of the feature maps and also the receptive fields, to give you a sense of the differences between early fusion, late fusion, and 3D convolutional networks.
[00:22:21] So for late fusion, you can think that the input is, for example, 3 × 20 × 64 × 64, where 20 is the temporal dimension and 64 × 64 is the spatial dimension, and you use 2D convolutions. Because we're doing late fusion, we don't do anything over the temporal dimension initially: we just keep the temporal dimension at 20 and build up the receptive field spatially. We have a Conv2D layer to map the channel dimension from 3 to 12 but keep the temporal dimension at 20, and then maybe we use some pooling layers; we still haven't done anything with the temporal dimension, so it's still 20, but because of the pooling operation we build up the receptive field in the spatial dimensions. Then gradually we
maybe use another Conv2D layer, and now the feature map is 24 × 20 × 16 × 16; we have gradually increased the spatial receptive field but still kept the temporal dimension at 20, doing nothing over time. And finally, using a single global average pooling, we pool across the 20 × 16 × 16 feature map, over both time and the spatial dimensions, and from 20 × 16 × 16 we get a 1 × 1 × 1 feature point. So basically we collapse everything in the final single layer, and we build up the temporal receptive field in that single layer. That's late fusion. So then for early fusion, what's the difference? Instead of building slowly in space and all at once in time at the end, now we build slowly in space and all at once in time at the very beginning.
[00:24:02] So the input is still 3 × 20 × 64 × 64, but now we're just using a single Conv2D layer. We treat this 3 × 20 as the channel dimension, treat all of it as channels, and map it to 12. So basically we use a single 2D convolution layer to collapse all the temporal information from the very beginning: we build the temporal receptive field in the first layer, so the temporal receptive field jumps from 1 to 20. Then the spatial receptive field gradually builds up, and we use pooling and Conv2D to build up the spatial dimensions just as in late fusion. And finally we use a global average pooling, but now the global average pooling is only doing the averaging, the pooling, across space.
[00:24:54] So we build slowly in space, but all at once at the very beginning. That's early fusion. So then what is a 3D convolutional network? For a 3D convolutional network, we basically build slowly both in space and in time. That's why we call it slow fusion. The input can still be the same 3 × 20 × 64 × 64, but now we are using 3D convolutions. In the first layer we map the channels from 3 to 12, and in this case we also keep the temporal dimension, while building up a little bit of temporal and spatial receptive field. Then we use a pooling layer, say a 4 × 4 × 4 pooling layer, and we pool a little bit over the temporal features and also the spatial features, further building up both the spatial and temporal receptive fields.
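The two first-layer strategies can be checked on the toy 3 × 20 × 64 × 64 input; the kernel and pool sizes below are illustrative, not the lecture's exact numbers.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 20, 64, 64)            # (B, C=3, T=20, H=64, W=64)

# Early fusion: one Conv2d treats all 3*20 = 60 input planes as channels,
# so the temporal receptive field jumps from 1 to 20 in the first layer.
early = nn.Conv2d(3 * 20, 12, kernel_size=5, padding=2)
h_early = early(x.flatten(1, 2))             # time is collapsed immediately

# Slow fusion: Conv3d keeps T, then a 4x4x4 pool shrinks time and space
# together, so the temporal receptive field grows gradually instead.
conv3d = nn.Conv3d(3, 12, kernel_size=(3, 5, 5), padding=(1, 2, 2))
h_slow = nn.MaxPool3d(4)(conv3d(x))          # T: 20 -> 5, H and W: 64 -> 16

print(h_early.shape)                         # torch.Size([1, 12, 64, 64])
print(h_slow.shape)                          # torch.Size([1, 12, 5, 16, 16])
```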
[00:25:49] And we have another Conv3D layer to further build up the spatial and temporal receptive fields, and finally we use a global average pooling, but now we're pooling over this 4 × 16 × 16 feature map, further increasing the temporal and spatial receptive fields. So we are building up gradually in both space and time. So that's kind of the difference between early fusion, late fusion, and 3D convolutional networks. You can see that both early fusion and 3D convolutional networks build a receptive field over time, right? But what's the actual difference? So let's look at it more closely. Think of a feature vector for each spatial grid point. The convolution filter, if it's a 2D convolution,
for this grid point will consider everything along the temporal dimension, T = 16 here. So it is local in space but extends fully in time; that's the filter in the 2D convolutional network. But what is the problem? Think about it: if we directly go all the way through the time dimension with this 2D convolution, what problem is going to happen? The shortcoming is that there will be no temporal shift invariance, because the filter now extends fully in time. Suppose we want to learn some global transition in color that can happen at different times. It's a video, so when we recognize temporal information, there may be some change, say from blue to orange, at different time steps.
[00:27:46] Maybe there's some change happening at time step 4, and another identical change happening at time step 15; it's the same change, from blue to orange. If we go all the way through time, with the filter extending fully in time, then to learn this same transition at different times we have to have a whole separate filter, we have to learn a different kernel, for each of these transitions at each different time stamp. So there's no temporal shift invariance. So how do we recognize this kind of blue-to-orange transition anywhere in space and time? Just like when we are doing image classification, we want to have some spatial invariance.
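A toy 1D illustration of this invariance point, under the assumption that the transition detector is a simple finite-difference kernel: the same small sliding filter fires at whichever time step the change occurs, whereas a filter spanning all of T would need separate weights for each timing.

```python
import torch
import torch.nn as nn

with torch.no_grad():
    # Hand-set temporal filter that fires on a 0 -> 1 jump.
    step = nn.Conv1d(1, 1, kernel_size=2, bias=False)
    step.weight.copy_(torch.tensor([[[-1.0, 1.0]]]))  # finite difference

    a = torch.zeros(1, 1, 20); a[..., 4:] = 1.0       # transition at t=4
    b = torch.zeros(1, 1, 20); b[..., 15:] = 1.0      # same transition at t=15

    # The same shared weights detect both; output peaks track the jumps.
    print(step(a).argmax().item(), step(b).argmax().item())  # 3 14
```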
We want to okay be able to recognize the [00:28:37] We want to okay be able to recognize the cat the image contains a cat no matter [00:28:39] cat the image contains a cat no matter where the cat is on the right corner [00:28:40] where the cat is on the right corner left corner right we want to to share [00:28:43] left corner right we want to to share the you know the the the kernels to be [00:28:45] the you know the the the kernels to be able to you know to recognize things at [00:28:47] able to you know to recognize things at different at a different spatial [00:28:49] different at a different spatial location here we want to be able to [00:28:51] location here we want to be able to learn this different types of motion [00:28:53] learn this different types of motion different time this temporal patterns at [00:28:55] different time this temporal patterns at different you know temporal time steps [00:28:57] different you know temporal time steps so that's kind of similar idea so then [00:28:59] so that's kind of similar idea so then the that's exactly the benefit of 3D [00:29:01] the that's exactly the benefit of 3D convolution neuronet networks right Now [00:29:04] convolution neuronet networks right Now instead of extends fully in time right [00:29:06] instead of extends fully in time right in this that t dimension originally for [00:29:09] in this that t dimension originally for this uh early fusion t extends all the [00:29:11] this uh early fusion t extends all the way in the temporal dimension t is [00:29:13] way in the temporal dimension t is equals to 16 but now t t is equal to [00:29:16] equals to 16 but now t t is equal to three and we can slide over the temporal [00:29:18] three and we can slide over the temporal dimension right just like uh we learn [00:29:20] dimension right just like uh we learn this uh spatial invariance using filter [00:29:22] this uh spatial invariance using filter on local regions now this count filter [00:29:24] on local regions 
now this count filter only span a local window in time and [00:29:27] only span a local window in time and slide over in the time dimension. So the [00:29:30] slide over in the time dimension. So the then the the benefit is that now we can [00:29:32] then the the benefit is that now we can have some temporal shift invariance [00:29:35] have some temporal shift invariance because each filter slides over time. So [00:29:37] because each filter slides over time. So we can reuse this filter to recognize [00:29:39] we can reuse this filter to recognize different motion patterns uh across uh [00:29:42] different motion patterns uh across uh these dimensions. So the transition from [00:29:44] these dimensions. So the transition from blue to orange can now be recognized at [00:29:46] blue to orange can now be recognized at every moment in time. Right? Uh and then [00:29:50] every moment in time. Right? Uh and then the benefit of of this is that we don't [00:29:52] the benefit of of this is that we don't have to have separate filters, right? [00:29:53] have to have separate filters, right? Then now we are more efficient, more [00:29:55] Then now we are more efficient, more representation efficient. we don't need [00:29:57] representation efficient. we don't need to know separate futures anymore. So [00:30:00] to know separate futures anymore. So that's basically the main difference [00:30:02] that's basically the main difference between 2D con early fusion and the 3D [00:30:05] between 2D con early fusion and the 3D convolutional network. [00:30:08] convolutional network. 
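To make the shift-invariance point concrete, here is a minimal toy sketch (my own illustration, not from the lecture slides): a 1-D "color over time" signal and a short temporal filter slid across it, the way a 3D convolution's t = 3 kernel slides over the clip. The helper names `make_clip` and `detect` are hypothetical.

```python
import numpy as np

# Toy illustration (my own, not from the slides): "color over time" as a 1-D
# signal with 0 = blue and 1 = orange, over T = 16 time steps.
T = 16

def make_clip(step):
    """A clip whose color flips from blue (0) to orange (1) at `step`."""
    clip = np.zeros(T)
    clip[step:] = 1.0
    return clip

# A short temporal filter (like the t = 3 kernel of a 3D convolution) that
# responds to a 0 -> 1 transition. Sliding it over time is what gives
# temporal shift invariance.
transition_filter = np.array([-1.0, 0.0, 1.0])

def detect(clip):
    """Slide the filter over time; return where the transition was found."""
    responses = np.correlate(clip, transition_filter, mode="valid")
    return int(np.argmax(responses)) + 2  # offset to the transition index

print(detect(make_clip(4)), detect(make_clip(12)))  # → 4 12
```

The same small filter finds the transition at time step 4 and at time step 12; a filter spanning all 16 steps would need a separate copy per position.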
[00:30:08] Also, in the last lecture you saw some tools we can use to visualize what a 2D convolutional network has learned. Similarly, we can visualize the filters of a 3D convolutional network, this time as short video clips, because each filter now extends in both space and time. [00:30:48] Looking at the learned filters of the 3D convolutional network as video clips, some of them look just like the filters of a simple image classifier: color patterns and various edges. But you can also see filters that capture a temporal transition, from one color to another, or from one edge pattern to another. So some filters don't learn motion and focus on appearance, like the color patterns, while others learn motion in different directions. We can visualize the kernels like this to interpret them. [00:31:29] To the question about slow fusion: basically there are two differences, and one is in the convolution operation itself. Yes, indeed, this is basically 3D convolution.
[00:31:37] And 3D convolution and 2D convolution are genuinely different: you have an additional dimension of convolution, the temporal dimension. So the main difference is this temporal dimension in the convolution operation, but practically, if you use 3D convolutions, the network also gradually builds up its receptive field over both space and time. [00:32:01] So we have talked about 3D convolutional networks as architectures, but what data can we use, in the way ImageNet serves image classification, to train a video classifier? One example challenge dataset that people have been tackling is Sports-1M, introduced in 2014.
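Returning briefly to the receptive-field remark above, here is a back-of-the-envelope sketch (my own helper, assuming stride-1 layers) of how the receptive field grows as 3x3x3 convolution layers are stacked: each layer adds (kernel - 1) steps along every dimension, including time.

```python
# A back-of-the-envelope sketch (stride 1 assumed) of how the receptive
# field grows when 3x3x3 convolution layers are stacked: each layer adds
# (kernel - 1) steps along every dimension, including time.
def receptive_field(num_layers, kernel=3):
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf

for layers in (1, 2, 4, 8):
    print(layers, receptive_field(layers))
# After 8 layers, each output voxel already sees 17 time steps, enough to
# cover a whole 16-frame clip, so the network builds up its view gradually.
```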
[00:32:28] With this dataset the task is very fine-grained sport category classification. In the figure, blue shows the ground truth, below it are the top-five predictions, green marks a correct prediction, and red an incorrect one. The categories are very fine-grained: there are 487 different types of sports, for example marathon versus ultramarathon (I actually don't know the difference between them), so the dataset covers many closely related sports categories. [00:33:09] And here are some results from training the different kinds of classifiers we have talked about on Sports-1M. One very shocking result is the single-frame model, which I asked you to try first if you want to develop a video classification model: it actually performs very well. Trained purely as an image classifier, it already gives about 77.7% top-five accuracy. The early fusion we talked about actually does slightly worse, late fusion does slightly better, and a 3D convolutional network gets roughly a 2-3% boost on this dataset. [00:33:59] So the takeaway message is: definitely try the single-frame model, because it usually works pretty well. And the 3D convolutional network shown here is the one used in 2014.
[00:34:19] But over the past ten years we have seen a lot of advancements, so the numbers have also gotten much better, as I will talk about in a later slide. To the question: yes, for both training and testing it just treats videos as images and trains an image classifier; that is exactly what the single-frame model does. If I understand the question correctly, it uses an image classifier, but it is trained on many frames drawn from the videos, not a single frame per video. [00:34:43] Also, this is a huge dataset, because, as I mentioned, videos are very large. When people share video datasets, we cannot do it the way ImageNet is shared, where people just download from a database, because videos are really huge; this dataset has around one million videos.
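A hedged sketch of that single-frame baseline: classify each frame independently with an ordinary image classifier, then average the per-frame class scores into one video-level prediction. `frame_classifier` and `classify_video` are hypothetical names, and the "classifier" here is just a fixed random projection plus softmax standing in for a trained network.

```python
import numpy as np

# Sketch of the single-frame baseline: classify each frame independently,
# then average per-frame class probabilities into one video prediction.
# `frame_classifier` is a stand-in (random projection + softmax), not a
# trained model.
rng = np.random.default_rng(0)
NUM_CLASSES = 487                              # Sports-1M sport categories
W = rng.normal(size=(3 * 8 * 8, NUM_CLASSES))  # toy 8x8 RGB "frames"

def frame_classifier(frame):
    logits = frame.reshape(-1) @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()                         # per-frame class probabilities

def classify_video(frames):
    probs = np.mean([frame_classifier(f) for f in frames], axis=0)
    return int(np.argmax(probs))               # video-level prediction

video = rng.random((16, 3, 8, 8))              # 16 frames, channels-first
print(classify_video(video))
```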
[00:35:07] It is not really feasible to download and host all of them for sharing. When this dataset was originally released, it was actually shared as a list of YouTube URLs. But one thing you can expect with YouTube URLs is that people modify and delete their videos, so the original list may have had one million videos, but by now perhaps half of them are already gone. So, for this reason, the dataset is not very stable. [00:35:39] OK. So, as I mentioned, 3D convolutional networks have been improving gradually since around 2014. One early, popular version is a model called C3D, and it is actually very, very simple.
[00:36:03] Basically, it is very similar to the VGG architecture we used for 2D image classification, except everything is converted to three-dimensional convolutions. The 3D CNN uses 3x3x3 convolutions and 2x2x2 pooling (with some changes in the first layer), so the overall architecture is very much like VGG, just with the extra dimension; that is why it is called the VGG of 3D CNNs. The model was trained on the Sports-1M dataset I just mentioned. And because it was introduced in 2014, imagine training such a model then: it needs a lot of compute, and not many people had access to a lot of GPUs at that time.
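A sketch of how tensor shapes evolve through a C3D-style stack: 3x3x3 convolutions with padding 1 keep (T, H, W) unchanged, and 2x2x2 poolings halve all three. I am assuming the published C3D detail that the first pooling is 1x2x2 so time is not collapsed too early; the exact layer counts here are illustrative, not the full architecture.

```python
# Shape walk through a C3D-style stack: 3x3x3 convs (padding 1) preserve
# (T, H, W); 2x2x2 poolings halve all three. First pooling assumed 1x2x2,
# as in the published C3D, so time is not collapsed too early.
def conv3x3x3(shape):
    return shape                    # padding 1, stride 1: size preserved

def pool(shape, k):
    t, h, w = shape
    kt, kh, kw = k                  # kernel = stride = k in each dimension
    return (t // kt, h // kh, w // kw)

shape = (16, 112, 112)              # a 16-frame clip of 112x112 crops
shape = pool(conv3x3x3(shape), (1, 2, 2))   # first pool keeps time
for _ in range(4):
    shape = pool(conv3x3x3(shape), (2, 2, 2))
print(shape)  # → (1, 3, 3): time is fully collapsed by the last stage
```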
[00:37:09] This model was actually trained at Facebook, and they released the pretrained weights: they trained this 3D model on Sports-1M and released the pretrained model as a feature extractor. So many people who could not afford to train a video model themselves started to use it as a feature extractor: you take a video, extract features from it with the pretrained C3D model, and then maybe train some other classifier on top. People started to use it that way, and that is how it got popular. [00:37:45] The question, basically, is about video classification: how many frames should we take as input when extracting the features?
[00:37:52] For all the models we are talking about here, we assume we pass in a clip of predefined length, like 16 or 32 frames: you train a single model that always takes 16 (or 32) frames as input. There are other techniques, which we will talk about, for aggregating these clip-level predictions, but for now we are just doing clip-level feature extraction.
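The clip-level pipeline just described can be sketched as follows: cut the video into fixed-length 16-frame clips, map each clip to a feature vector with a frozen pretrained model, and average the clip features into one video-level descriptor for a downstream classifier. `c3d_features` here is a random stand-in for the real frozen network, present only to show the data flow.

```python
import numpy as np

# Clip-level feature extraction with a frozen pretrained model: cut the
# video into 16-frame clips, featurize each, average into one descriptor.
# `c3d_features` is a random stand-in for the real frozen network.
rng = np.random.default_rng(0)
CLIP_LEN, FEAT_DIM = 16, 4096       # C3D's fc features are 4096-d

def c3d_features(clip):
    return rng.normal(size=FEAT_DIM)   # stand-in: any clip -> 4096-d vector

def video_descriptor(video):
    n_clips = len(video) // CLIP_LEN
    clips = [video[i * CLIP_LEN:(i + 1) * CLIP_LEN] for i in range(n_clips)]
    return np.mean([c3d_features(c) for c in clips], axis=0)

video = rng.random((80, 3, 112, 112))   # 80 frames -> 5 clips of 16
print(video_descriptor(video).shape)    # → (4096,)
```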
[00:38:17] The downside of this 3D CNN is that it is very computationally expensive: we basically just converted the VGG style from 2D to 3D in a brute-force way. The GFLOPs figure measures how many billions of floating-point operations a single forward pass needs, basically whether the network is efficient. AlexNet takes 0.7 GFLOPs, VGG-16 takes about 13.6 GFLOPs, but C3D, obtained by this direct mapping from 2D to 3D, takes about 39.5 GFLOPs, which is 2.9 times VGG. So it is not very efficient; that is the downside of this kind of network. [00:39:10] And if we look at performance on Sports-1M, C3D now gets about a 4% gain in top-five accuracy. [00:39:25] But this is just one example of a 3D convolutional network; there can definitely be others. We have talked about a lot of tricks for 2D image classification, like the residual connections you have seen in ResNet, and we can certainly apply those too, improving, say, C3D by adding residual connections or other techniques from 2D convolutions. Indeed, there is a lot of work, and many papers, on improving these different types of 3D video architectures. [00:39:58] But apart from that, let's think a bit more about whether we should treat space and time separately.
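Backing up to the FLOP figures quoted a moment ago, the comparison in one line (numbers as stated in the lecture):

```python
# GFLOPs per forward pass, as quoted above.
gflops = {"AlexNet": 0.7, "VGG-16": 13.6, "C3D": 39.5}
ratio = gflops["C3D"] / gflops["VGG-16"]
print(f"C3D costs {ratio:.1f}x a VGG-16 forward pass")  # → 2.9x
```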
[00:40:09] Spatial information and temporal information are indeed very different things, so maybe we should explicitly try to model the thing that exists only temporally, which is motion. We humans actually do an incredible job of processing motion. So take a guess: what actions are the humans doing in this simple video? You can say it out loud if you want. [00:40:40] What is this? Sitting? Yes. Just from these very few points you can actually do a pretty good job of recognizing what actions this person, or maybe these two people, are doing. [00:41:02] Notice there is no appearance information at all, just a few points, just motion, and we can still get a very good understanding of the activities going on in these videos. So how we process appearance and how we process motion might be very different.
[00:41:19] Maybe we should have separate networks to process them. That is indeed the motivation for a piece of work introduced in 2014 that proposed a two-stream network to process appearance information and motion information separately. One way to explicitly measure motion is the concept called optical flow. The idea of optical flow is to measure the motion of pixels across adjacent frames: for every pixel in the first frame, how is it going to move in the second frame? It computes a velocity for points within the frames and provides an estimate of where each point will be in the next frame of the sequence. [00:42:09] For example, between frames t and t + 1, the flow field has two dimensions and tells us where each pixel will move in the next frame: F(x, y) = (dx, dy), such that I_{t+1}(x + dx, y + dy) = I_t(x, y), i.e., the pixel at (x, y) in the current frame ends up at (x + dx, y + dy) in the next frame. So this is an explicit way to measure the motion of pixels. There are many research papers on how to actually compute optical flow given a pair of frames, under different kinds of assumptions; some work, for example, assumes that brightness stays constant as things move, and proposes techniques to compute the flow under that assumption.
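A toy check of the flow relation above, under the simplest possible assumption (my own synthetic example): a constant flow field where every pixel moves by (dx, dy) = (2, 1), so the next frame is just a shifted copy of the current one and the brightness-constancy equation holds exactly.

```python
import numpy as np

# Toy check of I_{t+1}(x + dx, y + dy) = I_t(x, y) for a constant flow
# field: every pixel moves by (dx, dy) = (2, 1), so the next frame is a
# shifted copy of the current one.
rng = np.random.default_rng(0)
I_t = rng.random((32, 32))                          # current frame
dx, dy = 2, 1
I_t1 = np.roll(I_t, shift=(dy, dx), axis=(0, 1))    # rows are y, columns x

x, y = 10, 20
assert np.isclose(I_t1[y + dy, x + dx], I_t[y, x])  # brightness constancy
print("brightness constancy holds for this synthetic pair")
```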
But once you get it, it basically captures the motion information for two [00:43:04] captures the motion information for two adjacent frames. And also you can [00:43:06] adjacent frames. And also you can because there are two dimensions, right? [00:43:08] because there are two dimensions, right? Because it's trying to uh capture how [00:43:10] Because it's trying to uh capture how pixels move horizontally and vertically. [00:43:13] pixels move horizontally and vertically. So you can actually also visualize it [00:43:15] So you can actually also visualize it separately. You can visualize it the [00:43:16] separately. You can visualize it the horizontal motion horizontal flow dx and [00:43:19] horizontal motion horizontal flow dx and also you can visualize the vertical flow [00:43:21] also you can visualize the vertical flow dy. You can see that there capture some [00:43:23] dy. You can see that there capture some you know horizontal motion and the [00:43:24] you know horizontal motion and the vertical motion. Uh so we capture this [00:43:26] vertical motion. Uh so we capture this kind of low-level motion cues. So once [00:43:29] kind of low-level motion cues. So once you have a way to capture this kind of [00:43:30] you have a way to capture this kind of motion cues as optical flow uh then [00:43:32] motion cues as optical flow uh then people trying to you know propose a [00:43:34] people trying to you know propose a two-stream networks to se uh to train a [00:43:37] two-stream networks to se uh to train a motion classifier and appearance [00:43:38] motion classifier and appearance classifier. So this is a famous [00:43:40] classifier. So this is a famous twostream network for action [00:43:41] twostream network for action recognition. So basically it has a one [00:43:43] recognition. 
[00:43:46] Basically it has a single-frame model doing appearance classification to tell what the action is, and then a separate temporal stream that takes multi-frame optical flow: for every two adjacent frames it computes the optical flow map, treats the horizontal flow and the vertical flow separately and stacks them together, processes them with a temporal-stream convolutional neural network to make a prediction, and then aggregates the prediction results from both the motion stream and the appearance stream into a final prediction. So that's the idea of the two-stream network, and it actually works pretty well. This is on another dataset called UCF-101; there are 101 action categories in this dataset.
[00:44:35] One surprising thing you can see is that using only motion actually works surprisingly well. If you compare the performance of the 3D ConvNet, the spatial-only model (just the appearance stream), and the temporal-only model (the motion stream), you can see that the motion stream works much better than the spatial-only stream. My hypothesis is that it's less easy to overfit: in the appearance there is a lot of background information which may not be important for action classification, but the motion stream contains the cue, the key information, the movements, which is harder to overfit to, so you can get better results on this dataset.
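The final aggregation step of the two streams can be sketched as simple late fusion: each stream outputs class scores and their softmax probabilities are averaged. The class count and logit values below are made-up stand-ins, not numbers from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical per-stream logits over 5 action classes, standing in for the
# outputs of the spatial (appearance) and temporal (flow) ConvNets.
spatial_logits  = np.array([2.0, 0.5, 0.1, -1.0, 0.0])
temporal_logits = np.array([1.5, 2.5, 0.0, -0.5, 0.2])

# Late fusion: average the class probabilities of the two streams.
fused = 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))
print(int(np.argmax(fused)))   # index of the fused prediction
```

The original work also reports training an SVM on stacked stream scores; averaging is the simplest variant.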
[00:45:26] So far we have been talking about short-term structure in videos. Earlier folks were asking how many frames we should actually use for classification, and it's definitely very important to model long-term temporal structure to recognize things more distant in time. We already have the tools to handle sequences: recurrent networks, which we used to process sequences of words for things like captioning and prediction tasks. So we can use similar tools, recurrent neural networks, on top of a convolutional network, no matter whether it's a single-frame convolutional
[00:46:25] network giving a 2D feature vector or a 3D convolutional network giving a feature vector per clip. If you have a much longer video, we can extract these feature vectors and then just use the RNNs or LSTMs we have talked about to model the long-term temporal structure, right? We process the local features with a recurrent network and maybe make a final prediction at the last time step. If we want a single video-level classification, we just do a many-to-one mapping:
[00:46:58] one output at the end of the video. Or we can do a one-to-one mapping like we talked about: for each frame we make a prediction, since there may be some predictions we want for every video frame, and we get an output from the LSTM or recurrent network at each time step. This kind of idea was actually already explored in 2011, which was way ahead of its time, since AlexNet was only introduced in 2012, but it was popularized by a 2015 paper. If you want to train this kind of recurrent architecture for modeling long-term temporal structure, you can often backpropagate only through the RNN layers: you can freeze the CNNs, pretrain them on some clips or on image classification, because otherwise you
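A minimal sketch of the many-to-one pipeline just described (all sizes are mine, and the per-frame features here are random stand-ins for what a frozen, pretrained ConvNet would produce):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend a frozen ConvNet has already turned each of T frames into a
# D-dimensional feature vector; a vanilla RNN then models long-term
# temporal structure and a classifier reads only the final hidden state.
T, D, H, C = 8, 16, 32, 10          # frames, feature dim, hidden dim, classes
feats = rng.normal(size=(T, D))     # stand-in for per-frame CNN features

Wxh = rng.normal(size=(D, H)) * 0.1
Whh = rng.normal(size=(H, H)) * 0.1
Why = rng.normal(size=(H, C)) * 0.1

h = np.zeros(H)
for t in range(T):                  # many-to-one: keep only the last state
    h = np.tanh(feats[t] @ Wxh + h @ Whh)
logits = h @ Why                    # one video-level prediction
print(logits.shape)                 # (10,)
```

For the one-to-one variant you would instead read out `h @ Why` at every time step to get a per-frame prediction.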
[00:47:56] have a huge network, a recurrent part plus a convolutional part, and it's very hard to train end to end. So you can just use the CNN or C3D as a feature extractor and train the recurrent network on top. So we have already seen two approaches to model the temporal structure. How about we combine these two approaches, convolutional networks and recurrent networks? Both of them have some advantages, so maybe we can combine them in a single architecture to process video data, right?
[00:48:30] Indeed, we can take some inspiration from the multi-layer recurrent networks we have talked about: each time step takes the hidden state from the previous time step in the same layer, and also the output from the same time step in the previous layer. That's basically the idea of a multi-layer RNN. Similarly, we can do it for videos: we can use recurrent convolutional neural networks. It's very similar, except now we build a grid of feature maps, where each one is a three-dimensional tensor, with two spatial dimensions and one channel dimension. So each feature map is of dimension C × H × W.
[00:49:14] Each feature map depends on two inputs: the feature map from the same layer at the previous time step, and the feature map from the previous layer at the same time step, right? Recall that in a 2D convolutional network we just map an input feature map to an output feature map; here, for the recurrent convolutional network, we take as input these two 3D tensors, one from the same layer at the previous time step and one from the previous layer at the same time step. And recall that a recurrent network has this form: it has some hidden feature map h_{t-1}, and it takes the input at the current time step.
[00:50:06] It applies some function with parameters W and produces the new hidden state h_t. That's basically the key of an RNN. Now, instead of this vector form of the RNN, we just replace all the matrix multiplications in the recurrent network with 2D convolutions, and we get a recurrent convolutional network. You take the hidden feature map and do a 2D convolution instead of a matrix multiplication to get another feature map; you do the same for the features from the previous layer at the same time step; then you add them together, apply a tanh, and you get the feature map for the current hidden layer. That's basically the idea of the recurrent convolutional network: we combine convolution operations and recurrent
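The update just described, with the RNN's matrix multiplies swapped for 2D convolutions, can be sketched like this (single-channel maps and a hand-rolled 3×3 "same" convolution, purely for illustration):

```python
import numpy as np

def conv2d(x, w):
    """'Same' 2D convolution (cross-correlation), single channel, 3x3 kernel,
    zero padding -- just enough machinery to show the recurrence."""
    H, W = x.shape
    xp = np.pad(x, 1)
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

def rec_conv_step(x_t, h_prev, Wx, Wh):
    """One recurrent-ConvNet step: h_t = tanh(conv(x_t) + conv(h_{t-1})),
    i.e. the vanilla-RNN update with matmuls replaced by 2D convolutions."""
    return np.tanh(conv2d(x_t, Wx) + conv2d(h_prev, Wh))

rng = np.random.default_rng(1)
Wx = rng.normal(size=(3, 3)) * 0.1
Wh = rng.normal(size=(3, 3)) * 0.1
h = np.zeros((8, 8))
for t in range(4):                       # unroll over 4 "frames"
    h = rec_conv_step(rng.normal(size=(8, 8)), h, Wx, Wh)
print(h.shape)                           # (8, 8)
```

A real version would use multi-channel C × H × W maps and learned kernels; the structure of the update is the same.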
[00:50:59] operations. We can actually do this for any recurrent network variant, like the GRUs and LSTMs you may have learned about in a previous class. So now we can successfully combine the benefits of the two: we have both spatial and temporal fusion inside this recurrent convolutional network. But this model was not used too much, because there is one large downside of recurrent neural networks, which you have already learned: RNN units are very slow at processing long sequences, and videos are usually very, very long. You want to process them in parallel, but RNNs are very hard to parallelize. But there is another important model you have learned about, I think in the previous lectures: we can also use operations
[00:51:53] like self-attention to process videos, right? For self-attention you have queries, keys, and values, and you can use self-attention as a standalone operation to process images; we can also do it for videos. One very large advantage of self-attention is that it is highly parallelizable: the attention scores for all the inputs can be computed completely in parallel.
[00:52:23] So indeed people have tried to use self-attention in videos too: they just apply self-attention directly in 3D. Maybe you have some 3D convolutional network and you get a feature map of shape C × T × H × W. Then, to get the query feature map, you can use 1×1×1 3D convolutions to change the channel dimension and map it to a query feature map of shape C′ × T × H × W; similarly you get a feature map for the keys and one for the values. Then you want to compute attention weights, right? Basically you transpose the query feature map and take pairwise dot products, so you get an attention score for each query-key pair; you get this attention map and then use it to weight the values,
And you can uh them to you can [00:53:15] right? And you can uh them to you can get another value kind of feature [00:53:17] get another value kind of feature feature map and then you can map them [00:53:20] feature map and then you can map them you do another one times one time one [00:53:21] you do another one times one time one convolution to map them back to you know [00:53:23] convolution to map them back to you know the same dimension C so that you can be [00:53:25] the same dimension C so that you can be concatenated with the original feature [00:53:27] concatenated with the original feature input. So that is a resid connection. So [00:53:30] input. So that is a resid connection. So in total you can see that it's very [00:53:32] in total you can see that it's very similar to the self attention uh uh [00:53:35] similar to the self attention uh uh operations but now we move things to 3D [00:53:37] operations but now we move things to 3D and this is some one block that is very [00:53:40] and this is some one block that is very you know independent it can stand on its [00:53:42] you know independent it can stand on its own right you can so that's so in this [00:53:44] own right you can so that's so in this uh paper it's called looo neuronet [00:53:46] uh paper it's called looo neuronet network it introduces kind of block and [00:53:48] network it introduces kind of block and call local block you can use it as uh a [00:53:51] call local block you can use it as uh a kind of building block for uh processing [00:53:53] kind of building block for uh processing videos to do video understanding for [00:53:55] videos to do video understanding for example you can just add this unknown [00:53:57] example you can just add this unknown local blocks uh into existing 3D [00:54:00] local blocks uh into existing 3D convolutional network architectures and [00:54:02] convolutional network architectures and uh to you know to have some 3DC and have [00:54:05] uh to you know to have 
[00:54:07] some 3D conv layers with a non-local block, then another block of 3D convs with another non-local block, and so on; each non-local block is very powerful at fusing across both space and time, and finally you do the classification. But one thing we haven't talked about is what this 3D convolutional network should be, what we should use here. Another very interesting idea people have explored in the past is: can we reuse the many successful 2D convolutional network architectures we have talked about, directly in 3D? We can just do an inflation of the 2D network, and then we get a 3D convolutional network. This work is called the I3D architecture. The idea is that they take a 2D architecture and replace each 2D
[00:55:03] conv or pooling layer, a layer that originally has kernel size k_h × k_w, with a 3D version of size k_t × k_h × k_w; they just inflate it, basically, and they do this on top of the Inception architecture. After doing this inflation you have an architecture for processing videos directly, just reusing the existing architectures; now we can transfer an architecture that works pretty well in 2D to also work in 3D. But taking one step further, people have also been trying things where not only can we transfer the architectures, we can actually also transfer the weights, because we have already pretrained a lot of models on image datasets; maybe we can actually use the weights we have learned there, since they carry some good prior information. So
[00:56:07] one thing you can do is initialize the inflated CNN with weights trained on images. Originally you have a 2D conv kernel; you just copy the kernel k_t times along the time dimension and divide it by k_t. Originally the network takes a single image as input; now it takes a video of shape 3 × k_t × H × W as input. Because we have divided by k_t when copying the weights k_t times, you get the same output whether you input a single frame or a video of constant frames. So now we have a way to recycle these existing 2D image-based architectures and the weights from 2D image understanding.
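The inflation trick can be checked numerically: copy a 2D kernel k_t times, divide by k_t, and the response on a temporally constant video matches the 2D response on a single frame. This is a self-contained sketch of that identity, not the actual I3D code:

```python
import numpy as np

rng = np.random.default_rng(3)
kt, kh, kw = 3, 3, 3
w2d = rng.normal(size=(kh, kw))           # a pretrained 2D conv kernel

# I3D-style inflation: copy the 2D kernel kt times along time, divide by kt.
w3d = np.repeat(w2d[None, :, :], kt, axis=0) / kt

# A "video" patch whose kt frames are all the same image patch.
patch = rng.normal(size=(kh, kw))
video_patch = np.repeat(patch[None, :, :], kt, axis=0)

# Single-location conv responses: the inflated 3D filter reproduces the
# 2D filter's response exactly on the temporally constant input.
resp2d = np.sum(patch * w2d)
resp3d = np.sum(video_patch * w3d)
print(np.isclose(resp2d, resp3d))         # True
```

This is why the copied-and-rescaled weights give a sensible starting point before fine-tuning on video data.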
[00:57:05] And it actually works pretty well. If you look at the performance, the inflated network has better performance than the two-stream convolutional network, and you can inflate not only the appearance stream but also the motion stream, which gets you some further improvements. Basically this is just a technique you can apply, independent of the particular 3D convolutional network.
[00:57:29] You can also build in these non-local blocks. But what I'm trying to say in this part is that we have a lot of 2D convolutional networks whose weights people have shown to be very successful, and if we want to reuse them, people have shown that you can actually copy the weights and reuse them directly to operate on videos. So that's basically the highlight idea. And after doing this initialization you can still fine-tune on the video data, but you have the pretrained weights from images, which gives you a good initialization for training the video models. So that is the idea of the I3D network: basically copy the weights and do the inflation. Okay.
[00:58:23] So this is just one example of a video understanding model, and many other video transformer models have been proposed for video understanding. For example, the work on space-time attention does more factorized attention, decomposing it across space and time; other methods try to make the transformer architecture more efficient; and there are masked-autoencoder approaches, which you have heard about, for more efficient, scalable video-level pre-training for video understanding.
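As a rough illustration of what factorizing attention across space and time means (in the spirit of divided space-time attention; this NumPy sketch omits learned projections, heads, and residual connections, and all names are mine):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    # Plain self-attention with identity projections: (..., L, D) -> (..., L, D)
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def divided_spacetime_attention(x):
    """x: (T, N, D) tokens for T frames of N patches each. Attend over time
    for each spatial location, then over space within each frame; cost is
    O(N*T^2 + T*N^2) instead of O((T*N)^2) for joint space-time attention."""
    x = attend(np.swapaxes(x, 0, 1))   # (N, T, D): temporal attention per patch
    x = attend(np.swapaxes(x, 0, 1))   # (T, N, D): spatial attention per frame
    return x

tokens = np.random.default_rng(1).standard_normal((8, 16, 32))
out = divided_spacetime_attention(tokens)
```

The efficiency argument is in the attention-matrix sizes: joint attention builds a (T*N) x (T*N) matrix, while the factorized version only ever builds T x T and N x N matrices.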
[00:58:57] I'm not going to talk about them here in class, but if you are interested you can check out the papers, because a lot of progress has been made toward better video understanding models. If you look at the progression of performance, we start from a single-frame model at 62.2 on Kinetics-400, which is another large video dataset, and the video masked autoencoder already gets to about 90% accuracy, with other new transformer models proposed since. So we are doing very well at classifying videos. [00:59:39] And similar to image classification in the last class, we can use similar tricks for visualizing video models.
[00:59:46] We can take the two-stream network as an example: we randomly initialize the appearance image and the flow image, do a forward pass, compute the score, and then back-propagate with respect to the score of a particular class, using gradient ascent to maximize the classification score, just like we did when visualizing the image-based models. In this way we can visualize and interpret what has been learned. On the left is the optimized image for the appearance stream; maybe it's hard to guess what is happening there. On the right is the optimized image for the flow stream, which has some temporal constraints on the
temporal stream so that it does not change too fast; [01:00:41] one version captures slow motion and the other captures faster motion. So you can guess what the action is; maybe in this case it's pretty clear. So what action is this? [01:00:52] It's weight lifting. You can see that the middle one is doing some bar shaking, right? And the right one is doing some overhead pushing motion. So indeed you can see that these video action models are learning something about motion. [01:01:13] Okay.
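The visualization-by-gradient-ascent recipe just described can be condensed into a toy sketch. This is NumPy with a stand-in linear "classifier" so the gradient is available in closed form; in the lecture's setting, the score and its gradient come from the trained two-stream network instead:

```python
import numpy as np

def class_visualization(grad_fn, x0, steps=200, lr=0.1, reg=1e-2):
    """Gradient ascent on the *input* to maximize a class score, with a
    small L2 penalty, the same trick used for image-model visualizations."""
    x = x0.copy()
    for _ in range(steps):
        x += lr * (grad_fn(x) - reg * x)
    return x

# Toy stand-in model: linear score s(x) = w . x, so ds/dx = w everywhere.
rng = np.random.default_rng(0)
w = rng.standard_normal(64)
x_opt = class_visualization(lambda x: w, np.zeros(64))
score = w @ x_opt
```

For a real video model, `grad_fn` is one backward pass through the network, and the optimized "input" is the appearance image or the flow stack being visualized.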
[01:01:18] So far I have been talking about how we can classify short clips: swimming, running, and so on. But another very important task is called temporal action localization: not only do we want to do clip-level classification, sometimes we want to localize, just as in object detection, where in the video the action is happening; maybe sometimes the person is running and sometimes jumping. For this you can use ideas similar to Faster R-CNN: generate temporal proposals and then do the classification. [01:02:03] And you can also do both at once, spatio-temporal detection, where you
want to localize the action not only in space but also in time: where it is happening spatially and when it is happening temporally. [01:02:16] So this is another task, called spatio-temporal detection. [01:02:20] Okay. So far I have been talking about the temporal stream and the architectures we can use: 3D CNNs, two-stream neural networks, spatio-temporal self-attention. We have already talked about some tools to do that. In the final 10 minutes, and I hope to finish in time, let's revisit the example we started with today. I showed you a video, right? But that's still maybe not the full picture. [01:03:10] Looking at a video, doing video understanding, there is another very important dimension that we have never covered until now. That is this.
[01:03:18] There's sound, there's audio; there are other modalities in videos. If we miss that ingredient, we lose a lot of the fun: there are emotions you can perceive and other interactions you can handle if you combine the visual and the audio. So with this audio in mind, alongside the vision stream, people have proposed many other interesting tasks, and we have explored other tasks, for video understanding.
[01:03:42] Here's another example: in videos there may be multiple objects and multiple speakers, and one example task, which I have personally explored in the past, is visually guided audio source separation. Trying to process things both visually and acoustically, you can use the visual information to guide source separation: originally there is a mixture, and you want to use the visual information to separate it into its sound components. This is called visually guided source separation. To give you an example of this task: here is a speech mixture. Maybe we want to hear the sound of each person individually, right?
[01:04:21] Then we can process their visual and audio information together to separate their sounds. Here is what we can do: we can separate the voice of each speaker. And not only can we do this for people, for speech, where we process the audio and the visual streams; we can also do this for other types of sound, like musical instruments. Here's another example: we can even do musical instrument separation, by analyzing the motion and the object-centric information together with the audio stream to do the separation. Yeah.
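The core signal-processing step behind this kind of separation is predicting per-source time-frequency masks and applying them to the mixture spectrogram; in the visually guided version, a network conditioned on video features (lip motion, playing gestures) predicts the masks. Here is a minimal NumPy sketch that uses oracle "ideal ratio" masks in place of that learned predictor; all names are illustrative:

```python
import numpy as np

def ideal_ratio_masks(source_specs, eps=1e-8):
    """source_specs: (S, F, T) magnitude spectrograms of the S sources.
    Returns per-source masks in [0, 1] that sum to ~1 at each bin."""
    return source_specs / (source_specs.sum(axis=0, keepdims=True) + eps)

def apply_masks(mixture_spec, masks):
    """Estimate each source by pointwise-multiplying the mixture by its
    mask. A visually guided system would predict `masks` from video
    features rather than from the oracle above."""
    return masks * mixture_spec[None]

rng = np.random.default_rng(0)
sources = rng.random((2, 64, 100))     # two sources' magnitude spectrograms
mixture = sources.sum(axis=0)
estimates = apply_masks(mixture, ideal_ratio_masks(sources))
```

With oracle masks the estimates recover the sources almost exactly; the learning problem is entirely in predicting good masks from the audio-visual input.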
[01:04:50] So this is another example of this task. And once we introduce this new modality of audio, even when we just want to do video understanding and classification, audio can be a useful cue. Indeed, there is other audio-visual video understanding work, proposed from transformer, attention-based models: not only do we map images to patches, we also map the audio spectrograms to patches and use transformer architectures to do the classification. Or we can even take a masked-autoencoder style approach, predicting the patches for both the images and the spectrograms, to do video understanding. [01:05:29] Another aspect people have been exploring is how to do efficient video understanding, and I will just quickly give some examples.
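The shared "map it to patches" step for images and spectrograms is essentially a reshape; here is an illustrative NumPy version (the function name is mine):

```python
import numpy as np

def to_patches(x, p):
    """Split a 2D array of shape (H, W), an image channel or an audio
    spectrogram (frequency by time), into non-overlapping p x p patches,
    each flattened into a token of length p*p, as a ViT-style tokenizer
    does before the transformer consumes them."""
    h, w = x.shape
    assert h % p == 0 and w % p == 0
    x = x.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3)
    return x.reshape(-1, p * p)

spec = np.arange(64.0).reshape(8, 8)   # toy 8x8 "spectrogram"
tokens = to_patches(spec, p=4)         # 4 tokens, each of length 16
```

Because the same operation works on either modality, image patches and spectrogram patches can be embedded into a common token sequence for a joint transformer.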
[01:05:39] Throughout this class we have mostly focused on clip-level classification: given a clip, how to classify it. After we classify a lot of clips, we aggregate the information to get a video-level prediction; that's action recognition in long videos. Why do we want efficient video understanding? Because videos are very long; we cannot afford to process every clip one by one. So people try to increase the efficiency for a single clip, for example X3D tries to build better 3D convolutional networks, and there are also salient-clip samplers that try to predict which clips are the most salient and useful, so you can run your clip classifier only on those important clips and combine their predictions.
[01:06:30] People have also tried policy learning, predicting which modality to use for the action classification: we can select whether to use video, how many video clips, or whether to use audio or other sensory data. Here's one example: we can use audio as a preview mechanism to predict where the important moments are, and then use that as a guiding cue for which clips to process, averaging the results. So that's efficient video understanding, which is also one area of research.
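The preview idea, using a cheap signal to decide which clips deserve the expensive classifier, can be sketched as follows. This is a hypothetical toy, not any specific paper's method, and the clip logits are precomputed here purely for illustration:

```python
import numpy as np

def preview_then_classify(saliency, clip_logits, k):
    """Rank clips by a cheap per-clip saliency score (e.g. from the audio
    track), keep only the top-k clips, and average their classifier
    outputs for the video-level prediction."""
    top = np.argsort(saliency)[-k:]     # indices of the k most salient clips
    return clip_logits[top].mean(axis=0)

saliency = np.array([0.1, 0.9, 0.3, 0.8])
clip_logits = np.array([[0.0, 1.0],
                        [2.0, 0.0],
                        [1.0, 1.0],
                        [4.0, 0.0]])
video_pred = preview_then_classify(saliency, clip_logits, k=2)
```

In a real system the saving comes from never running the heavy video model on the low-saliency clips at all.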
[01:07:12] And nowadays people are moving to VR and AR, to smart glasses, and I'm guessing that in the future there will be a lot of egocentric video streams; that's another aspect of video understanding. Not only do you have egocentric videos, you also have multi-microphone arrays and multi-channel audio. So how to do better video understanding from these multimodal egocentric video streams is also a hot topic. We have explored this: we can process the video streams, the multi-channel audio, and the visual information to predict who is speaking to whom and who is listening to whom. Imagine that in the future you wear these smart glasses and want to use them to help you understand these different types of
social interactions. [01:07:58] So that's egocentric video understanding. And my final slide: with LLMs right now, there is also a lot of ongoing work trying to build video-level foundation models. How do we connect video understanding to LLMs? Indeed, there are works that tokenize the videos and map them into the LLM embedding space, so you can prompt the video foundation model, asking where the person is or what the person is doing in the video, and it outputs text describing the video. So there are many works trying to connect video understanding and LLMs; that's also a hot topic right now. ================================================================================ LECTURE 011 ================================================================================ Stanford CS231N | Spring 2025 | Lecture 11: Large Scale Distributed Training Source: https://www.youtube.com/watch?v=9MvD-XsowsE --- Transcript [00:00:05] Welcome back to CS231N, lecture 11.
[00:00:08] Today we're going to talk about large-scale distributed training. This is a pretty exciting topic, because it is basically how all neural networks get trained in practice today. When you look at large models from startups, from industry, even from academia, large scale is the new norm in deep learning nowadays. That's something that has changed quite a lot in the 10 years since we started this class. Ten years ago it was actually pretty common to train models basically on one GPU, one device, and it was fairly uncommon to train on multiple devices. But as we'll see, nowadays the new norm is to train models on tens, hundreds, thousands, even tens of thousands of devices concurrently, so we need to develop new algorithms and new ways of thinking in order to do that.
[00:00:50] As a bit of a running example through today's lecture, we're going to be talking a lot about Llama 3 405B. Not because it is the best model or the most interesting model, but because it is a fairly close-to-state-of-the-art model that actually shares a lot of the implementation details of how it was trained, the model architecture, everything like that. There are a lot of really amazing, powerful models that have been trained in the last couple of years, from Google, from OpenAI, from Anthropic, from others, but basically they don't share any details whatsoever about their models anymore. There's a very famous quote that marked a sea change in the industry to me, in the GPT-4 paper back in 2023.
[00:01:24] When they released the GPT-4 model, they said: given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report (meaning the paper they wrote about GPT-4) contains no further details about the architecture, including model size, hardware, training compute, dataset construction, training method, or similar. And that has basically been the state of the art for large-scale models in the three years since GPT-4: they don't tell you anything about anything. They'll tell you nothing about the model; you'll be lucky if they tell you it's a transformer. They might tell you that much. [00:01:57] So Llama 3 is notable not because it's the best model out there, but because it's one of the most open models out there.
[00:02:03] This is a large language model that was trained by Meta and released open source about a year ago, in April 2024. And unlike OpenAI, the paper actually does share a lot of the details about the model training: not too much about the dataset, but a lot about the system infrastructure that was used to train it. So this gives us a peek into how large-scale LLMs are actually trained these days. And by the way, there is a new Llama 4 model that just came out from Meta last month, in April 2025, so there are slightly better models out there in open source already. But there's no paper on Llama 4 yet, so I'm excited to read that one, hopefully when it comes out in a couple of months, and see what we can learn from the new generation of Llama training.
[00:02:44] As a running example through today's lecture, we'll be pointing at a lot of examples from the Llama 3 405B model, for this reason. [00:02:51] Okay. So there are basically two things that I want to talk about today. One is a bit about GPU hardware, and the other is how to train on lots of GPUs. I want to give you a sense both of what the hardware that these things execute on actually is, and of the algorithms that we need to use in order to train on a lot of them. So first we're going to talk a little bit about GPU hardware. A GPU, for those of you who don't know, is a graphics processing unit. These were specialized co-processors originally developed for computer graphics, and they turned out to become very useful, generalizable parallel processors.
[00:03:23] It's actually very fitting to be giving this lecture in this room, because this is the Huang auditorium. Jensen Huang is the CEO and founder of Nvidia, which is sort of the biggest company right now, and has been for the last couple of decades, in producing GPUs both for gaming and for ML. These things started off basically for graphics, because if you think about it, when you're doing computer graphics you need to generate a lot of pixels on the screen, and you need to process lots of little pieces of primitive geometry to produce those pixels. So it's very natural to do a lot of computation in parallel when you're doing computer graphics. People quickly figured out that this hardware, which had been built and intended for use in computer graphics, could actually be used for much more general pieces of parallel computation as well.
[00:04:05] So in the early days, in the early 2000s, researchers figured out how they could contort these graphics cards into doing generalizable parallel programming. And then moving toward the end of the 2000s and into the 2010s, Nvidia really picked this up: they developed these things, marketed them, and built them with the intention of their being generalized parallel processors. They didn't quite know at the time what exactly they were going to be used for; I think they had this general idea that parallel processing was going to be important, and they really capitalized on deep learning when it started to take off in the early 2010s.
[00:04:38] Much to Nvidia's credit, I think they realized the potential of this research area very early, even in the early 2010s, and started putting a ton of resources into making sure that their hardware was really useful for deep learning training. And it's basically been the main way that people train large-scale deep learning models for more than a decade now. That's starting to change, as we'll see a little bit, but their chips are kind of the main ones that people use. I always like looking inside these things and seeing what's in them. So this is a picture of the Nvidia H100, which is sort of the mainstay of deep learning training right now. There's a next generation that just came out, but it's not really accessible yet; I haven't trained anything on it yet.
[00:05:22] So this is kind of the state of the art right now. Inside this H100 GPU, in the middle here are the compute cores, and surrounding that are 80 GB of HBM memory, high-bandwidth memory. You can see the memory is separated from the compute cores; they need to talk to each other over this bus to move data back and forth from the GPU memory into the cores. And it can do that at a speed of about 3 terabytes per second, which is a lot of bits moving around. Now, if we dive deeper inside the GPU cores, we see that in the middle, in that compute core part, we've got a smaller piece of memory, about 50 megabytes of L2 cache. That is much, much smaller than the 80 GB of HBM memory, but it's very, very close to the actual computing elements, so it can be accessed much more quickly from the compute cores.
[00:06:05] And then the real heart of the thing are these 132 streaming multiprocessors, or SMs. These are kind of like independent parallel cores. They're a little bit more powerful in some ways than a typical CPU core, because they can do a lot more parallelism, but they're also a lot weaker than a typical CPU core in a lot of ways, because they tend to have slower clock speeds and they can't do as much instruction prediction, as much branch prediction. So it's really hard to make exact apples-to-apples comparisons between these GPU cores and CPU cores, but I usually think of these streaming multiprocessors as roughly akin to a CPU core. Also, I know someone's going to go back home and actually count all the little boxes on this screen, and you'll see that there are actually 144 of them, when I've said there are only 132. Why is that?
[00:06:48] It's because all GPU hardware uses a process called binning. These chips have so many transistors, so many little computing elements, that no matter how much money they pour into the process, they just don't come out perfectly; some of them always end up a little bit messed up. So they plan for that in the development of their products, and they say: we're going to try to make a chip that in theory has 144 SMs, and none of the chips will be perfect, but we know we'll get a reasonable number of them that have at least 132 functioning. By using this process of binning, they can actually sell a much larger proportion of the chips they try to produce, by only promising that 132 of the SMs will be turned on.
[00:07:29] Then we can dive even deeper, inside one of those streaming multiprocessors, and see even more of what's going on inside these GPUs. So this is just one of the 132 active streaming multiprocessors inside an H100, and there are a couple of interesting elements in here to look at. First, we see we have 256 kilobytes of L1 cache and register files. This continues the trend of the memory hierarchy in the GPU. You thought you were learning deep learning; you're actually learning computer architecture. Sorry, it's a surprise. It turns out that that memory hierarchy is really important for deep learning and for all kinds of high-performance computing. And the general trend is that you have larger bits of memory that are farther away from the compute cores.
[00:08:10] The closer you get to the compute cores, you have smaller bits of memory that are much, much faster. If you're writing the low-level algorithms that run on these things, it's very important to be aware of this memory hierarchy and to be very diligent in passing data between its different levels; if you're writing performant GPU kernels, you spend a lot of time trying to optimize that. So just to give you a flavor of that, you can see the three levels of memory hierarchy in the H100: 256 kilobytes of L1 cache, 50 megabytes of L2 cache, and then 80 GB of HBM memory. Those are the three primary levels of memory hierarchy in the H100. Then we've also got 128 of these FP32 cores.
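Before getting into the FP32 cores, the memory-hierarchy numbers just recapped can be put side by side. A minimal sketch in Python, using only the figures quoted in the lecture (the L1/L2/HBM sizes and the roughly 3 TB/s HBM bandwidth; everything else about the device is out of scope here):

```python
# The three levels of the H100 memory hierarchy, using the lecture's figures.
KB, MB, GB = 2**10, 2**20, 2**30

hierarchy = {
    "L1 cache + registers": 256 * KB,  # per streaming multiprocessor
    "L2 cache":             50 * MB,   # shared across the SMs
    "HBM":                  80 * GB,   # off-die high-bandwidth memory
}

# HBM talks to the compute cores at roughly 3 TB/s, so even just streaming
# all 80 GB through the cores once takes on the order of tens of milliseconds.
hbm_bandwidth = 3e12                   # bytes per second, approximate
stream_time = hierarchy["HBM"] / hbm_bandwidth
print(f"{stream_time * 1e3:.1f} ms")   # ~28.6 ms
```

The jumps in capacity between levels (roughly 200x from L1 to L2, and roughly 1,600x from L2 to HBM) are exactly what kernel authors have to plan data movement around.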
[00:08:54] These are little arithmetic units that can do generalized floating-point operations. In particular, each one of these 128 FP32 cores can compute ax + b, where a, x, and b are all scalars, and it can perform that bit of computation in one clock cycle. So if you add this all up: ax + b is basically one multiply and one addition, and you've got 128 of these cores, so this whole SM can do 256 floating-point operations per SM per clock cycle of the device. Then we also see, in red, where the real magic happens. In addition to these FP32 cores, there are also these four tensor cores. I think the name is a little bit of a misnomer; these are actually matrix cores.
[00:09:38] What each of these little tensor cores does: they are special circuits that are designed to do only one thing, matrix multiply. Each one of these little tensor cores can do a single chunk of matrix multiply. In particular, I believe on the H100 the first input matrix is 16x4, the second input matrix is 4x8, and then there's a bias matrix of size 16x8. So it basically does ax + b again, where a, x, and b are now little matrix chunks of this fixed size, and it can do that one little chunk of matrix multiply once per tensor core per clock cycle. So then if you multiply all these numbers out, you see that that little matrix multiply of that particular size is 1,024 floating-point operations, where we're counting each multiply and each add as a single floating-point operation.
[00:10:29] We multiply that by the four tensor cores in the SM, and we see that the entire SM, if it's going through the tensor cores, can do 4,096 floating-point operations per SM per clock cycle. And we need to compare this with the 256 that we can get from the FP32 cores. Here we see that the tensor cores are where all the magic happens; this is where the main throughput of the device comes from. If you're writing code that wants to run on these GPUs and make maximum usage of them, you need to make maximum usage of these tensor cores. Another interesting thing about these tensor cores is that they actually operate in mixed precision. Rather than traditional floating-point numbers, which are normally 32-bit, the tensor cores tend to use a mixed-precision procedure where the inputs are usually 16-bit.
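The flop accounting above can be checked mechanically. A small sketch (the names A, B, and C for the two inputs and the bias are just labels for this illustration; the shapes and core counts are the lecture's figures):

```python
# One tensor core computes A @ B + C with A 16x4, B 4x8, C 16x8.
M, K, N = 16, 4, 8

mults = M * N * K          # one multiply per (row, col, k) triple -> 512
adds = M * N * K           # (K-1) dot-product adds + 1 bias add per output -> 512
flops_per_tensor_core = mults + adds            # 1,024 per tensor core per clock
tensor_core_flops = 4 * flops_per_tensor_core   # 4 tensor cores -> 4,096 per SM

# The FP32 path: 128 cores, each doing one a*x + b (a multiply and an add).
fp32_flops = 128 * 2                            # 256 per SM per clock

print(tensor_core_flops, fp32_flops)        # 4096 256
print(tensor_core_flops // fp32_flops)      # 16x more throughput via tensor cores
```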
[00:11:12] And there are a couple of different interesting 16-bit formats that they can use, which we can't get into today. They'll do the multiplications in the lower-precision 16-bit and then do the additions, the accumulations, in the higher-precision 32-bit. So these tensor cores take a low-precision 16-bit input, do some of the intermediate computation, and produce the outputs in a higher-precision 32-bit. And this is important, because at the PyTorch layer, if you forget to cast your model into 16-bit, it will run on the FP32 cores instead and will be 20 times slower than you expect. So this seems like a little bit of minutiae, but it becomes very tangible when you mess up those data types in your PyTorch code.
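To make the "accumulate in 32-bit" point concrete, here is a small numerical sketch using NumPy's float16 on the CPU as a stand-in for the hardware behavior (an illustration of the rounding effect, not actual tensor-core code). Above 2,048, consecutive FP16 values are 2 apart, so adding 1.0 to an FP16 accumulator sitting at 2,048 rounds straight back down:

```python
import numpy as np

# Sum 4,096 ones two ways: accumulating in float16 vs. in float32.
ones = np.ones(4096, dtype=np.float16)

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for x in ones:
    acc16 = np.float16(acc16 + x)  # pure low-precision accumulation
    acc32 += np.float32(x)         # fp16 inputs, fp32 accumulator

print(acc16)  # 2048.0  (stuck: 2048 + 1 rounds back to 2048 in fp16)
print(acc32)  # 4096.0  (exact)
```

This is exactly the failure mode the mixed-precision design avoids: the multiplies can tolerate 16-bit, but long sums cannot.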
[00:11:56] So GPUs are really fast, and it's really crazy just how much faster they've gotten over the past decade or 15 years or so. When I first started my PhD and was working on deep learning, the state-of-the-art GPU that we were all using was this K40 GPU, which was released back in 2013. And this thing could do just 5 teraflops of FP32 compute for the whole device. All right, so I should explain the graph. The x-axis is time, ranging from about 2013 up to the present day, and the y-axis is the peak throughput of each of these devices, measured in teraflops per second per device. And you can see the graph goes up a lot.
[00:12:37] But there's something salient to notice here: going from the K40 to the P100, something really amazing happened with the V100, which came out toward the end of my PhD, around 2016 or 2017. The V100 was the first device that introduced these tensor cores. And since then, more recent devices have gotten more tensor cores, bigger tensor cores, more of the device area allocated to tensor cores, and this has resulted in a gigantic increase in the throughput of these devices over the past 10 or 15 years. The most recent device is this B200, which was formally announced and is slowly rolling out now. This one in theory has about 83.3 teraflops per second of FP32 compute, and 5,000 teraflops per second, in theory, of mixed-precision compute on the tensor cores.
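Taking the two endpoints at face value (about 5 TFLOP/s for the K40 in 2013, about 5,000 TFLOP/s of tensor-core mixed-precision compute for the B200, and noting that this is not an apples-to-apples precision comparison), a quick back-of-the-envelope gives the implied growth rate:

```python
# Roughly 1,000x in ~12 years, per the lecture's endpoints.
k40_tflops, b200_tflops, years = 5.0, 5000.0, 12

ratio = b200_tflops / k40_tflops           # 1000x overall
annual_growth = ratio ** (1 / years) - 1   # compound annual rate
print(ratio, round(annual_growth, 2))      # 1000.0 0.78 -> ~78% per year
```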
[00:13:30] So if you step back, we've literally been living through a 1,000-fold increase in computation over the past 12 years, and that's just at the per-device level. So if you want one explanation of why AI has gotten so good in the last 10 years, what has happened, this is the answer: there's now a source of computation that we're taking advantage of, and it's gone up by 1,000x in the last decade. Anytime anything in the world changes by 1,000x, you should step up and pay attention, because that's going to cause major changes in our technological capabilities. And this 1,000x improvement, I think, is the major driver of improvement in deep learning over the past decade. [In response to a question:] It does not have 5,000 tensor cores; that's 5,000 teraflops of compute on the tensor cores. Okay. Yeah.
[00:14:14] So we always try to distinguish between the compute on the tensor cores versus the compute on the FP32 cores. Right, so this is already crazy, right? It's already crazy that there's been a 1,000x increase in a device that you can hold in your hands. I've held a K40 in my hands; I've not had the opportunity to hold a B100, but they feel like the same physical object. It's about the same size, about the same weight, it kind of looks the same, but the one from today is 1,000 times faster than the one from 12 years ago. That's insane. But it gets even crazier, because we don't train on one GPU, right? I said that when the K40 first came out in 2013, it actually was common to train a lot of models on just one GPU. But today, we're training not just on one GPU.
[00:14:56] We're training on thousands, tens of thousands, sometimes hundreds of thousands of GPUs, all working together to train one model. So stack that on top of this 1,000-fold increase in per-device throughput, and something truly insane has happened in the past decade. So now we've looked inside the GPU. From here I want to zoom out and put that GPU in context, not looking at individual devices but thinking about the modern GPU clusters that we build that stitch a lot of these things together. We've already seen a single H100 GPU, and here we can think of it as another level of memory hierarchy.
[00:15:32] We already saw that inside the H100 there were three layers of memory hierarchy, and as you got farther away from the compute elements, the memory bandwidth, the ability of the device to move bits around between different parts of the system, got slower. This trend actually continues once you escape the bounds of a single device and imagine these in the context of a full data center. Here we saw that a single H100 GPU gets about 3 terabytes per second of memory bandwidth; that's the GPU talking from its own HBM memory to its own compute elements, moving bits around at 3 terabytes per second. But these things typically live inside a GPU server. Almost all GPU servers have eight devices in one big box, and those GPUs can talk to each other.
Um and they typically talk to each other at [00:16:14] and they typically talk to each other at a rate of about 900 GB per second from [00:16:16] a rate of about 900 GB per second from any one GPU in the server to any other [00:16:19] any one GPU in the server to any other GPU in the server. So you can see that's [00:16:21] GPU in the server. So you can see that's like a 3x less memory communication [00:16:23] like a 3x less memory communication bandwidth compared to the GPU talking [00:16:25] bandwidth compared to the GPU talking from in inside one device. Um and here [00:16:29] from in inside one device. Um and here things here we again turn to llama 3. Um [00:16:32] things here we again turn to llama 3. Um a lot of major players don't publish a [00:16:34] a lot of major players don't publish a lot of details on their training [00:16:36] lot of details on their training clusters but the llama 3 technical [00:16:37] clusters but the llama 3 technical report did actually give a lot of [00:16:38] report did actually give a lot of details around their training clusters. [00:16:40] details around their training clusters. So from here some of the specifics [00:16:42] So from here some of the specifics probably vary a little bit from cluster [00:16:43] probably vary a little bit from cluster to cluster. Um but these are now numbers [00:16:45] to cluster. Um but these are now numbers from the llama 3 cluster that was used [00:16:47] from the llama 3 cluster that was used to train their their their models. Um, [00:16:49] to train their their their models. Um, so they given one GPU box, they stack [00:16:53] so they given one GPU box, they stack two of those box into one server rack. [00:16:55] two of those box into one server rack. 
Um, and a server rack, if you haven't [00:16:56] Um, and a server rack, if you haven't seen it, they're, you know, about 6 ft [00:16:58] seen it, they're, you know, about 6 ft tall, like about the size of a person to [00:17:00] tall, like about the size of a person to just kind of get a mental picture of one [00:17:02] just kind of get a mental picture of one of those things. So one server rack has [00:17:04] of those things. So one server rack has two servers inside of it. Total of 16 [00:17:06] two servers inside of it. Total of 16 GPUs. [00:17:07] GPUs. Then we connect a lot of server racks [00:17:09] Then we connect a lot of server racks together into a GPU pod. Um, the Llama 3 [00:17:12] together into a GPU pod. Um, the Llama 3 cluster has GPU pods that are composed [00:17:14] cluster has GPU pods that are composed of 192 racks. um and which is a total of [00:17:17] of 192 racks. um and which is a total of 3,72 GPUs. And these things have really [00:17:20] 3,72 GPUs. And these things have really high bandwidth connectors between all [00:17:22] high bandwidth connectors between all the different racks. Um and as a result [00:17:24] the different racks. Um and as a result um any pair of GPUs inside that pod can [00:17:28] um any pair of GPUs inside that pod can talk to each other at a rate of about 50 [00:17:29] talk to each other at a rate of about 50 GB per second. And now you see this is [00:17:32] GB per second. And now you see this is another sort of 20x decrease in memory [00:17:34] another sort of 20x decrease in memory traffic between what an individual um [00:17:36] traffic between what an individual um server can talk and then what any GPU [00:17:38] server can talk and then what any GPU across an entire rack can talk to each [00:17:40] across an entire rack can talk to each other. So 3072 GPUs seems like a lot of [00:17:43] other. 
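The hierarchy just described can be tabulated in a few lines. The bandwidths below are the approximate figures quoted in the lecture, not exact hardware specs:

```python
# Approximate communication bandwidth at each level of the hierarchy,
# in GB/s, as quoted in the lecture; real hardware figures vary.
hierarchy = [
    ("HBM -> compute, inside one H100",      3000),  # ~3 TB/s
    ("GPU <-> GPU, inside one 8-GPU server",  900),  # NVLink-class links
    ("GPU <-> GPU, across a 3,072-GPU pod",    50),
]

# Each step outward costs a large constant factor in bandwidth.
for (fast_name, fast_bw), (slow_name, slow_bw) in zip(hierarchy, hierarchy[1:]):
    print(f"{fast_name} -> {slow_name}: ~{fast_bw / slow_bw:.0f}x slower")
```

The ratios come out to roughly 3x and roughly 20x, matching the decreases described above.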
[00:17:43] So 3,072 GPUs seems like a lot of compute, but it's nowhere near enough. So we stack those GPU pods together into a full GPU cluster. This is the full GPU cluster that Meta built to train their Llama 3 models: it combines eight GPU pods for a total of 24,576 GPUs. I could not find exact numbers on the memory traffic between pods, but it's definitely less than 50 GB per second. And by the way, this is not the largest GPU cluster in the world by a long shot; it's just the biggest one that I could quickly find precise numbers on. There are definitely GPU clusters out there that are 50,000 GPUs, 100,000 GPUs; they exist and people train models on them. And the way this works is it scales out naturally.
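The cluster arithmetic above is easy to check; a minimal sketch using the counts from the Llama 3 report:

```python
# GPU counts at each level of the Llama 3 training cluster, per the lecture.
gpus_per_server  = 8
servers_per_rack = 2
racks_per_pod    = 192
pods_per_cluster = 8

gpus_per_rack    = gpus_per_server * servers_per_rack   # 16
gpus_per_pod     = gpus_per_rack * racks_per_pod        # 3,072
gpus_per_cluster = gpus_per_pod * pods_per_cluster      # 24,576

print(gpus_per_rack, gpus_per_pod, gpus_per_cluster)    # 16 3072 24576
```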
[00:18:27] You would just cluster more pods together to create a bigger cluster, or you might have another level of hierarchy, where a super pod connects to other super pods to get you another level up.

[00:18:38] How long do they train with a GPU cluster like that? I don't remember offhand for the Llama 3 models, but a rule of thumb for the past decade is that the longest training runs people do are usually on the order of months. And I think that has less to do with technology and more to do with people: when it comes to making plans, making progress, and having people work on things, it's very difficult to have training runs that are very, very long. So the longest training runs, the biggest state-of-the-art models, are typically measured in months. I would not be surprised if the very largest models, the GPT-4.5s, the GPT-5s, are pushing closer to a year at this point. But it's pretty common to see training runs on the order of a couple of months on these really big clusters.

[00:19:22] The question is, why do you organize servers into a rack rather than directly into a pod? You've got to put them somewhere, right? There are physical constraints on these things. Server racks have been a standard unit in data centers for decades at this point. When GPUs came onto the scene, they gave you a different kind of server: physically much bigger, drawing a lot more power. But you can't redesign the whole data center from scratch overnight. So the server rack has remained a standard unit, with standard hardware sizes and everything, that data centers are typically built around.

[00:19:53] How much physical space does a cluster typically take? Oh, that's a great question. Think of a single server rack as being around 6 to 8 feet tall, maybe about the size of this podium and about as tall as me. Then you've got 192 racks in a pod, so imagine about 200 of these podiums; how big would that be? Then multiply that by eight. And that's actually a bit of an underestimate, because you typically organize these things in rows so people can actually walk between them.
[00:20:29] And there's more hardware you need to pack into the cluster than just the compute racks. In addition to the compute racks that hold the physical GPU servers, there will be dedicated racks that only hold networking hardware, because a lot of bits need to fly around between all these devices. There will also be dedicated racks that only hold storage hardware, because you need to store the training data somewhere and get it into your devices. So these things can take up quite a lot of space.

[00:20:54] Another question: when you go to these big clusters, do the smaller units of compute maintain their higher throughput? Yes, they do, and that's part of the secret and the challenge of designing for these systems: you ideally want to take advantage of the fast communication when you can get it, but also fall back gracefully to the slower communication across the larger units as you scale up.

[00:21:11] How hot does it get? Pretty hot. If any of you is a gamer and has a 4090 or 5090 GPU in your desktop at home, a single 4090, if you're playing games, will heat up your room, make you want to open the window; it will make the room physically warmer. So imagine, if that's what a single gaming GPU does to an average-sized room, what happens when you stack tens of thousands of them in a big data center: there are some serious cooling requirements for these things. The cooling gets crazy. A gaming desktop will typically be air cooled, sometimes water cooled, and you can design different cooling systems and go nuts on the hardware here to try to optimize all this stuff.

[00:21:54] All right, I think this stuff is super cool: imagining that these GPUs are not just mythical creatures floating around in the cloud. These are actual physical atoms that someone built and stacked up in a room somewhere, and it's really interesting to imagine what they look like.
[00:22:10] So one kind of mindset shift when we move to these big GPU clusters is to think not so much about the individual devices or the individual servers; I basically try to think of the entire data center as one big computer. And this big computer, in this case, has 24,576 GPUs, 1.8 petabytes of HBM memory on the GPUs, 415 million FP32 cores, and 13 million tensor cores, and the whole thing can do 24 exaflops of compute per second. That's 24 × 10^18 floating-point operations per second. That's a lot of flops. But I guarantee you that five years from today it will not feel like a lot of flops, which is the even crazier part. And our goal here is to think of this entire block of 24,576 GPUs as one giant supercomputer.
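As a rough sanity check, those aggregate totals follow from per-device specs. The per-GPU numbers below are my assumptions from NVIDIA's public H100 SXM datasheet, not figures given in the lecture:

```python
# Build up the "one big computer" totals from assumed per-H100 specs.
n_gpus = 24_576
fp32_cores_per_gpu = 16_896    # CUDA cores per H100 (datasheet value)
tensor_cores_per_gpu = 528     # tensor cores per H100
hbm_gb_per_gpu = 80            # HBM capacity per H100, in GB
tflops_per_gpu = 989           # ~dense BF16 tensor-core TFLOP/s per H100

total_fp32_cores = n_gpus * fp32_cores_per_gpu        # ~415 million
total_tensor_cores = n_gpus * tensor_cores_per_gpu    # ~13 million
total_hbm_pb = n_gpus * hbm_gb_per_gpu / 1e6          # ~2.0 PB (lecture quotes ~1.8)
total_exaflops = n_gpus * tflops_per_gpu / 1e6        # ~24 EFLOP/s

print(f"{total_fp32_cores / 1e6:.0f}M FP32 cores, "
      f"{total_tensor_cores / 1e6:.1f}M tensor cores, "
      f"{total_hbm_pb:.2f} PB HBM, {total_exaflops:.0f} exaflops")
```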
[00:22:56] And then the question is: how can we train one neural network for months at a time on this one giant supercomputer, a really gigantic neural network that's really powerful and can soak up tons and tons of data? That's basically the question and the paradigm that we've moved to in deep learning.

[00:23:10] And by the way, I keep saying GPU, I keep saying Nvidia, because they are the most dominant training architecture and hardware today, but there are some others that have sprung up. The biggest competitor to Nvidia training hardware right now, I think, is Google. Google has their own hardware called tensor processing units, TPUs, and these are really good; they've gone through six generations already. These are the stats of the v5p TPU, which you can rent on Google Cloud today, and it's roughly the same order of magnitude, kind of similar specs, as the H100 that we just talked about. There are some interesting design decisions in the TPU that are quite different from the GPUs, which I find fascinating, but we just don't have time to get into them today.

[00:23:52] And someone was asking how big these things are. This is an actual picture. Just like GPUs, TPUs are arranged into pods, and v5p TPUs can be arranged in pods of up to 8,960 chips. This is a picture of a v2 TPU pod, which has only 256 chips, so that gives you a sense of scale: you can see four racks here, each maybe a little taller than me, side by side, for 256 TPU chips. Now imagine how much bigger this gets in the more recent pods with up to almost 9,000 chips.

[00:24:27] Yes, Google's Gemini models are almost certainly trained on TPUs. Of course they don't tell you, but I would be astounded, absolutely astounded, if they were not. And like I said, the TPUs are actually very good; I assume that most large-scale Google models are trained on these things, and those are very competitive models. So this is really good training hardware. The difference from Nvidia is that you can't buy it: the only way you can access TPUs is either by working at Google or by renting them on Google Cloud. It is very good hardware and a lot of people are making use of it, but I think it's still a little bit less popular today than Nvidia GPUs.
[00:25:04] And of course other companies obviously know that this is a very important thing, so there are a lot of other companies trying to build competitive training hardware. But my honest assessment right now is that Nvidia GPUs and Google TPUs are the two big ones; they're way ahead of everyone else today in terms of usability, performance, and market share, though there are a lot of others trying to catch up. Two notable ones: AMD, which has been sort of the second major GPU manufacturer for many decades, has a training accelerator called the MI325X. On paper it actually has really good stats, pretty comparable to an H100, but it just has not had the same impact as the H100 right now. AWS also has their own training chip that they've developed, called Trainium. I don't know too much about this one; I've never tried to use it myself, but I know that Anthropic uses it for some of their training. I don't know to what extent their training is entirely Trainium versus GPUs. So we should expect to see more, but today I think Nvidia GPUs are probably the most dominant, and Google TPUs are right there, really good as well, but probably not quite as widely used.

[00:26:10] Okay, so that's basically part one: what GPUs are and how we arrange them into clusters, just to give you a sense of the physicality of these machines we're building and training on. Then the second question is: how do we actually write algorithms that can make use of this giant GPU cluster with tens of thousands of GPUs?
[00:26:27] It's going to require us to develop new algorithms, new ways of thinking about our compute, and new ways of parallelizing and splitting up our neural networks. So the basic strategy here is going to be to split up your computation. These clusters are giant parallel devices: we saw they have a lot of GPUs, a lot of CPU cores, a lot of GPU cores that can all operate independently, and they can't talk to each other too much. If you think about what a computer really does at a high level, it basically does two things. It does computation, which is taking input bits and computing new output bits from them. And it does communication, which is taking bits and moving them from memory in one place to memory in some other place.

[00:27:04] And the whole trick is: how do we make use of all these multiple scales of memory hierarchy across the entire cluster to overlap the communication with the computation, and also split up and parallelize the computation, so that in the process of training a giant neural network we have useful work for all of those tens of thousands of individual GPUs, all of those millions of individual compute elements, to be doing in parallel, and then get them to communicate their work to each other in a way that adds up to training one giant neural network on this giant cluster?

[00:27:38] To that end, one way I like to think about it is that there are basically five degrees of parallelism that people exploit when training large-scale neural networks today. A lot of this is specific to transformers, because those are the dominant architecture that people are using for large-scale training. A transformer is basically a stack of L layers, and each of those L layers operates on a three-dimensional tensor: one dimension is the mini-batch dimension (we've got a bunch of sequences all operating in a mini-batch), one is the sequence dimension (we're operating on sequences or sets of tokens), and one is the dim dimension (each of those tokens is itself a vector with some dimension). So our transformers operate on these three-dimensional tensors through a stack of layers, which gives us four axes to parallelize on. We can parallelize on the layers axis, which is pipeline parallelism.
[00:28:35] We can parallelize on the batch dimension, which is data parallelism. We can split on the sequence dimension, which is called context parallelism. And we can split on that dim dimension, which is called tensor parallelism. So all of these have kind of funny names, but if you think about it in this way, they're basically all different ways of splitting up your computation across these four axes of compute inside your transformer. And then we're going to step through each one of these in more detail, because there are a lot of interesting nuances with all of these different mechanisms of distributed training. So the first one is data parallelism, or DP. And the basic idea here is kind of simple. Remember, when we're training neural networks, we're always operating on mini-batches of samples, right? Like we're always taking a mini-batch of elements.
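The four tensor axes named above can be made concrete with a small sketch. This is illustrative only (the sizes and variable names are made up, not from the lecture): each parallelism strategy is just a choice of which axis of the computation to split across m devices.

```python
import numpy as np

# Illustrative sizes (made up): batch B, sequence length S, model dim D,
# and L layers in the transformer stack; m is the number of devices.
B, S, D, L = 8, 16, 32, 4
m = 4
x = np.zeros((B, S, D))   # the 3-D activation tensor each layer operates on

# Data parallelism: split the mini-batch axis across devices.
dp = np.split(x, m, axis=0)          # m shards of shape (B/m, S, D)
# Context parallelism: split the sequence axis.
cp = np.split(x, m, axis=1)          # m shards of shape (B, S/m, D)
# Tensor parallelism: split the model-dim axis.
tp = np.split(x, m, axis=2)          # m shards of shape (B, S, D/m)
# Pipeline parallelism: split the stack of layers itself, not the tensor.
pp = np.array_split(np.arange(L), m) # each device owns a contiguous stage
```

The funny names all reduce to "pick an axis": only pipeline parallelism splits the layer stack rather than the activation tensor.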
[00:29:14] We're computing a loss for every entry in our mini-batch, depending on whatever our training task is. Then we compute a gradient, where the gradient is typically an average of the gradients of the losses for the individual elements in the mini-batch. So in most neural network architectures, computing the loss and then computing the gradient is independent for each of the elements in the mini-batch. This is something that seems trivially parallelizable. So the basic idea is: if you can fit a mini-batch of n examples on a single GPU, and you have access to m GPUs, then we're going to train our model with a giant mini-batch of m*n examples, where we split up that giant mini-batch into smaller mini-batches of n samples that go on each GPU.
[00:29:59] And if you think about why this makes sense mathematically, it's because gradients are linear. So in practice, you're computing a single scalar loss L, which is going to be the average of some individual losses computed on each of the x_i, where these x_i are all the entries across your entire macro-batch, I guess we'll call it, and the W are the weight matrices of the entire network. Typically the loss that you're computing at the end of the forward pass is an average of the losses on each of the individual mini-batch elements. And then if you take the gradient of the loss with respect to the weights of the network, which is the thing we need to compute in order to make a weight update, that is actually going to split.
[00:30:38] Because gradients are linear, you get to choose in what order to do the sum, the gradient, and the averaging. In particular, it becomes convenient to arrange the gradient in this particular formulation, where there's an inner term that we've highlighted in blue, which is basically a normal forward-backward pass on n elements, and these can be computed in parallel on different GPUs; and then there's an outer sum, where we need to take an average of the gradients across the m different devices that we're operating on. So that's what's happening from a mathematical perspective, and we see that this is perfectly mathematically sound. It is basically exactly the same, mathematically, as training on a single device.
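The linearity argument can be checked numerically. A minimal sketch (toy linear model and made-up sizes, not the lecture's notation): the gradient of the mean loss over the whole macro-batch equals the average of the per-shard gradients, because differentiation commutes with averaging.

```python
import numpy as np

# Toy setup: m "GPUs" each holding a mini-batch of n examples of a linear
# model with squared-error loss.  All names and sizes here are illustrative.
rng = np.random.default_rng(0)
m, n, d = 3, 4, 5
X = rng.normal(size=(m * n, d))   # the whole macro-batch of m*n examples
t = rng.normal(size=m * n)        # targets
w = rng.normal(size=d)            # shared model weights

def grad(Xb, tb, w):
    """Gradient of 0.5 * mean((Xb @ w - tb)**2) with respect to w."""
    return Xb.T @ (Xb @ w - tb) / len(tb)

# Single-device: one gradient over the full macro-batch.
g_full = grad(X, t, w)

# Data-parallel: each "GPU" computes the gradient of its local shard...
local = [grad(X[k*n:(k+1)*n], t[k*n:(k+1)*n], w) for k in range(m)]
# ...and the all-reduce averages them.  Linearity of the gradient makes the
# average of the shard gradients equal the full-batch gradient exactly.
g_dp = np.mean(local, axis=0)
```

Note this only works as an exact identity because the shards are equal-sized and the loss is an average; that is the algebraic reordering the lecture describes.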
[00:31:20] We've just been clever with our algebra and changed the order of doing our averages and our summations. But this is not an approximation. This is exactly the same computation as we would have done on a single larger GPU. So here's what this looks like from the GPU perspective: we have m GPUs. Here I'm showing m = 3, because that's all that can sensibly fit on the slide, but think of this as much larger than three in practice. Each one of those GPUs maintains its own separate copy of the neural network weights, of the optimizer state, and of the gradients. Then each GPU will load, in parallel, a different mini-batch of data. Here we're showing each GPU loading a mini-batch of three elements. And crucially, the different GPUs need to load different mini-batches of data.
[00:32:05] I've had bugs in my code, and in students' code, where they accidentally load the same mini-batch on all the GPUs. That's not going to help you; that's not going to be good. Don't make that mistake. It's crucially important that your different GPUs actually load different mini-batches of data. Then each GPU will independently do its own forward pass on its own mini-batch of data to compute its own local loss on its own local mini-batch. These can all operate totally independently; it does not require any communication between GPUs. Then each network will do its own backward pass to compute the gradient of its own local loss with respect to all the weights of the model. And again, this can happen totally independently, because each GPU, remember, has its own independent copy of the model weights.
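One way to guarantee the "different mini-batches" property is strided index assignment per rank. This pure-Python sketch (illustrative names; the same basic idea as PyTorch's DistributedSampler) makes each rank's shard disjoint by construction, which rules out the same-batch-everywhere bug.

```python
# Strided per-rank sharding: rank r takes samples r, r + world_size, ...
# Shards are disjoint by construction, so no two ranks can ever load the
# same example in the same step.
def shard_indices(num_samples, world_size, rank):
    """Indices of the samples that `rank` is responsible for."""
    return list(range(rank, num_samples, world_size))

world_size = 3
shards = [shard_indices(12, world_size, r) for r in range(world_size)]
# rank 0 gets [0, 3, 6, 9], rank 1 gets [1, 4, 7, 10], rank 2 gets [2, 5, 8, 11]
```

In real training you would also shuffle with a per-epoch seed shared by all ranks before striding, so every rank permutes the dataset identically and the shards stay disjoint.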
[00:32:45] It can do its own forward and backward pass completely independently. But now, after the backward pass is done, this is where things get tricky. Remember, we said we needed to compute an average of those gradients across all the devices that are participating in our training. So we need communication. This is where we do an all-reduce operation: every GPU needs to send its gradients to all the other GPUs. There are sort of two things happening simultaneously: one, each GPU needs to broadcast its gradients to all the GPUs, and two, each GPU needs to collect the gradients from all the GPUs that are participating in the training. So this is an all-reduce operation, and it typically happens in sort of logarithmic time in the number of GPUs.
[00:33:31] At the end of this all-reduce operation, each GPU now has an average of all the gradients across all the devices. So at this point the communication has happened: each GPU now has an identical copy of the gradients that have been all-reduced across all the devices. Now, at the beginning of the training iteration we assumed that each GPU had its own independent copy of the model weights. At this point each GPU has its own independent but identical copy of the gradients across the entire macro-batch of data. So now each GPU can make a weight update on its own local copy of the weights. And because they started with the same weights and they applied the same gradient, they're going to have the same weights after the local weight update, assuming the arithmetic was deterministic.
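Functionally, an all-reduce with an averaging op leaves every rank holding the same mean of all ranks' inputs. A toy single-process sketch of that contract (real backends such as NCCL's ring all-reduce produce the same result with bandwidth-efficient communication; this only models the outcome):

```python
import numpy as np

def all_reduce_mean(rank_tensors):
    """Toy all-reduce: every rank ends up with the mean over all ranks.
    This models only the result, not the communication pattern."""
    avg = np.mean(rank_tensors, axis=0)
    return [avg.copy() for _ in rank_tensors]   # one identical copy per rank

rng = np.random.default_rng(1)
local_grads = [rng.normal(size=4) for _ in range(3)]  # 3 ranks' local grads
reduced = all_reduce_mean(local_grads)
# After the all-reduce, every rank holds the identical averaged gradient,
# so every rank's subsequent weight update produces identical weights.
```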
[00:34:18] And also, by the way, and this is really important, steps four and five can actually happen in parallel. There are two things here that can happen in parallel: one is the backward pass, where each GPU computes its own gradients, and the other is the communication of the gradients across the GPUs, and in practice these will typically happen simultaneously. So that means each model will start off doing the backward pass over the last layer in the network and compute its own local gradient. Then the model will move its compute on to the backward pass for the second-to-last layer of the model.
[00:34:52] And while the compute elements are busy computing the backward pass on the second-to-last layer, the GPUs will simultaneously be doing an all-reduce of the gradients of the last layer. So these things kind of chunk along, communication for layer L+1 and backward pass for layer L, and they can just proceed in parallel, so that hopefully, by the time we've gotten to the end of the network and the backward pass is done, the gradients have already been all-reduced across all the devices, and we can make our weight update all at once without waiting. This is really important because, like we said, the communication is relatively slow. So the whole trick in these things is figuring out ways to hide the communication costs and do them at the same time as the compute. The question is: is step four or five going to be the bottleneck?
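A back-of-envelope cost model (all numbers invented for illustration) shows why the overlap matters: done sequentially you pay compute plus communication for every layer, while overlapped you pay roughly the larger of the two streams plus one trailing communication chunk.

```python
# Toy cost model (numbers invented): backward compute for one layer takes
# c time units and all-reducing that layer's gradient takes r units, over
# L layers of the network.
L, c, r = 8, 2.0, 1.5

# Sequential: finish the whole backward pass, then do all the communication.
sequential = L * c + L * r

# Overlapped: a layer's all-reduce starts as soon as its backward finishes
# and runs concurrently with the remaining backward compute (assuming one
# transfer in flight at a time).
comm_done = 0.0
for layer in range(L):
    backward_done = (layer + 1) * c          # when this layer's grad is ready
    comm_done = max(comm_done, backward_done) + r
overlapped = max(L * c, comm_done)           # weight update waits for both
```

With these made-up numbers the overlapped schedule finishes in 17.5 units versus 28 sequentially; the communication is almost fully hidden behind compute, except for the final layer's all-reduce, which nothing can overlap.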
[00:35:32] And the answer is: it depends. It depends entirely on how fast your device is, how big your model is, how big your mini-batch is, how fast the interconnect between the devices is. When you get to this low level of distributed training, the answer is always that it depends on your situation, and you need to benchmark for your situation. Ah, why not take m different gradient steps, one on each of them? That's actually a really cool idea. There actually was a popular set of algorithms that people used a while back, called asynchronous SGD, where they would basically do that: have a bunch of different model replicas all take a bunch of independent model steps, and then try to average them every once in a while. And those were popular; Google actually used to do this before they developed the TPU pods.
[00:36:13] Some of their earlier networks in the early 2010s were trained in this way. But one, it tends to just be a lot more unstable; and two, it's very hard to debug and reproduce. So it just tends to work a little bit worse. It does feel like a more scalable approach, but in practice, if you can do everything synchronously, then your algorithms are easier to debug, easier to understand, easier to reason about. If you can get away with synchronous gradient updates, it's probably going to work better. But actually, I would personally not be too surprised if we see a resurgence of async SGD methods at some point in the next couple of years, because I think they are a lot more friendly to distributed training. There's no one computer that can orchestrate all this stuff.
[00:36:51] All these things are independent devices with their own independent stuff. There's no driver that can take a god's-eye view and take those steps. All that computation has to happen somewhere. Ah, great question: as you're overlapping communication and compute, do you need to write code for this, or does the hardware do it automatically? You definitely have to write code for this. The hardware is not smart enough to understand what you want to do. The hardware, like we said, sort of understands these little matrix-multiply chunks; it understands pretty low-level stuff. Anything that you want to do to schedule that communication, you need to take care of in software. But thankfully, for a lot of these common use cases, PyTorch ships with it for you.
[00:37:28] So, for example, in this case there's a PyTorch class called DistributedDataParallel that will do this for you, and make this happen relatively transparently on top of otherwise straightforward PyTorch code that you've written. Although, actually, it is really interesting to contrast that with the individual devices, because if you're programming an individual GPU in CUDA, which is Nvidia's language for programming GPUs, then the hardware actually does take care of a lot of this async transfer for you automatically. But at the cluster level it typically doesn't; there you typically need to do it in software.
[00:37:57] So there actually is a little bit of interesting asymmetry here between parallelism at the individual-device level, where a lot of that does happen automatically in hardware, versus at the cluster level, where it needs to be orchestrated in software. Yeah. So typically these are heterogeneous systems, where different parts of the system are written in different programming languages. There are going to be low-level device kernels, which are the code that actually executes inside the GPU, and those are typically written in CUDA, which is a C-like language that is Nvidia's language for programming their own GPUs. But then those individual GPU kernels get wrapped up, and you can call those GPU kernels from Python. And this is basically how PyTorch works.
[00:38:34] PyTorch is sort of a collection of a lot of GPU kernels that can do lots of interesting stuff on the GPU, plus a lot of C++ and Python code that wraps around those GPU kernels and makes it more user-friendly to program. So in this picture, each GPU is computing its own gradients, in black, by itself, and then the gradients in red get computed via an all-reduce across all the GPUs in parallel. Oh, the backward pass at the lower layer is dependent on the gradients from the previous layer. But crucially, each GPU is only doing the backward pass locally on its own mini-batch. So then there are basically two different variants of the gradient at each layer that you need to think about.
[00:39:09] There are the local gradients, the gradient of the local loss of my mini-batch with respect to my network weights, and there's the global gradient, which is the derivative of the total loss of the macro-batch with respect to the network weights. In order to compute a backward pass, each GPU only needs the local version of its upstream gradient, but computing the global version of the upstream gradient requires communication. So this is data parallelism, and there's actually a bit of a problem here. This is a great way to parallelize GPU computation, and it was the first way that people started parallelizing GPU computation in neural network training, but we quickly hit a bottleneck on the model size.
[00:39:47] Remember that each GPU is keeping its own independent copy of the model parameters, and this becomes a bottleneck when you want to have really big models. In particular, for each weight in your neural network you basically need to keep track of four numbers: the weight itself, the gradient of that weight, and the optimizer state. If you're using Adam, that's typically two moment estimates (the beta-1 and beta-2 accumulators) per parameter in the network. And sometimes you'll also have an exponential moving average of the model parameters as well. So typically you'll have four to five scalars that you need to keep track of for every weight in your network.
[00:40:21] And if you're training with 16-bit precision, which is pretty common these days (some of these you'll sometimes keep in higher precision, but let's take 16-bit as a lower bound), then you need two bytes for each number. Four numbers at two bytes each means eight bytes per scalar in the network to keep track of, which means that 1 billion model parameters is going to take about 8 GB of GPU memory to store all that stuff. And we said the whole GPU only has 80 GB of memory for an H100. So the biggest model you could ever hope to train in this scenario is something like 10 billion parameters, and that's not big enough. We want really big models. We don't want to be constrained by the tyranny of our GPU memory size telling us how big a model we're allowed to train.
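The arithmetic here is simple enough to sketch directly. The scalar counts and the 80 GB H100 figure come from the lecture; everything else is just unit conversion, and the four-scalars-per-parameter count is a lower bound rather than an exact rule.

```python
# Back-of-envelope training-memory accounting: ~4 scalars per parameter
# (weight, gradient, two Adam moments), 2 bytes each at 16-bit precision.
# Mixed-precision recipes often keep some of these in 32-bit, so treat
# this as a lower bound.

BYTES_PER_SCALAR = 2      # 16-bit precision
SCALARS_PER_PARAM = 4     # weight + gradient + Adam first/second moments

BYTES_PER_PARAM = BYTES_PER_SCALAR * SCALARS_PER_PARAM  # 8 bytes

def training_memory_gb(num_params: float) -> float:
    """GB needed to hold weights, gradients, and optimizer state."""
    return num_params * BYTES_PER_PARAM / 1e9

def max_params(gpu_memory_gb: float) -> float:
    """Largest model (in parameters) that fits in one GPU's memory."""
    return gpu_memory_gb * 1e9 / BYTES_PER_PARAM

print(training_memory_gb(1e9))   # 1B parameters -> 8.0 GB
print(max_params(80) / 1e9)      # H100 (80 GB) -> 10 billion parameters
```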
[00:41:01] So we need to fix this somehow, and the fix is actually relatively easy: we need to split the model weights across the different GPUs. So in addition to splitting the batch of data across GPUs, we're also going to split our model weights across the GPUs. This leads to a variant of data parallelism called fully sharded data parallelism, or FSDP. Conceptually it's relatively simple: each model weight in the network, each weight W_i, we are going to assign to an owner GPU. So each weight will be owned by a unique GPU among the GPUs that we're training on, and the GPU that owns each weight will also be responsible for managing the global gradients for that weight and the optimizer state for that weight.
[00:41:47] Typically you would split this up by layer; you're not managing individual scalars. This W you should think of as the weight matrix for an entire layer of the neural network. So now the picture on the right changes a little bit. Here we're only showing two GPUs because, spoiler, there are going to be a lot more arrows flying around in just a moment. So here we're showing a four-layer network that we're distributing across two different GPUs. We've assigned the weights for the first two network layers, W1 and W2; those are owned by GPU 1. The weights W3 and W4 are owned by GPU 2. So at the start of each batch, the network weights are split up across the GPUs in this way. But it's still data parallelism.
[00:42:29] It's still the same basic idea: each GPU is going to load its own independent batch of elements, do a full forward-backward pass on that batch to compute its own local gradients, then we'll reduce the gradients and take a gradient step. Same basic algorithm, but it gets tricky now because the model weights are split up. So here we need to introduce extra communication. When you're doing fully sharded data parallelism, at the beginning of the forward pass, before you start doing the forward pass of the first layer, whoever owns the weight for the first layer needs to broadcast that weight matrix to all the other GPUs that you're training on. So in this case GPU 1 owns W1, so it broadcasts that to GPU 2. GPU 2 now has a copy of W1.
[00:43:07] Now that all the GPUs have a copy of W1, they can run a forward pass through the first layer of the network and compute the activations at the first layer. After you run the forward pass, each GPU that does not own W1 is going to delete its local copy of the W1 weight matrix to save memory. So after we've run the forward pass for the first layer, we're back in the state where the model weights are split up across the GPUs, but now all the GPUs also have activations in GPU memory that are the result of running the first layer of the network. Now it's time to do the second layer, and we do the exact same thing. The GPU that owns the weight matrix for layer 2 is going to broadcast it to all the GPUs that we're training on, and now they all have their own local copy of W2.
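The broadcast / compute / free cycle just described can be simulated in a few lines. This is a single-process toy, not real FSDP: the layer-to-owner assignment follows the two-GPU, four-layer picture from the talk (0-indexed here), and the per-layer "computation" is a stub.

```python
# Toy simulation of the FSDP forward pass: weights are sharded by layer,
# the owner "broadcasts" a layer's weight before it runs, and non-owners
# free their temporary copy afterwards. Illustrative only.

NUM_GPUS = 2
NUM_LAYERS = 4
OWNER = {0: 0, 1: 0, 2: 1, 3: 1}   # first two layers on GPU 0, rest on GPU 1

# resident[g] = set of layer weights currently held in GPU g's memory
resident = {g: {l for l, o in OWNER.items() if o == g} for g in range(NUM_GPUS)}

def fsdp_forward(x: float) -> float:
    for layer in range(NUM_LAYERS):
        # 1) owner broadcasts this layer's weights to every GPU
        for g in range(NUM_GPUS):
            resident[g].add(layer)
        # 2) every GPU runs the layer on its own local mini-batch (stub op)
        x = x + 1.0
        # 3) non-owners delete their temporary copy to save memory
        for g in range(NUM_GPUS):
            if g != OWNER[layer]:
                resident[g].discard(layer)
    return x

out = fsdp_forward(0.0)
# After forward, sharding is restored: each GPU again holds only its own layers.
```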
[00:43:50] Then they can go forward. And by the way, we also have an opportunity to interleave computation and communication here as well, so that while we are computing the forward pass for layer i, we can be prefetching the weights for the next layer. In practice this will happen in parallel during the forward pass of an FSDP run. So we'll be computing layer 2 at the same time we are fetching the weights for layer 3. And once we get to layer 3, note that GPU 1 owns layer 3, so GPU 1 will be broadcasting those weights to all the GPUs that we're training on. This will repeat until we've gotten to the end of the network.
[00:44:24] And at the end of the network, each GPU has computed a full forward pass, has its local loss on its own local mini-batch, and has all the activations for all the layers in memory, ready for backward. Now we need to do the same thing in reverse to compute the backward pass. At the beginning of the backward pass for the last layer, whoever owns the last-layer weights will broadcast them to all the devices. Once the devices have those weights, they can perform the backward pass, and we'll do a similar kind of procedure throughout the backward pass. Now there is a little bit of an optimization we can do on the very last layer in the network, which is: don't delete the weights; have all the GPUs keep the weights for the last layer in memory. This is something that you'll usually do in practice.
[00:45:04] Because at the end, all the GPUs already have a copy of the last layer's weights from the forward pass; they'll just keep it in memory, because they know they're about to reuse it for the backward pass anyway. So we just won't delete the weights from the very last layer. Now there are basically three things that need to happen during the backward pass. One is that once the GPUs have a copy of the weights and have computed the backward pass for the last layer of the network, each GPU has its own local gradients of its local loss with respect to the last-layer weights.
[00:45:40] Then we need to communicate those gradients back, and we said that the GPU that owns a weight matrix is also going to be responsible for managing the gradients for that weight matrix. So rather than all-reducing the gradients as we did in the data parallelism case, just the one GPU that owns the last-layer weights is going to gather and take a sum across the local gradients from all our devices. In this case, GPU 1 is going to send its last-layer local gradient to GPU 2, which will then have the full gradient dL/dW4 of the entire macro-batch with respect to the last-layer weights. What happens during the downtime? You've got to get all this stuff happening in parallel.
[00:46:25] So there are basically three things that need to happen during backward. One, we need to communicate the weights: whatever GPU owns the weights for a layer has to broadcast them. Two, all the GPUs, once they get those weights, need to compute the backward pass for that layer. And three, after each GPU computes its backward pass, it needs to send the resulting gradients with respect to that layer's weights back to the GPU that owns them. And then, once the owner of the weights has the full gradient, only the owner of the weight matrix makes the gradient update on that one weight matrix.
[00:47:05] But at this point we actually do not need to communicate the updated weight matrix, because it will get re-communicated to all the GPUs on the next forward pass; that's a little bit different from the DP case. And basically all of these things can happen in parallel as well. This will repeat for every layer of the network, and in the steady state of a very deep network all three of these things will be happening simultaneously: while we are computing the backward pass for layer L, we will be aggregating the gradients and performing a weight update on layer L+1, and we will be prefetching the weights for layer L-1. So I said there are three things that need to happen.
[00:47:48] We need to get the weights, run the backward pass, and then aggregate the gradients and update the weights, and these things can all happen in parallel. So in general we'll be operating on three consecutive layers and doing all three of these things in parallel over the course of the backward pass. Then, as we chunk backwards over the network, hopefully, if you were properly able to overlap all that communication and computation, by the time you finish your backward pass all the gradients have already been communicated and all the GPUs have already finished their updates on all the weights, and we're ready. And also, hopefully, your data loader that's loading data is also running asynchronously, usually on the CPU cores of our servers.
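The steady-state overlap described here (backward for layer L, reduce-and-update for layer L+1, weight prefetch for layer L-1) can be written out as a schedule. A toy enumeration for a four-layer network; the operation names are illustrative, not an FSDP API.

```python
# Enumerate the three overlapped operations at each backward step.
NUM_LAYERS = 4
schedule = []
for L in range(NUM_LAYERS - 1, -1, -1):        # backward: last layer first
    step = {"backward": L}
    if L + 1 < NUM_LAYERS:
        step["reduce_and_update"] = L + 1      # previous layer's grads land
    if L - 1 >= 0:
        step["prefetch_weights"] = L - 1       # next layer to run backward
    schedule.append(step)

# In the steady state (middle of the network) all three happen at once:
# e.g. backward(2) overlaps reduce_and_update(3) and prefetch_weights(1).
```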
[00:48:29] So the CPU is ready with a fresh batch of data to go forward again. These things are basically parallelization machines: we have a lot of stuff that needs to happen, both within a GPU and across GPUs, and we need to overlap all of it as much as possible, so we can always feed the GPUs and keep those tensor cores running as densely as possible. So then we're basically ready to do our next batch. This is great; this is fully sharded data parallelism, and it can get you a long way. But there's actually a slightly fancier variant of data parallelism that people sometimes use, called hybrid sharded data parallelism, or HSDP. In this case we're going to imagine conceptually dividing our GPUs into a two-dimensional grid.
[00:49:14] In the previous examples we had N GPUs, and the way we parallelized our computation was kind of the same for all of them; we had one axis of parallelization in the previous variants of data parallelism. Once we get to hybrid sharded data parallelism, we now have two separate axes of parallelism that we use at the same time. Along the first axis we do the typical fully sharded data parallelism that we just talked about. So we'll have groups of K GPUs, and each group of K GPUs will be doing FSDP: within each group of K GPUs, the model weights will be split across those K GPUs, and they will be interleaving sending weights and gradients back and forth to each other during the forward and backward passes.
[00:50:00] But we will now have M copies of those K-GPU groups operating in parallel. In this case we have two groups of four GPUs. Each group of four GPUs, you see, has the weights split across its four GPUs, but we have the entire setup duplicated a second time on a second group of four GPUs. And then they do typical data parallelism across the groups. So within a group we do forward-backward, and at the end of the backward pass each group will have computed its own local gradients. Then the groups need to all-reduce the gradients across groups, so that we have the full macro-macro-batch gradients across the two groups, and then each group can make a gradient update independently once it has received the full gradients for the macro-macro-batch.
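One way to picture the two-dimensional grid: with M replica groups of K GPUs each, a global GPU rank maps to a (replica group, shard position) pair. A hypothetical mapping matching the two-groups-of-four example; the function name and layout are illustrative, not a standard API.

```python
def hsdp_coords(rank: int, k: int) -> tuple:
    """Map a global GPU rank to (replica_group, shard_position).

    GPUs in the same replica group do FSDP together; GPUs at the same
    shard position across groups all-reduce gradients with each other.
    """
    return (rank // k, rank % k)

K = 4                                        # FSDP shard-group size
coords = [hsdp_coords(r, K) for r in range(8)]
# ranks 0-3 form replica group 0; ranks 4-7 form replica group 1
```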
[00:50:49] This is called multi-dimensional parallelism, because now there are basically two different axes, two different strategies, that we're using to parallelize our computation simultaneously. And the reason this might be useful is that there are different amounts of communication required for these two kinds of parallelism. Think about fully sharded data parallelism: what do we actually need to communicate during FSDP? During the forward pass, remember, we were copying the weights all over, so over the forward pass we end up communicating one full copy of the network weights. Then during the backward pass we need to re-communicate the network weights, and we also need to communicate the gradients.
[00:51:28] So basically, when you use fully sharded data parallelism, during a single forward-backward pass you need to communicate three times the network weights across everything participating in an FSDP group. But when you do normal data parallelism, where each group keeps its own independent copy of the weights, you only need to all-reduce the gradients. That means that across multiple data parallelism groups, you only need to communicate the network weights once over a forward-backward pass. And this plays into the idea of multiple levels of hierarchy inside our GPU clusters. What you might do, for example, is take a GPU server with eight GPUs and a higher-bandwidth interconnect inside the single machine and make those an FSDP group, because more communication is required inside an FSDP group.
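The communication counts here can be made explicit, in multiples of the total model size |W| per forward-backward step. These are the rough totals the lecture quotes; constant factors from how the collectives are implemented (e.g. ring all-reduce) are ignored.

```python
# Per-step communication volume, in multiples of the total model size |W|.

def fsdp_comm_multiple() -> int:
    broadcast_fwd = 1   # weights broadcast layer-by-layer during forward
    broadcast_bwd = 1   # weights re-broadcast during backward
    reduce_grads = 1    # local gradients reduced to each weight's owner
    return broadcast_fwd + broadcast_bwd + reduce_grads

def dp_comm_multiple() -> int:
    return 1            # only the gradient all-reduce across replicas

# FSDP moves 3x|W| inside a shard group; plain DP moves 1x|W| across groups,
# which is why the chattier FSDP axis goes on the faster intra-server links.
```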
[00:52:11] But then you could have multiple servers on the other axis. So you have one server with a full copy of the model weights, then another server with another full copy of the model weights, and remember that communication across servers is going to be slower than communication inside a server. So this is our first example of designing algorithms to take advantage of the network topology that we know our devices are connected into. [Student question.] The question is, would you rather have... these things are basically impossible to tune; it's very, very hard to say. But once you have data parallelism, FSDP, and HSDP, this is actually a recipe that can take you a long way.
[00:52:50] For example, a model with 100 billion parameters would take 800 GB of memory to store, and if you split that over 80 GPUs, it only takes 10 GB of memory per GPU. So you can have a pretty big model once you have FSDP. [00:53:03] But there's another problem: the model activations themselves now start to fill up memory. If we go back to Llama 3 405B, it's a transformer with 126 layers, a model dimension of 16,000, and sequence length 4096. If you imagine how much GPU memory it takes just to store the hidden states during a forward pass, that's going to be a lot. [00:53:28] So that's going to quickly cause your GPU to run out of memory once your models and sequences get really big. That leads to another trick called activation checkpointing, which means that we're actually not going to store all the activations in memory.
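To put numbers on that, here is a back-of-the-envelope sketch for a Llama-3-405B-like shape, assuming bf16 (2-byte) activations and counting only the per-layer hidden states; attention scores and MLP intermediates make the real total several times larger:

```python
def hidden_state_gb(layers, seq_len, d_model, bytes_per_el=2, batch=1):
    """Memory to keep one hidden-state tensor per layer during the forward
    pass. Real activation memory is several times larger, since attention
    and MLP intermediates also get stored for the backward pass."""
    return layers * seq_len * d_model * bytes_per_el * batch / 1e9

# 126 layers, d_model 16384 (the lecture rounds this to 16,000), seq 4096:
print(hidden_state_gb(126, 4096, 16384))  # ~16.9 GB per sequence
```

Nearly 17 GB per single sequence, before intermediates and before any batching, which is why activations become the bottleneck even after FSDP shards the weights.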
[00:53:36] We're going to recompute them during the backward pass. To see how this works, it's useful to think of your neural network in a different way, where each layer in the network does two things. It does a forward pass that computes activations for the next layer. Then it has a backward pass that computes gradients, which takes both the upstream gradients and the activations. [00:53:56] So normally, how much compute and memory does this all take? If we assume each layer's cost is constant, then a forward-backward pass will step forward 1, 2, 3, 4, remembering the activations during the forward pass, then step backward 1, 2, 3, 4. So a normal forward-backward pass takes O(n) compute and O(n) memory for an n-layer network.
[00:54:21] But as we just said, this is going to run out of memory. So instead, what we can do is recompute the activations during the backward pass. What that looks like is something like this. We'll start with the input, run the forward pass for the first layer, then immediately throw away those activations, and do this four times. [00:54:44] Now we've gone through the network once and have the activations at the last layer. At this point we can compute our backward pass for the last layer. But now we're out of luck: we don't have the activations from A3 to compute the next backward pass. But we can recompute them. So we recompute them, then we can do the backward pass. Now recompute some more, now do the backward pass.
[00:54:59] Now recompute, now do another backward pass. If you add this all up, it ends up being O(n^2) compute and constant memory for a network with n layers, because it's the sum (n-1) + (n-2) + (n-3) + ... + 1. That's quadratic time. [00:55:18] And n^2 compute is pretty bad for deep networks. So instead, let's not recompute everything; let's instead imagine taking a checkpoint of activations every few layers, so we only recompute within smaller blocks of the network. [00:55:30] In that case, if you keep c checkpoints, remembering your activations c times over the course of your network, then it's going to take O(n^2/c) compute and O(c) memory.
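Those counts are easy to sanity-check with a toy counter. In this sketch, "compute" counts layer-forward executions and "memory" counts stored checkpoint activations; every backward step recomputes from the nearest checkpoint, as described above:

```python
def costs(n, c=None):
    """Count layer-forward executions ('compute') and stored checkpoint
    activations ('memory') for a backward pass over an n-layer network.
    c=None: full recomputation from the input for every backward step.
    c=k:    keep a checkpoint every n//k layers and recompute from the
            nearest one (the block scheme described in the lecture)."""
    step = n if c is None else n // c
    checkpoints = list(range(0, n, step))   # activation indices we keep
    compute = n                             # one full forward pass
    memory = len(checkpoints)
    for i in range(n - 1, -1, -1):          # backward through the layers
        start = max(cp for cp in checkpoints if cp <= i)
        compute += i - start                # re-run forwards from checkpoint
    return compute, memory

print(costs(16))        # (136, 1): quadratic compute, constant memory
print(costs(16, c=4))   # (40, 4):  ~n^2/c compute, O(c) memory
print(costs(16, c=16))  # (16, 16): store everything, no recompute
```

With c = 4 = sqrt(16) you can see the middle ground the lecture recommends: far less compute than full recomputation, far less memory than storing everything.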
[00:55:41] A pretty common thing to do is to set c equal to √n, in which case this becomes O(n√n) compute and O(√n) memory. So this is a pretty common way to trade off computation and memory to train even bigger models. [00:55:54] Okay, so at this point, once we have FSDP, activation checkpointing, and HSDP, we can do a lot of damage. We can start to train some really big models. And the recipe for that is basically the following. [00:56:08] The scaling recipe that will take you quite a long way from here: first, use plain data parallelism, roughly up to 128 GPUs and roughly up to models of around a billion parameters. You can just do normal data parallelism for models of this size; it tends to work pretty well. [00:56:24] Another thing: you almost always want to set the local batch size per GPU to max out the GPU memory.
[00:56:28] That's almost always the right thing to do. Then, once your model starts to get big, the model itself will take up a lot of memory inside your GPU, and that will start to give you problems. It depends on how much memory your GPU has and how fast your interconnects are, but in general, once your model is more than a billion parameters, that's when you want to start thinking about switching from data parallelism to fully sharded data parallelism. [00:56:53] At that point you can scale up quite a bit, but then you'll run into the memory bottleneck for your activations, and that's when you turn on activation checkpointing. Activation checkpointing kind of sucks, because it makes everything a lot slower, but it does let you train much bigger models.
[00:57:07] This will scale up to several hundred GPUs, and then at some point, usually depending on your cluster topology, maybe around 256 GPUs, maybe around 512 GPUs, once you get to the order of multiple hundreds of devices, FSDP becomes too expensive and you need to start switching to HSDP. [00:57:26] This is basically going to let you get up to models of roughly tens of billions of parameters training on maybe a thousand GPUs, and that's at pretty long sequence lengths. So that's pretty good. [00:57:38] But if you have more than a thousand GPUs, more than 50-billion-parameter models, or sequence lengths of more than 10,000 or so, this is when you need to turn to the more advanced strategies: context parallelism, pipeline parallelism, or tensor parallelism.
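Collapsed into code, the rules of thumb above might look like this. The thresholds are the lecture's ballpark numbers, not hard limits; real choices depend on GPU memory, interconnect speed, and cluster topology:

```python
def pick_strategy(n_gpus, params_b, seq_len=4096):
    """Rough parallelism recipe from the lecture, as a decision sketch.
    params_b is the model size in billions of parameters."""
    if n_gpus > 1000 or params_b > 50 or seq_len > 10_000:
        return "add context/pipeline/tensor parallelism"
    if n_gpus >= 256:          # FSDP all-gathers get too expensive at this scale
        return "HSDP (+ activation checkpointing)"
    if params_b > 1:           # weights no longer fit comfortably per GPU
        return "FSDP (+ activation checkpointing as activations grow)"
    return "plain data parallelism, max out per-GPU batch size"

print(pick_strategy(8, 0.3))                        # plain data parallelism
print(pick_strategy(64, 7))                         # FSDP
print(pick_strategy(512, 30))                       # HSDP
print(pick_strategy(4096, 405, seq_len=131_072))    # advanced strategies
```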
[00:57:53] And then there's a big question: oh my god, there are a lot of knobs to tune here. How am I supposed to optimize this? I need to set the global batch size, the local batch size, the HSDP dimension, the FSDP dimension. How much do I recompute? I'm lost here; what do I do? There are so many knobs. [00:58:06] The answer is to optimize a very important metric called model FLOPs utilization, MFU. Whenever you get lost in the sea of GPU parallelism, model FLOPs utilization is your guiding light. Follow it, and it will tell you what to do to optimize your training stack. [00:58:22] But before we get to model FLOPs utilization, we need to talk about hardware FLOPs utilization. Remember, we said that in theory an H100 can do 989.4 teraFLOP/s of compute on the tensor cores. But that's theoretical. How much can you actually get?
[00:58:38] The question is how much you can actually achieve in practice, and that's the metric of hardware FLOPs utilization: you're running some compute on the device, and you ask how much of that theoretical maximum you actually realize. [00:58:48] And this is not hard to measure. You can write a couple of lines of PyTorch code and just benchmark it. So this is a benchmark that I wrote and ran on an H100 yesterday. What it does is basically run a dense matrix multiply in a loop and time how long the matrix multiply takes; we can compute how many FLOPs the matrix multiply requires. [00:59:09] Then on the x-axis, we're plotting the size of our matrix, going from 512 up to 32,000.
[00:59:15] And the y-axis is the hardware FLOPs utilization, which is basically the fraction of the theoretical maximum throughput of the device that we actually realize from these matrix multiplies. You can see that with this pretty straightforward PyTorch loop, we're getting about 80% HFU on an H100 once we get to large matrix multiplies of around 8,000 by 8,000. So that's pretty good. [00:59:33] But the problem is that HFU does not account for all the other stuff that your model needs to do, right? We're maybe doing activation recomputation, maybe running some other models on the side, maybe doing data loading and data augmentation. There's a lot of other stuff your GPU is doing other than just forward-backward on your raw model. And that's where we move from hardware FLOPs utilization to model FLOPs utilization.
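The HFU number behind a benchmark like that is just achieved throughput over peak throughput. For a timed square matrix multiply it can be computed as follows (a sketch: a dense n-by-n-by-n matmul costs about 2n^3 FLOPs, the 989.4 TFLOP/s figure is the H100 tensor-core peak quoted earlier, and the example timing is hypothetical):

```python
def hfu(n, seconds, peak_tflops=989.4):
    """Hardware FLOPs utilization for an n x n x n matmul that took
    `seconds` to run: achieved throughput over theoretical peak."""
    flops = 2 * n ** 3                    # multiplies + adds in a dense matmul
    achieved_tflops = flops / seconds / 1e12
    return achieved_tflops / peak_tflops

# Hypothetical timing: an 8192^3 matmul finishing in 1.39 ms
print(f"{hfu(8192, 1.39e-3):.0%}")  # 80%
```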
[00:59:55] Model FLOPs utilization is basically asking what fraction of the GPU's theoretical peak FLOPs is being used for the forward-backward pass of my model. And this is the thing you always want to optimize for. [01:00:06] To make this more concrete: based on your model architecture, the number of layers, and the size of the layers, you compute how many FLOPs it takes to do a full forward-backward pass of your architecture on your minibatch of data. Then you look up the peak theoretical throughput of the device you're running on. [01:00:25] You divide those two, and that tells you how long a full forward-backward pass should take if you were achieving the theoretical maximum throughput of the device. That's the theoretical fastest you could ever do a forward-backward pass on your model.
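That bookkeeping can be sketched directly. One common shortcut (an assumption here, not something stated in the lecture) is that a transformer forward-backward pass costs roughly 6 FLOPs per parameter per token; the step time comes from actually timing your training loop:

```python
def mfu(params, tokens_per_step, step_seconds, peak_tflops=989.4, n_gpus=1):
    """Model FLOPs utilization: the model's forward-backward FLOPs divided
    by what the GPUs could theoretically deliver in the measured step time.
    Uses the common ~6 * params * tokens approximation for fwd+bwd FLOPs."""
    model_flops = 6 * params * tokens_per_step
    deliverable_flops = peak_tflops * 1e12 * n_gpus * step_seconds
    return model_flops / deliverable_flops

# Hypothetical run: a 7B model, ~1M tokens per step, 8 H100s, 13 s per step
print(f"{mfu(7e9, 1_048_576, 13.0, n_gpus=8):.0%}")  # 43%
```

Everything the training loop does beyond the model's own math (data loading, communication, recomputation) shows up as a lower MFU, which is exactly why it is the metric to tune against.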
[01:00:38] Then you actually time a forward-backward pass of your model. Your training loop is doing all this other stuff: data loading, augmentation, communication, maybe activation checkpointing, so it's doing recomputation during the backward pass. Your training loop is doing a lot of stuff. Just time how long it actually takes, and then divide those two numbers. [01:00:55] That gives you a number between zero and one, which is the fraction of that theoretical maximum you're actually achieving in your training loop, and that's your MFU, your model FLOPs utilization. [01:01:07] Again, we can benchmark this with some relatively simple PyTorch code. Here's an example running forward-backward on a short multi-layer perceptron with a ReLU nonlinearity and really big, really wide
[01:01:19] MLP layers and a gigantic batch size on a single H100; this gets around 50% MFU. [01:01:28] In general, whenever you're trying to tune knobs for distributed training, you always want to tune whatever knobs you can to maximize MFU, because that's the one metric we typically care about when trying to optimize training throughput. [01:01:39] And in general, an MFU above 30% these days is pretty good. If you're way under 30%, you've probably got some gigantic bottleneck somewhere and something is going wrong. Above 40% is pretty excellent, and that's basically state of the art. [01:01:55] Here are some numbers that we can pull from a couple of papers. In particular, this is that Llama 3 405B paper that we talked about.
[01:02:03] In their final training setup, they have a couple of different variants of their training phases, where they train on between 8,000 and 16,000 GPUs simultaneously. Across those, they're getting MFUs roughly in the high 30s to low 40s, and that's pretty good; you're never going to get much higher than that on an H100. [01:02:22] Actually, paradoxically, more recent devices sometimes get worse MFUs. On the previous generation of devices, the A100s, you could sometimes get MFUs above 50%. The reason is that GPUs are getting faster at compute faster than they are getting faster at communicating. [01:02:38] When we moved from the A100 to the H100, we got roughly a 3x improvement in theoretical compute throughput, but only a 2x improvement in theoretical memory bandwidth.
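You can see the consequence of that imbalance with a little roofline arithmetic. The spec numbers below are rough public figures, used here as assumptions rather than taken from the lecture; the point is that the matrix size needed to stay compute-bound roughly doubles across the generation:

```python
def min_compute_bound_matmul(peak_tflops, mem_bw_tb_s, bytes_per_el=2):
    """Smallest square matmul size n whose arithmetic intensity
    (2n^3 FLOPs over ~3n^2 * bytes of memory traffic) reaches the
    device's FLOPs-per-byte balance point."""
    balance = peak_tflops / mem_bw_tb_s        # FLOPs per byte at the roofline
    # intensity(n) = 2n^3 / (3 n^2 bytes) = 2n / (3 bytes) >= balance
    return int(balance * 3 * bytes_per_el / 2)

# Rough public specs (assumptions): A100 ~312 TFLOP/s, ~2.0 TB/s HBM;
# H100 ~989 TFLOP/s, ~3.35 TB/s HBM.
print(min_compute_bound_matmul(312, 2.0))    # 468
print(min_compute_bound_matmul(989, 3.35))   # 885
```

Compute grew faster than bandwidth, so kernels need nearly twice the arithmetic intensity to stay compute-bound, and utilization numbers drift down.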
[01:02:47] So there's this growing gap: GPUs are getting faster really fast, but it's harder to scale the communication between them, and as a result we sometimes tend to get worse MFUs on more recent generations of devices. [01:03:02] And I intentionally wanted to spend most of the time on those points, because those are the ones you're probably going to use in practice. I don't think anyone in this room likely has access to a 10,000-GPU cluster. If you do, come talk to me after class; I would love to be your friend. [01:03:19] So those are the ones you're likely to encounter in practice, up to many hundreds of GPUs.
[01:03:24] But there are these other strategies, and there are slides here that I think are pretty nice, but it's okay if we don't go through the full details; you can check them offline. [01:03:31] So, we said context parallelism is basically splitting on the sequence dimension. Transformers operate on sequences, and the idea is that you've got a long sequence, so you make different GPUs handle different parts of it. [01:03:47] If you recall your transformer block, this is actually easy for large parts of the transformer, because the layer norm, the FFN/MLP, and the residual connections all operate independently across the sequence anyway. So it's relatively straightforward to chunk up that computation across the sequence dimension.
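That "easy" part can be seen in a few lines: applying a position-wise layer to sequence shards and concatenating gives exactly the same result as applying it to the whole sequence. This is a single-process sketch with a toy element-wise "MLP" standing in for the position-wise parts of the block:

```python
def pointwise_layer(x):
    """Stand-in for layernorm / MLP / residual: acts on each sequence
    position independently, so shards don't need to talk to each other."""
    return [v * 2.0 + 1.0 for v in x]

def shard(seq, n_workers):
    """Split a sequence into n_workers contiguous chunks."""
    k = (len(seq) + n_workers - 1) // n_workers
    return [seq[i * k:(i + 1) * k] for i in range(n_workers)]

seq = [float(i) for i in range(16)]
full = pointwise_layer(seq)
# "Context parallelism" for position-wise ops: each worker gets one shard.
sharded = [y for chunk in shard(seq, 4) for y in pointwise_layer(chunk)]
assert sharded == full  # identical result, no communication needed
```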
[01:04:05] It does get a little bit hairy inside the MLP, because there are weights in there, so you have to do some reduce of the gradients, like we did in the data parallelism case. [01:04:14] The attention is where things get hairy for sequence parallelism, because, if we remember attention, we need to compute the all-pairs interaction between every pair of elements in the sequence. The QKV projection is easy, because that's trivially parallelizable over the sequence, but that core attention matrix gets pretty tricky to parallelize. [01:04:33] The first version of this that people developed is called ring attention, where you basically take that full attention matrix, chunk it up into blocks, and then have your GPUs work on those blocks independently in parallel, in the right order to make sure everything works out.
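The trick that makes this block-wise processing possible is computing a softmax-weighted sum incrementally, one key/value block at a time, carrying a running max and normalizer. Here is a single-process sketch of that accumulation for one query; ring attention distributes these blocks across GPUs, while this just shows the math works out:

```python
import math

def blockwise_attention(scores, values, block=2):
    """Attention output for one query, softmax(scores) . values, computed
    block by block with a running max `m` and normalizer `l`, so the full
    score row is never materialized at once."""
    m, l, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)          # rescale previous partial sums
        exps = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(exps)
        acc = acc * scale + sum(e * v for e, v in zip(exps, v_blk))
        m = m_new
    return acc / l

scores = [0.1, 2.0, -1.0, 0.5]
values = [1.0, 2.0, 3.0, 4.0]
# Matches the direct (all-at-once) computation:
w = [math.exp(s - max(scores)) for s in scores]
direct = sum(wi * vi for wi, vi in zip(w, values)) / sum(w)
assert abs(blockwise_attention(scores, values) - direct) < 1e-12
```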
[01:04:47] There are a lot of details in there; you can check out the paper for more details. The second one, which is a little bit conceptually easier, is called Ulysses attention, where you do parallelism over the heads. Remember, in a transformer you're almost always doing multi-head attention, where you're computing attention over multiple attention matrices all in parallel. So in Ulysses attention, we're going to parallelize the computation of that core attention operator over the heads, and then everything else, all the other parts of the transformer, is parallelized over the sequence dimension. And as an example, this context parallelism becomes important once you scale up your sequence length to be quite large.
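A toy sketch of the head-partitioned idea (hypothetical helper names; a serial loop stands in for the all-to-all between GPUs): each head's attention is fully independent, so disjoint subsets of heads can live on different devices.

```python
import math

def one_head(Q, K, V):
    # Plain single-head attention (no masking, no scaling, for brevity).
    out = []
    for q in Q:
        w = [math.exp(sum(a * b for a, b in zip(q, k))) for k in K]
        z = sum(w)
        out.append([sum(wi * v[d] for wi, v in zip(w, V)) / z
                    for d in range(len(V[0]))])
    return out

def multi_head(heads):
    # heads: list of (Q, K, V) per head; concatenate head outputs per position.
    outs = [one_head(*h) for h in heads]
    return [sum((o[i] for o in outs), []) for i in range(len(outs[0]))]

def multi_head_sharded(heads, n_devices):
    # Each "device" owns a disjoint subset of heads, needing no other head's data.
    per = (len(heads) + n_devices - 1) // n_devices
    shard_outs = []
    for d in range(n_devices):
        shard_outs.extend(one_head(*h) for h in heads[d * per:(d + 1) * per])
    return [sum((o[i] for o in shard_outs), []) for i in range(len(shard_outs[0]))]

headA = ([[1.0, 0.0], [0.0, 1.0]],) * 3   # (Q, K, V), all equal for brevity
headB = ([[0.5, 0.5], [1.0, -1.0]],) * 3
heads = [headA, headB]
assert multi_head(heads) == multi_head_sharded(heads, 2)
```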
[01:05:26] So if we go back to this example of Llama 3 pre-training, they actually train the model in two stages. In the first stage, they use a sequence length of 8,000 with no context parallelism whatsoever. Then they have a second stage of training where they crank the sequence length up to 131,000, and at that point they do 16-way context parallelism. That means each of those 131,000-token sequences has 16 GPUs operating on it in parallel. And that's kind of like saying the batch size is 1/16, because now each GPU is working on less than one element. So that's context parallelism. Pipeline parallelism: here we're going to split across the layers dimension. Intuitively, you have a network with a bunch of layers, and we're just going to divide the layers across the GPUs.
[01:06:13] That's actually a very intuitive thing to do. The problem is that there are sequential dependencies, right? Each GPU needs the activations from the previous GPU to continue running the forward pass, and during the backward pass it needs the gradients from the upstream layers in order to compute the backward pass. So we can draw a diagram like this, where the vertical axis is GPUs 1 to 4 and the horizontal axis is what happens over the course of time. Then you can see that GPU 1 runs forward, then passes the activations to GPU 2, which passes activations to GPU 3, which passes activations to GPU 4. GPU 4 is lucky: it can do forward and backward all at once, then pass gradients back to GPU 3, back to GPU 2, back to GPU 1.
[01:06:53] From this graph, that's obviously really, really bad, because the GPUs are mostly sitting idle. In fact, if you have n GPUs, you're only getting useful work out of them 1/n of the time. That means that if we had 8-way pipeline parallelism, your maximum possible MFU at that point is about 12%, which is terrible. And by the way, there's a cute name for this: it's sometimes called the bubble, that chunk of time where GPUs are waiting for work, waiting around for the communication. So the trick in pipeline parallelism is to shrink the bubble. You want to have less bubble, and the way we do that is by running multiple micro-batches simultaneously.
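The bubble math can be sketched with a standard back-of-the-envelope formula (an idealized approximation, not the lecture's exact accounting): with p pipeline stages and m micro-batches, each stage does useful work for m ticks out of roughly m + p - 1, so best-case utilization is m / (m + p - 1).

```python
def pipeline_utilization(stages, microbatches):
    # Fraction of time a stage does useful work in the idealized schedule.
    return microbatches / (microbatches + stages - 1)

# One batch through 8 stages: 1/8 = 12.5%, the "terrible" ~12% case.
assert round(pipeline_utilization(8, 1) * 100, 1) == 12.5
# Four micro-batches through 4 stages: 4/7, about the 57% from the slides.
assert round(pipeline_utilization(4, 4) * 100, 1) == 57.1
```

As m grows, utilization approaches 100%, which is exactly the motivation for running many micro-batches, at the cost of holding more activations in memory.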
[01:07:31] So now, rather than running one batch of data through all the GPUs forward and backward, we're going to have multiple batches of data in play simultaneously and shuttle them across the GPUs in parallel. There are a lot of different interesting patterns you can try to design for this, but here's a relatively simple one, where we have four-way pipeline parallelism: four GPUs all working in parallel, and four batches of data all active at the same time. The batches are color-coded. We see that GPU 1 runs forward on the blue batch, then forward on the yellow batch, then forward on the green batch, then forward on the red batch.
[01:08:08] And while GPU 1 is going forward on the yellow batch, we've passed the activations of the blue batch to GPU 2, and GPU 2 can now do forward on the blue batch. So these things can all cascade down and happen in parallel, and the same kind of pattern repeats during the backward pass. We can interleave these different micro-batches as we pipeline them through the different GPUs. In this case, with four-way pipeline parallelism and four micro-batches, the max theoretical MFU is just the fraction of this graph which is not white, and that increases now to 57%, which is pretty good.
[01:08:42] So with pipeline parallelism, in theory, if you go to lots and lots of micro-batches, your MFU is going to be good, because you can do a lot of work in parallel. But the more micro-batches you have, the more activations you need to store in memory, so now we need to do activation checkpointing. And then you think, oh crap, how do I tune these things? Should I have more pipeline parallelism? Should I have fewer micro-batches? Should I have more aggressive activation checkpointing? And then should I also layer data parallelism on top of that? I don't know. What are you going to do? Maximize MFU. You're going to try to tune all of those knobs to maximize the MFU of your training pipeline. Then the last one is tensor parallelism. For this one, you're going to split on the model dimension.
[01:09:24] Basically what we're going to do is: we have a lot of weight matrices in our model, and all those weight matrices are computing XW = Y. That's basically what we're doing over and over again inside our transformer. Now the idea is that we'll split each weight matrix across GPUs, and this is different from FSDP, because here we're actually splitting a single weight matrix across GPUs and computing with the shards directly. Each GPU does a block matrix multiply: each GPU computes a slice of that matrix multiply on the full input. In this case, we split our weight matrix into W1, W2, W3, W4, and then each GPU just computes a slice of that matrix multiply to compute a slice of the output.
And then a problem is that now you know [01:10:04] then a problem is that now you know after you do that forward pass then you [01:10:06] after you do that forward pass then you need to gather the activations across [01:10:08] need to gather the activations across all the GPUs to do the next to do the [01:10:10] all the GPUs to do the next to do the next forward pass. Um there's a slight [01:10:12] next forward pass. Um there's a slight trick which is if you have two of these [01:10:14] trick which is if you have two of these layers in sequence, you can actually get [01:10:16] layers in sequence, you can actually get away with not computing with not uh not [01:10:19] away with not computing with not uh not gathering in between two layers. So if [01:10:21] gathering in between two layers. So if you have two layers, you can sit down in [01:10:23] you have two layers, you can sit down in a quiet place and work through this. You [01:10:25] a quiet place and work through this. You split the first matrix into [01:10:26] split the first matrix into column-shaped chunks, then you split the [01:10:28] column-shaped chunks, then you split the second matrix into row-shaped chunks. [01:10:30] second matrix into row-shaped chunks. And if you do all this, then it all kind [01:10:32] And if you do all this, then it all kind of works out magically because of the [01:10:34] of works out magically because of the magic and mystery of block matrix [01:10:35] magic and mystery of block matrix multiplication. And you see that the [01:10:37] multiplication. And you see that the final output you can kind of compute as [01:10:38] final output you can kind of compute as an inner product like structure of these [01:10:40] an inner product like structure of these um of this block of these block matrix [01:10:42] um of this block of these block matrix multipliers of Y and U. So then you [01:10:45] multipliers of Y and U. 
[01:10:45] So you can basically have two layers of matrix multiply that are split across multiple GPUs, and they only need to communicate at the end of every two layers. This actually works out nicely because, remember, transformers have a two-layer MLP in the FFN, so it's a really nice trick that plays really nicely into the two-layer MLPs that transformers always have. It's pretty common in big transformers to use this two-layer tensor parallelism trick on the MLP in a transformer. So those are basically all of our mechanisms for splitting up computation across GPUs. Which one is the best? The actual answer is: all of them. In practice we're going to use N-D parallelism. We already saw an example of two-dimensional parallelism with HSDP.
[01:11:31] In practice, you know, the current state of the art is four-dimensional parallelism. If we go back to Llama, we see that on their biggest training run, with 16,000 GPUs, they're using 8-way tensor parallelism, 16-way context parallelism, 16-way pipeline parallelism, and 8-way data parallelism, all at the same time (8 × 16 × 16 × 8 = 16,384 GPUs). And these different mechanisms of parallelism have different communication requirements, so if you're careful in how you arrange these different axes of parallelism across your cluster, you can try to take advantage of the varying speeds of communication across your whole cluster. And that's basically a whirlwind tour of large-scale distributed training. So the takeaway for today is that an individual GPU is basically a general-purpose parallel computing machine.
[01:12:17] A GPU cluster is a giant, massively parallel machine with tens of thousands, maybe hundreds of thousands, of individual GPUs that we want to program as one big unit. And then we talked about multiple different mechanisms for parallelizing our computation across big clusters, as well as one trick, activation checkpointing, for saving memory. And then there's the one guiding-light metric that you're always trying to optimize when you design these pipelines, which is model FLOPs utilization. So the next time you're going out and training on tens of thousands of GPUs, I hope you keep this in mind. And let me know, so I can borrow your tens of thousands of GPUs.
================================================================================
LECTURE 012
================================================================================
Stanford CS231N | Spring 2025 | Lecture 12: Self-Supervised Learning
Source: https://www.youtube.com/watch?v=4howBU7THbM
---
Transcript
[00:00:05] Last time, on Tuesday this week, we had a lecture on GPUs: how to train and how to use them, and how to use multiple GPUs for training at larger scale, scaling your trainings and so on. That was an exciting new topic that we've added to this class this year, which I think is timely and very important with the increase in model sizes and the applications that you see AI models have these days. And before that, we covered all of the key tasks in computer vision: classification, semantic segmentation, object detection, instance segmentation, and so on. And we're going to revisit some of these topics, some of the results of the models we talked about, today.
[00:01:05] But those tasks are still quite important. And then we talked about visualizing and understanding the models, and seeing what the models are learning. For example, in the early sessions we talked about nearest neighbor in the pixel space, and how we can find the class of images based on only pixel-space distances, and we discussed how it's actually not efficient. One of the things that we talked about was that if we use the embedding layers or the feature layers, one of those fully connected layers, or the feature maps there, from a convolutional neural network or any other network architecture that we use, that could actually be a good representation of images.
[00:02:12] And we talked about using the L2 distance as the metric for nearest neighbor in the feature space of these models, right? So this means that these features are quite meaningful for the specific task that we had at hand. Specifically: if we run a neural network, a CNN, a ResNet, or even the transformer models, and look at the learned representations (in different contexts you may see these referred to by different names: representations, features, embeddings, latent space, and so on), these learned representations or features are very good representatives of the images. And if we have a way to extract those, we can always get the class labels out of those features by a simple linear model, as you can see at the end here.
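The nearest-neighbor-in-feature-space idea can be sketched in a few lines (a toy example: the hand-made feature vectors and labels below stand in for real penultimate-layer CNN embeddings):

```python
def l2(a, b):
    # Euclidean (L2) distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_neighbor_label(query_feat, train_feats, train_labels):
    # Return the label of the training feature closest to the query.
    dists = [l2(query_feat, f) for f in train_feats]
    return train_labels[dists.index(min(dists))]

# Toy feature vectors standing in for learned image embeddings.
train_feats  = [[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.1]]
train_labels = ["cat", "cat", "dog", "dog"]
assert nearest_neighbor_label([0.8, 0.2], train_feats, train_labels) == "cat"
```

The same query done on raw pixels instead of features is exactly the weaker baseline from the early lectures.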
[00:03:23] But the major challenge that exists is that training or building these networks at large scale is always challenging, and can you tell me why there's a challenge here? The thing is that at large scale we need a lot of labeled data, because this network is trained starting from an image, and at the end we have class labels. If we train this network, yes, these features are going to be very useful for getting those class labels out, right? But at scale, we need a lot of manual labeling effort, to sit down and label the images one by one. If the task is segmentation, you have to label the pixels one by one in every image, and that is going to be very challenging. So the question is: is there a way we can train neural networks without the need for huge manually labeled data sets?
[00:04:25] So these manual labels are the challenge, and we want to see if we can bypass them in a way that lets us train a neural network that gets us very good features. And with that, the topic of self-supervised learning comes to light, and that's what we are going to cover today. So, having a large data set of, say, images without any labels, our hypothesis is that we can train a neural network using an objective function, a pretext task, that gets us good features from the images. And then, when it comes to learning on a specific data set with a smaller set of data points which do have labels, we can basically transfer this trained encoder and use it to extract features for a downstream task or downstream objective.
[00:05:34] So here we want to define a pretext task: a task that is general enough to be able to learn some good features from the images, and then use that encoder (let's call it the encoder) to solve another problem, what we call the downstream task or downstream objective, which is the application that you care about. For example, we have a lot of natural images downloaded from the internet; we can train something out of that. Then we have a small data set for, say, one of these industrial applications or medical applications, where we have a few labeled images, and we can now use that transfer of knowledge to extract features and then classify, or perform whatever task we are interested in. So we want to delve into this topic and go a little bit deeper in understanding the different components.
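A minimal sketch of that transfer recipe (all names and numbers here are made up for illustration): a frozen "encoder" (just a fixed linear map standing in for a pretrained network) plus a small logistic-regression head fit on a handful of labeled downstream examples.

```python
import math

# Frozen "encoder": a fixed 2 -> 3 linear map standing in for a pretrained net.
ENC = [[1.0, 0.2, -0.5],
       [0.3, 1.0, 0.8]]

def encode(x):
    # Never updated during the downstream fit: features are simply extracted.
    return [sum(xi * w for xi, w in zip(x, col)) for col in zip(*ENC)]

def train_linear_probe(xs, ys, steps=300, lr=0.5):
    # Fit only a linear head (logistic regression) on top of frozen features.
    feats = [encode(x) for x in xs]
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(steps):
        for f, y in zip(feats, ys):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            g = 1.0 / (1.0 + math.exp(-z)) - y   # logistic-loss gradient
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    f = encode(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

# Tiny labeled downstream set: class 1 when the first coordinate dominates.
xs = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
ys = [0, 0, 1, 1]
w, b = train_linear_probe(xs, ys)
assert [predict(w, b, x) for x in xs] == ys
```

In a real pipeline the encoder would be a deep network pretrained with a pretext task, and the head could equally be a small MLP fine-tuned on the labeled set.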
[00:06:39] In a nutshell, what self-supervised learning is, as I said, is defining this pretext task on the dataset with no labels. The encoder gets us some learned representations, and then another module of the same neural network transfers the learned representations into the output space, which could be labels or outputs that are automatically generated from the data; they are not manual annotations. So if we can do this, then we have an objective function, a loss function, and a neural network to be trained with that loss function. [00:07:28] As you can see here, we sometimes call the second part a decoder, a classifier, or a regressor, depending on how we define our pretext task.
[00:07:38] I will give you some examples, but this could be any form of framework; when it's an encoder and then a decoder, this is more of an autoencoding framework that I'll talk briefly about. Okay, so after we do this training with the pretext task, we can now use the encoder and the learned representations for a downstream task, for which we just need to add one layer, a linear function, or even a fully connected neural network that predicts the labels, and these labels are now coming from the dataset. So that's the main concept of self-supervised learning: the pretext portion of it doesn't require any labeled data to do the training. But how to define the pretext task itself is not that straightforward. There are many different ways of defining those.
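To make the two phases concrete, here is a minimal sketch of the pipeline just described, assuming PyTorch is available; the tiny encoder, the 4-way pretext head, and the layer sizes are all placeholders for illustration, not the networks used in the lecture.

```python
import torch
import torch.nn as nn

# Placeholder encoder: a real one would be a deep conv net.
encoder = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> 8-dim features
)
pretext_head = nn.Linear(8, 4)                      # e.g. 4 rotation classes

# Phase 1: pretext training on unlabeled images; labels are generated, not annotated.
opt = torch.optim.SGD(
    list(encoder.parameters()) + list(pretext_head.parameters()), lr=0.1)
x = torch.randn(16, 3, 32, 32)                      # a batch of unlabeled images
y = torch.randint(0, 4, (16,))                      # automatically generated labels
loss = nn.functional.cross_entropy(pretext_head(encoder(x)), y)
opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: freeze the encoder and train only a new shallow head on the labeled set.
for p in encoder.parameters():
    p.requires_grad = False
probe = nn.Linear(8, 10)                            # 10 downstream classes
with torch.no_grad():
    feats = encoder(x)
logits = probe(feats)                               # trained with the real labels
```

Only `probe` (and optionally the last encoder layers) would be updated downstream; everything upstream keeps the pretext-learned weights.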
[00:08:48] For example, just keep in mind that we want to define the pretext task in a way that is, first, general enough to get us good features, and that doesn't require manual labeling; the labels should come from the data itself, right? So one example would be image completion, where we mask half of the image or parts of the image, and we define the task as: given the parts that are unmasked, predict the parts that are masked. Or, for example, we rotate the image by a specific angle, and the task is to take the image as input and predict the rotation angle it has gone through. Another one could be a jigsaw puzzle, where we have patches of the image that are not ordered, and the task is to output the correct order of these patches.
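The key property of all these tasks is that the labels come for free from the data. For the rotation task, for instance, a single unlabeled image yields four training examples and their labels with no annotation at all; a small NumPy sketch (the helper name is made up for illustration):

```python
import numpy as np

def make_rotation_batch(image):
    """Given one square H x W x C image, return its 4 rotated copies and the
    class labels 0..3, where label k means a rotation of k * 90 degrees."""
    views = [np.rot90(image, k=k, axes=(0, 1)) for k in range(4)]
    return np.stack(views), np.arange(4)
```

Applying this to every image in an unlabeled dataset produces a fully labeled 4-class pretext dataset.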
[00:09:52] And colorization is one of the popular ones. These are the four that we'll be covering very quickly today. Given the black-and-white version of the image, predict the colors for each of the pixels. So solving the pretext task allows the model to learn good features, which is what we wanted, and we can automatically generate labels for the pretext task; those are the two points I mentioned that we need for a task to qualify as a good pretext task for self-supervised learning. [00:10:28] Some quick considerations to always keep in mind on how to evaluate a self-supervised learning framework. There are many different pieces and areas that you can look into. The pretext task itself: because we are generating the labels and so on, it gives us the power to evaluate how well the model is able to solve that pretext task.
[00:11:02] So that's one of the factors. Then, representation quality itself is sometimes very important: looking at, for example, only the representations, without any fine-tuning or anything (which I'll be talking about), or even clustering the representations to see whether we see a pattern in them. And sometimes there are some good dimensionality reduction algorithms; I'm referring to t-SNE here, which we didn't really talk about, but it is a dimensionality reduction framework with which you can reduce the dimensionality of the learned representations, visualize them in 2D or 3D, and see if there is a pattern you can find in your representations.
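As a concrete example of that kind of inspection, scikit-learn's `TSNE` can project learned representations down to 2D for plotting; random features stand in for real ones in this sketch.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for learned representations: 50 samples, 64-dimensional features.
feats = np.random.default_rng(0).random((50, 64))

# Reduce to 2D; each row of `emb` can then be scatter-plotted, colored by
# class, to see whether the representations cluster into a visible pattern.
emb = TSNE(n_components=2, perplexity=10, init="pca",
           random_state=0).fit_transform(feats)
```

With good representations, points from the same underlying class tend to form visible clusters in the 2D plot even though no labels were used in training.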
[00:11:49] So robustness, generalization, and computational efficiency are all quite important, but the most important aspect that we are after is the performance on the downstream task, because we are doing the entire self-supervised learning pipeline, defining the pretext tasks and so on, to be able to improve results for a task of interest, something that we care about. [00:12:20] Let's see some quick examples of how this could be done. This is an example of: let's rotate the images and then predict the degree of rotation as the output. We can train this in a self-supervised manner without needing labels for the objects in the image. And we have a bunch of convolution layers in this example, and then a fully connected neural network at the end to do this regression or classification task.
[00:12:52] And this means that this is giving us some sort of good feature extractor, from which we can at the end remove the pretext-task-specific parts, the FC layers, and put one layer, or in some cases multiple layers, to classify the features into the object label. So this time we use the object labels to do the prediction and train this linear function itself. We often look for a shallow network here, because if the features are good enough, then we don't need to do a lot of training to get the class labels out.
[00:13:49] So this is self-supervised learning in general, and although we are talking about the computer vision applications, this paradigm of self-supervised learning is actually what enabled all of these large language models; GPT-4 and all of these frameworks are trained with mostly raw data, without any manual labeling. And not just language models: in speech, and these days quite a lot in robotics and reinforcement learning, because when we don't need any labeled data, we can start capturing raw data without any manual labeling and use it for training. That's why you see so many self-driving cars in the Bay Area collecting data: that's getting them the data, and they don't have to really annotate it, but they can still train models based on it.
[00:14:57] So with that, today's agenda: we'll cover some of these pretext tasks built from image transformations, and then I will talk a little bit about a set of algorithms around contrastive representation learning, which are slightly different from these image-transformation-based pretext tasks but have shown promise. So let's start with the first part, and we will cover the tasks one by one. [00:15:36] I talked quite a lot about predicting rotations. Let's see if we can actually rotate the images by random or arbitrary degrees and predict the rotation angle with a model. Our hypothesis here is that a model could recognize the correct rotation of an object only if it has the common sense, the visual common sense, of what the object should look like unperturbed.
[00:16:14] So these models are mostly designed around this concept of visual common sense, and if the model is able to capture that, it means it is also able to summarize the entire image into a useful set of good features. This paper, published in 2018, implemented this by exploring just four different angles: 0, 90, 180, and 270 degrees, rotating images by one of these angles and then using a convolutional neural network to predict which of the rotations was applied. And because they only created four different outputs, this is a classification task; it only has four different cases, right?
[00:17:21] It doesn't have to predict the exact value of the angle in degrees; it's actually just predicting one of these four classes, 0, 1, 2, or 3. So with that, the authors were able to learn good representations, and with those representations they started training the neural network on a downstream application, basically fine-tuning the encoder and the classifier. Actually, in this case they froze the first and second layers and then fine-tuned the last convolution layer and the linear layer. So they were not fine-tuning the entire network, but they were able to get very good results. [00:18:35] This is on the CIFAR-10 dataset, one of the datasets that we've talked about earlier.
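The partial fine-tuning just described (freeze the early layers, tune only the last convolution layer and the linear classifier) is done in PyTorch by turning off gradients; the layer sizes below are illustrative stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                  # stand-in for the pretrained conv net
    nn.Conv2d(3, 8, 3, padding=1),      # first conv layer  (frozen)
    nn.Conv2d(8, 8, 3, padding=1),      # second conv layer (frozen)
    nn.Conv2d(8, 8, 3, padding=1),      # last conv layer   (fine-tuned)
    nn.Flatten(),
    nn.Linear(8 * 32 * 32, 10),         # linear classifier (fine-tuned)
)
for layer in model[:2]:                 # freeze the first two layers
    for p in layer.parameters():
        p.requires_grad = False

# Only the unfrozen parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
```

Passing only `trainable` to the optimizer means the frozen layers keep their pretext-learned weights throughout downstream training.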
[00:18:44] You see that when the model is pre-trained, it starts with a good accuracy to begin with. That means it is already in good shape and has a good understanding of the objects even in the very first iterations. But if the task is simple enough, and CIFAR-10 is actually not too hard to train a model for, the fully supervised version and the one that starts with pre-training often converge to the same accuracy. Again, that's if the task is simple enough; in much harder applications, the supervised learning frameworks often don't get as good results if we don't do any larger-scale pre-training. Okay.
[00:19:37] They have also done some experiments on the PASCAL VOC 2007 dataset, which involves a number of tasks including classification, detection, and segmentation. For these three sets of tasks they used different setups, training just a few fully connected layers or all of the layers for the classification, detection, and segmentation tasks. If you look at the ImageNet-labels row, that is, if we have a huge labeled dataset and we pre-train on it, we already get a very high accuracy; but keep in mind that this is ImageNet with all of the labels involved in the pre-training. If, instead, we don't do any supervised pre-training and the pre-training is all self-supervised, it shows that this rotation framework is doing a much better job than many of
the other counterparts, the other methods, which we won't actually go into in much detail; but it is showing the efficacy of this rotation pretext task. And see how much better it is than starting with a random initialization of the weights: the difference between random initialization and pre-training with the rotation pretext task is huge, and this rotation pretext task is not equal to, but close to, pre-training on the entire labeled ImageNet. [00:21:37] So one of the things they looked into in this paper was the features, and how the learned features are meaningful.
[00:21:52] I mentioned earlier that one of the ways of evaluating pretext tasks, and self-supervised learning frameworks generally, is to look at the features, right? And you can always go from the features of the fully connected layers back to the image space; we talked about Grad-CAM and those other attention-based frameworks for how we can go from the features back to the image space. So this evaluation involves projecting the features into the image space and seeing what the model is looking at. If you look at the attention maps for the supervised model, it often has more focused maps, because it's only trying to solve the one single task of classification. So if it captures the eye and the shape around it, it doesn't care about the other parts very much.
[00:22:41] But in cases of self-supervised learning, often more areas are covered, because the model has to have a more holistic understanding of the image: we don't know what the downstream task is, and the goal is to perform equally well on many of them. So that's one of the tasks. If you do have any questions, keep them; I'll stop after going over some of the tasks, and if your question was not answered, then I would be happy to answer it. Okay. [00:23:16] Another popular pretext task was to create a 3x3 grid and then use networks to predict the location of each given patch with respect to the center patch. So for this patch, which is around here, the output should be three, because we only have eight positions: this is 3x3 and the center patch is the reference.
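The input to this relative-location task is a pair of patches cut from the 3x3 grid, and the label is which of the 8 non-center cells the second patch came from. A NumPy sketch of the patch extraction (the helper name is hypothetical):

```python
import numpy as np

def patch_pair(image, neighbor_index):
    """Split an image into a 3x3 grid and return (center patch, neighbor patch).
    neighbor_index in 0..7 numbers the 8 non-center cells row by row; that
    index is exactly the classification label for this pretext task."""
    h, w = image.shape[0] // 3, image.shape[1] // 3
    cells = [image[i*h:(i+1)*h, j*w:(j+1)*w]
             for i in range(3) for j in range(3)]
    neighbors = cells[:4] + cells[5:]   # drop the center cell (index 4)
    return cells[4], neighbors[neighbor_index]
```

Each unlabeled image thus yields eight (center, neighbor, label) training triples for free.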
[00:23:52] So this also turns out to be an eight-way classification task: it takes any of these patches, and it tries to output the location of that given patch with respect to the center patch. So this was another example, but a follow-up publication, which turned this into a jigsaw puzzle framework, instead of asking the model to just predict which of these eight positions a patch is in, tried to predict the exact permutation, the right permutation. What they did was use the same 3x3 grid, take all of the patches, shuffle them randomly, and then ask the neural network to say which one is the correct permutation. So they basically predict the correct permutation.
[00:25:05] Can you tell me what the number of permutations is for this setup? Say again? Nine factorial. Yes, exactly. So it's a huge number, right? 9! = 362,880. But what they did was create a lookup table with only 64 plausible permutations, so they only consider 64 permutations; when they shuffle, they do the shuffling based on one of these 64, and then the output will also be just a 64-sized vector. So again, this turns out to be just a simple classification task with 64 output classes, and they've shown this is also a great idea to define as a pretext task. On the same dataset, with a similar type of tasks that I talked about, and with the same way the supervision is done,
[00:26:17] They've shown their method was outperforming some of the previous models, previous frameworks, and again, remember that this was published in 2016. So the next pretext task is inpainting: predicting what is missing. What they did here was a simple masking strategy: you mask parts of the image and then you ask the model to inpaint those masked parts. How it was done: simple masking on the input image, but because we have all of the images, we actually have the desired output. So an encoder turns this into a feature space, there are some fully connected layers in the middle, and then there is a decoder that decodes the parts that are missing, and the loss function compares the output with what the ground truth was. And this is basically learning to reconstruct the missing pixels.
[00:27:34] Again, we've talked about autoencoders a couple of times before, and this is also some form of an autoencoder: it encodes the input image into a representation from which you want to decode the output. But this autoencoder is trained with a masking strategy, a masking objective. So, just to show you some examples: the inpainting evaluations are a little bit interesting and tricky, because when you want to inpaint this image, right?
[00:28:19] We can't say there is just one correct output here. Earlier reconstruction-based frameworks were actually creating a lot of fuzzy and very smooth outputs. And that's why the paper that I'm referring to here was actually using an additional adversarial objective function, which I'm not going to go into in detail, because this is a topic of discussion for the next lecture, generative models. But generally, how these frameworks work: we have a reconstruction loss, and the reconstruction loss is basically calculating the difference between the image x and the image after it's passed through the encoder.
[00:29:28] So this is element-wise multiplication, and we also have this mask here, because we want to calculate the loss function, the objective function, only on the masked area. So we do an element-wise multiplication with the mask as well. And this basically gives us the reconstruction loss for the part that was masked. So, as I said, it's also supplemented with an adversarial objective, an adversarial learning loss function, which ensures the images that are generated are real-looking, right? With that, they have been able to improve the reconstructed parts to look a little bit better, but again, details will be discussed in the next lecture.
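The masked reconstruction loss just described, an L2 loss restricted to the masked region via element-wise multiplication with the mask, can be sketched as follows. This is a minimal illustration with names of my own; the actual paper also adds the adversarial term mentioned above.

```python
import numpy as np

def masked_reconstruction_loss(x, x_hat, mask):
    """L2 loss computed only over the masked region.

    x      : ground-truth image, (H, W, C)
    x_hat  : network output,     (H, W, C)
    mask   : binary array, 1 where pixels were masked out, (H, W, 1)
    """
    diff = mask * (x - x_hat)               # element-wise multiplication
    return float(np.sum(diff ** 2) / max(mask.sum(), 1))
```

Only the masked pixels contribute; the unmasked region, which the network saw directly, is excluded from the objective.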
[00:30:35] So this reconstruction framework was again able to provide additional benefits when it's run on the same classification, detection, and segmentation tasks on the same set of datasets. I will come back to these reconstruction-based frameworks and masking in a bit, because it's one of the most used pretext tasks for pre-training these days. But before that, let me introduce this other pretext task of image colorization. This is another very simple framework setup: we take a colored image, because our dataset is mostly colored images, right? We turn that colored image into components, or channels, that separate the lightness, the illumination, from the color itself. There are several color spaces.
[00:31:49] If you've taken courses like computer graphics or CS131 or another computer vision class, you know that there are so many different color spaces. Mostly in computer vision we use RGB. But if you want to separate lightness, illumination, from color, there are some other color spaces. For example, the Lab color space, L-A-B, is one of those color spaces that separates lightness from color. So we have one channel for lightness and two channels for defining the actual color. And if we put all three channels, L, A, and B, together, we can actually get the colored image. So the pretext task here is simple: given the L channel, predict the A and B channels. Right? So again, we don't need to do any manual annotation. It's already in the data. And this was extended into other frameworks. Why should we only look at: given L, predict A and B?
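The L/ab separation described above uses the standard sRGB-to-Lab formulas; here is a self-contained sketch (libraries such as scikit-image ship an equivalent `rgb2lab`).

```python
import numpy as np

def rgb_to_lab(rgb):
    """Convert an (H, W, 3) sRGB image in [0, 1] to Lab (D65)."""
    # 1. undo the sRGB gamma
    lin = np.where(rgb <= 0.04045, rgb / 12.92,
                   ((rgb + 0.055) / 1.055) ** 2.4)
    # 2. linear RGB -> XYZ (D65)
    M = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    xyz = lin @ M.T
    # 3. normalize by the D65 white point and apply the Lab transfer
    xyz /= np.array([0.95047, 1.0, 1.08883])
    d = 6 / 29
    f = np.where(xyz > d ** 3, np.cbrt(xyz), xyz / (3 * d ** 2) + 4 / 29)
    L = 116 * f[..., 1] - 16            # lightness channel
    a = 500 * (f[..., 0] - f[..., 1])   # green-red axis
    b = 200 * (f[..., 1] - f[..., 2])   # blue-yellow axis
    return np.stack([L, a, b], axis=-1)

# pretext-task input/target split: given L, predict (a, b)
img = np.ones((4, 4, 3))                # a white image
lab = rgb_to_lab(img)
L_channel, ab_channels = lab[..., :1], lab[..., 1:]
```

The colorization network then takes `L_channel` as input and regresses (or classifies into bins of) `ab_channels`.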
[00:33:01] We can also do the reverse, right? And that led us to something that we call a split-brain autoencoder, where the input image is split, basically turned into the L channel, the lightness channel, and the color channels, these two images. This is one channel; this is two channels of color. And we train two functions, two neural networks, two sets of layers, each to predict the other one. And then at the end, in order to calculate the loss function and backprop, we just merge these two to generate the actual image, and an L2 loss, any distance function, can help with training this neural network. In a more generic framework or formulation, the idea is: given one channel or a set of channels, predict the others, and do the same for X2. So, sets of channels X1, sets of channels X2: given one, we can predict the other one.
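A toy numerical sketch of this split-brain idea, with single linear layers standing in for the two sub-networks; all names and dimensions here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy split-brain setup: X1 plays the lightness channel,
# X2 plays the color channels. Two stand-in "networks"
# (single linear layers) each predict the half they do not see.
n, d1, d2 = 8, 16, 32                   # samples, dim of X1, dim of X2
X1 = rng.normal(size=(n, d1))
X2 = rng.normal(size=(n, d2))
W12 = rng.normal(size=(d1, d2)) * 0.1   # f1: X1 -> X2_hat
W21 = rng.normal(size=(d2, d1)) * 0.1   # f2: X2 -> X1_hat

X2_hat = X1 @ W12
X1_hat = X2 @ W21

# merge the two predictions to reconstruct the full input,
# and train with a plain L2 loss on the merged result
full = np.concatenate([X1, X2], axis=1)
full_hat = np.concatenate([X1_hat, X2_hat], axis=1)
loss = np.mean((full - full_hat) ** 2)
```

For downstream use, the features of `f1` and `f2` are concatenated, which is the "concatenated features" evaluation mentioned a bit later.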
[00:34:10] And these are the neural networks for those. Merging them, we'll get the image, and then the loss function would be simple. So if we have such a framework, we can run it on everything, not just color and illumination, right? We can have data from some of these RGB-D sensors, those that have RGB channels and depth channels, like, for example, the Kinect and other sensors that they use in robotics. And given the RGB channels, predict the depth, and vice versa. And this was a very successful downstream task that was used for different applications. And as you can see, this model, the split-brain paper that I just mentioned, and the model that colorizes the images: those features themselves
[00:35:15] they do actually have a very good level of accuracy for predicting the class labels, and you can see there are many other frameworks used in these comparisons. Again, this is not as good as supervised learning, because there is no label involved here, and it's just based on the learned features, with concatenated features out of f_sub_1 and f_sub_2. Okay, so the image colorization pretext task was actually very interesting, because not only could we use it for pre-training neural networks, it was also itself useful somehow, because now we could colorize images that we don't have a colored version of. So we could colorize images, and then videos, that we don't have a colored version of.
[00:36:17] And not only that, one of the other interesting results that they've shown in the paper was this image of Yosemite and the Half Dome that they colorized. The interesting thing seen in this image is the consistency between the actual object, the Half Dome or the trees or the bridge, and its reflection in the water. So the model was also able to understand that this reflection should somehow preserve the color, based on how it was trained on vast amounts of data. Again, keep in mind that these models all predate large language models and large vision models, and they have been trained on specific tasks. So they're not trained for solving everything. So this could actually be extended into video settings, because now, if we have a video, we can have a reference frame that has the color and do the coloring for the follow-up frames.
[00:37:22] And how is this done? This is very simple, and it is also very useful, because by colorizing future frames in the video, what we are doing is basically trying to track pixels and objects in the video, and the model implicitly learns how these tracks should be formed. So the hypothesis is: learning to color video frames should allow a model to learn to track regions or objects without labels. And learning to color videos, because there are a lot of correspondences, is an interesting task by itself. I would suggest taking a look at the details; I'll talk about them very briefly.
[00:38:17] So if we have a reference frame, what we need to do for coloring the input frame is find pointers to where that specific object or pixel is, and then, based on that, see what the color is and copy it as the color for that pixel in the output, as the target color. And how this is done is very similar to the same topic of attention that we talked about. So it's about forming attention, for each of the pixels, between the reference frame and the target frame. We often run a CNN to see what features around those pixels should be used. And using those features, we can now calculate, for each of the target pixels, the attention, or the distance, to all of the pixels in the reference frame.
[00:39:18] And then, after defining this attention between the pixel of interest in the target frame and all of the pixels in the reference frame, we can take an average color based on those attention weights. So attention is basically just similarity between the two. So anyway, with that, what we can do is get the output color as an average under that attention, and then ultimately calculate the loss function, because we have the values of the right colors of those pixels in our data. And this was able, with the reference frame, to colorize the images. You see how consistent the coloring becomes; if we color the frames separately, there is no consistency over time.
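The attention-weighted color copy just described can be sketched as follows. This is a minimal illustration; in the actual work the similarities come from learned CNN features over video frames, and the whole pipeline is trained end to end.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def colorize_by_attention(ref_feats, ref_colors, tgt_feats):
    """Copy colors from a reference frame via feature similarity.

    ref_feats  : (N, D) features of the N reference pixels
    ref_colors : (N, C) known colors of the reference pixels
    tgt_feats  : (M, D) features of the M target pixels
    returns    : (M, C) predicted colors, one per target pixel
    """
    # attention = softmax over similarities to every reference pixel
    attn = softmax(tgt_feats @ ref_feats.T, axis=1)   # (M, N)
    # each target color is an attention-weighted average
    # of the reference colors
    return attn @ ref_colors
```

The training loss then compares these predicted colors to the true colors of the target frame, which are free labels taken from the video itself.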
[00:40:25] You often see, for example, a person's shirt or clothing change color, because there's no constraint to keep it consistent. And then there have also been very interesting applications, because now that you're calculating attention to a reference frame, you're actually able to track objects, track segments in videos, and even identify keypoints in the videos. That's a good question. Your question is about this slide, basically: how does the encoder know about the data to begin with and give us good learned representations? So all of these tasks that I presented and defined are trying to do something here, either decoding, classifying, or using regression to generate some outputs, in order to be able to train this encoder.
[00:41:36] So if your original images, if these are all natural images taken off the internet or ImageNet or whatever, then you are learning an encoder that can extract features from those types of images, right? With the pretext task. And then, when you remove the decoder and add this classifier to the end, you only need to train this part, because this encoder was already trained with all of these pre-training tasks that I just talked about. You're asking if the labels are coming from the decoder for pre-training the encoder. The answer to that is yes. That's why we define the pretext tasks: because we want to have some sort of labels, some outputs, right? And then, based on those outputs, we try to train this entire network, and along the way of predicting the right labels, this encoder is also trained. Good question.
[00:42:34] You're asking if the encoder and decoder are one big neural network, or whether this differs across different papers, different works. It has been completely different. In some cases it's not really a decoder; that's why I'm calling it a classifier in the example I showed you about predicting the rotation degree. That is just one simple neural network, right? The "decoder" is these FC layers. So this could be one entire network, and then you replace part of it with something for your downstream task. But in some cases, for example when I talked about autoencoding, encoding an image and then decoding another image, you often have two neural networks that are trained end to end, because you want to make use of that representation space in the middle.
[00:43:25] And in the next thing that I want to talk about, masked autoencoders, there is not even symmetry between the encoder and the decoder. They can be two different frameworks, two different neural networks, without any symmetry, to train the task. So this is very much task-dependent, pretext-task-dependent, but they could belong to the same architecture family that we know about, say a CNN or ResNet, or they could be two different architectures, even without any symmetry. Remember that these are the very first methods for self-supervised learning, so they're not supposed to solve everything.
[00:44:15] That's just a quick disclaimer, but the idea, the hypothesis here, is: if the model is able to say this image is rotated 90 degrees, it means that implicitly it understands the right orientation and direction, right? And then, if given an unrotated image, it will be able to recognize what is in it, right? But this is a limited task by itself; I agree with that. The question that you have is why they use 64 here, right? That's a good question, but it's an almost arbitrary choice. As I said, there are many different permutations here, 9 factorial, so it's a very big number. It doesn't make sense for us to be predicting all of those.
[00:45:15] What the authors did here: they decided to select a few of those permutations such that there is enough variation, because many of those permutations differ by just one switched patch, right? So they selected 64 of those that have the largest differences between them, and they selected just 64 because they wanted to solve a classification problem instead of other types of tasks. Okay. So I've been talking about these frameworks that often apply some transformation to the image or the videos and so on.
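The selection of 64 well-separated permutations mentioned above can be approximated with a greedy maximal-Hamming-distance pass. This is a sketch under my own simplifications (a randomly sampled candidate pool rather than all 9! permutations):

```python
import numpy as np

def select_diverse_permutations(k, n=9, seed=0):
    """Greedily pick k permutations of n items that are far apart
    in Hamming distance; a sketch of the jigsaw lookup-table idea."""
    rng = np.random.default_rng(seed)
    # candidate pool (a random sample keeps this sketch fast)
    pool = np.array([rng.permutation(n) for _ in range(2000)])
    chosen = [pool[0]]
    for _ in range(k - 1):
        # Hamming distance from every candidate to its nearest chosen perm
        dists = np.stack([(pool != c).sum(axis=1) for c in chosen])
        nearest = dists.min(axis=0)
        # add the candidate farthest from everything already chosen
        chosen.append(pool[int(nearest.argmax())])
    return np.array(chosen)

perms = select_diverse_permutations(64)   # the 64-entry lookup table
```

Each training example shuffles the 3x3 patches by one of these 64 permutations, and the network predicts its index, a 64-way classification.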
[00:45:59] And this brings us to a newer framework, published in 2021, which has had many follow-ups and has been a great framework for pre-training for many tasks. Even today, when we want to pre-train on a raw, unlabeled dataset, we often use this MAE framework, masked autoencoders. It is also a reconstruction-based framework, similar to the masking-and-inpainting strategy I mentioned, but far more elaborate. As you can see, this framework does not select just one mask: many different patches and locations are masked, with even more aggressive sampling, 50% or 75% masking ratios.
[00:46:59] Through training at large scale, they have shown not only that they can reconstruct all of those masked areas, but also that they get very good encoders that summarize images into good features. This was done by defining an encoder and a decoder, and it is one of the examples I mentioned where the encoder and decoder are not symmetric. A large portion of the input patches are masked, and the patches that are not masked are given to the encoder, which encodes them into features; those features are then passed through the decoder to generate the complete image. But let's go a little bit into the details of what this means. I have some details on how these models are trained here, but let me very briefly explain how they typically work.
[00:48:13] The encoder here is very similar to ViT; all of these models are based on transformers. As with the ViTs we've talked about, the images are split into patches, and the patches are then sampled. They used uniform sampling and showed that a 75% masking ratio was quite effective in their experiments. This high masking ratio makes the prediction task very challenging, and in the context of pretext tasks for self-supervised learning, challenging means the task is meaningful: the model has to learn good features to be able to reconstruct the image. With such a high masking ratio, they can also augment the data a great deal, because each time you mask a different 75% of the patches.
[00:49:22] So you can reuse the same image many times during training, which gives you a lot of data to train this encoder with, and that's why they use a huge encoder, a large ViT. The encoder itself only sees 25% of the patches. It embeds those with a first linear projection into an embedding space, then positional embeddings are added, exactly as we described for ViTs, and all of these are transformer blocks. The encoder is very large, as I just mentioned. Then comes the decoding part. We have the embeddings of all the patches that were present, but for the patches that were masked, those embeddings are missing.
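The masking step just described, keeping a random 25% of the patches and dropping the rest before the encoder, can be sketched in a few lines. This assumes patches have already been flattened to vectors; it is a NumPy sketch of the idea, not the reference implementation:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Per-sample random masking, MAE-style.

    `patches` has shape (num_patches, dim). A random subset of
    (1 - mask_ratio) patches is kept; the rest are dropped before the
    encoder, which is what makes the encoder cheap to run despite
    being large.
    """
    n, d = patches.shape
    n_keep = int(n * (1.0 - mask_ratio))
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)           # random shuffle of patch indices
    keep_idx = np.sort(perm[:n_keep])   # indices of visible patches
    mask = np.ones(n, dtype=bool)       # True = masked / hidden
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

# e.g. a 14x14 grid of 196 patches: the encoder sees only 49 of them.
```

Resampling the mask on every epoch is what lets the same image be reused many times, as mentioned above.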
[00:50:27] There is a trainable parameter, very much like the class token we had before: a shared mask token, which you can loosely think of as an average patch representation, that is put in place of the patches that are missing or masked. The decoder then has to transform these tokens into the image patches of the entire image, and the entire image is the output target. How do we train this? With a simple mean-squared-error (MSE) loss between the image and the reconstructed image. The loss function is only computed over the masked patches, similar to the previous approach I just talked about.
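Putting the two pieces together, the shared mask token and the masked-only MSE loss might look like the sketch below. It assumes tokens and patches are plain vectors; the real model, of course, uses learned transformer weights around these steps:

```python
import numpy as np

def assemble_decoder_input(encoded, keep_idx, n_patches, mask_token):
    """Rebuild the full-length token sequence for the decoder.

    Encoder outputs go back to their original positions, and every
    masked position is filled with the single shared (learnable)
    mask token.
    """
    tokens = np.tile(mask_token, (n_patches, 1))  # start with all mask tokens
    tokens[keep_idx] = encoded                    # restore visible patches
    return tokens

def masked_mse(pred, target, mask):
    """MSE between reconstructed and original patches, averaged only
    over the masked positions, as described above."""
    per_patch = ((pred - target) ** 2).mean(axis=1)  # MSE per patch
    return per_patch[mask].mean()                    # masked patches only
```

Excluding the visible patches from the loss keeps the training signal focused on what the model actually had to predict.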
[00:51:32] When it comes to using the model for downstream tasks, they've shown in the paper that you can do either linear probing or full fine-tuning for whatever application you have in mind. In linear probing, you typically keep your encoder frozen, use the learned representations, and only learn a linear function for the final task; the mark in the figure indicates the part being trained. In full fine-tuning, the pre-trained encoder is also fine-tuned, either all of it or just a few transformer blocks. Linear probing provides a measure of representation quality, of how good the learned features are, while fine-tuning exploits the model's full potential to adapt to new tasks. Okay.
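As a concrete toy illustration of linear probing: the "encoder" features below are precomputed and frozen, and only a softmax classifier is fit on top with plain gradient descent. This is an illustrative sketch, not the evaluation protocol from the paper:

```python
import numpy as np

def linear_probe(features, labels, n_classes, lr=0.1, steps=200, seed=0):
    """Fit only a linear softmax classifier on frozen features.

    The encoder never appears here: linear probing touches nothing
    but this final linear layer (W, b), which is what makes it a
    measure of raw representation quality.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    W = rng.normal(scale=0.01, size=(d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n          # softmax cross-entropy gradient
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b
```

Full fine-tuning would instead also update the encoder weights that produced `features`, which this sketch deliberately leaves out.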
[00:52:42] If you're interested in this topic and you're planning to use this method, I highly advise looking at the paper and its follow-ups. There are many discussions around different aspects, model choices, and hyperparameters. Masking ratio is one: they've shown that 75% actually gives very high accuracy, which is why it was chosen. Others include decoder depth, decoder width, mask tokens, reconstruction targets, data augmentation and how it helps, and the mask sampling method. I'm showing the results here: the mask sampling question is mostly about whether to use random masking, block masking, or grid-style masking. You can see the examples here, and they came to the conclusion that random masking was the best choice.
[00:53:48] Finally, they were able to show that MAE does a much better job than many of the other methods in use at the time. Some of the other state-of-the-art methods were DINO and MoCo v3; if we have time, I'll briefly go over them. But this framework outperformed those more advanced contrastive learning frameworks of the time. I'll stop for a few questions if you have any, but first let me summarize what we've talked about.
[00:54:31] Pretext tasks are very important, and as I said, their focus is on understanding visual common sense. One thing, related to some of the questions that were asked, is that coming up with an individual pretext task is often challenging, because the learned representations may not be general enough, given the specific type of task you define. For example, if you're using completion, rotation prediction, jigsaw puzzles, or colorization, the learned representations are good for solving those specific tasks, but they may not be very good as general-purpose features.
[00:55:23] So the question is: in the split-brain autoencoder, how does the model know how to predict the other channel given, for example, the L channel, the lightness channel? Let me answer your question with a question. When you're training a model to predict the classes of objects in an image, how does the encoder know what features to extract to predict the class? Through labeled data: what you're doing is back-propagating a loss value that is calculated against those labels. It's the same story here. We define a network that takes one of the channels and outputs the other channel, and it is trained by back-propagating against what the output should be; the target is the other channel, which we do have in the data.
[00:56:24] So instead of defining the task as classification, predicting the class of the objects, here we define the task as predicting the color of the pixels, and the colors of the pixels are already in the dataset, so the loss function can still be calculated and back-propagated. The next question is how these outputs are used as input to the decoder. This is again a ViT-style transformer framework: the encoder turns every input patch into a token at the output, which is the representation of that specific input patch. We've talked about this, but we know this is not the list of all patches; some of the patches are masked. For those that are masked, we also train a shared mask token that the encoder outputs, a token that is basically something like an average token. It's a learnable parameter.
[00:57:28] We can't necessarily interpret it, but we can say it's probably something like an average token standing in for the mask. That shared mask token is put in the place of the missing patches, and then this long sequence is created. The decoder, another transformer, takes this long set of tokens and outputs those that are projected as the output pixel values. Perfect. So, we only have 15 minutes and a lot of things to cover. But what I wanted you to get out of this session was to understand what pretext tasks are and how we define them, and that one of the most widely used frameworks right now is the masked autoencoder, which we covered to a good extent.
[00:58:42] Anyway, we did look at these transformations, and we know that all of these transformations represent the same object as the original image, just in a different form. But we also know that the dataset contains other objects that look completely different. So suppose I define a task that says: for those that belong to the same object, try to bring them close in the representation space, basically attract them to each other; and for those that do not belong to the same object, try to maximize the distance between them in the latent space, basically repel their representations. This is another kind of task, often referred to as contrastive learning, or contrastive representation learning.
[00:59:50] And there are quite a number of very interesting methods to look at. We have sampled a few; there are so many papers in this space, especially from around 2018 to 2020: SimCLR, MoCo, CPC, and ultimately DINO, which borrows concepts from contrastive learning but is not strictly a contrastive learning framework. What we do, in order to define the attract and repel behavior and regularize the model based on it, is define the reference image as x, all transformations of the same image as positive samples, and all other objects in the dataset, or in the batch, as negative samples. These positive and negative samples give us a way to calculate the loss function. How can we do that?
[01:01:03] Assume we have a scoring function. We want a scoring function such that the score between the encoded features of the reference image and the features of a positive sample is larger than the score between the reference image and the negative samples. With this type of scoring function, we can define a loss function based on it; the scoring function s here is the same as the score in the previous slide.
[01:01:50] So if we have that scoring function, then in order to attract and repel we can use a softmax setup: the exp turns those scores into probability values, and in the denominator you see all of the negative samples being considered. To implement this in practice, we use mini-batch training: all of the samples that belong to other objects in the batch are taken as negative samples, and one of the transformations of the same image is the positive sample. So we define the loss function like this: the score for the positive pair over the scores for all of the negative pairs. And this function is very similar to something we've discussed before. Any ideas? This is the cross-entropy for multiple classes.
[01:03:01] In this case we have N samples. The softmax, if you have multiple classes, say 10 classes at the output, wants to maximize the score for one of those 10 and minimize it for the rest. It's the same story here: we want to maximize the score between the reference and the positive, and minimize the score between the reference and the negatives. So it's the same concept we discussed for multiclass classification, put into the formulation of a contrastive loss for contrastive learning.
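A minimal NumPy sketch of this loss for a single reference sample, using cosine similarity as the scoring function (an assumption; the lecture leaves s abstract) and a temperature value that is an arbitrary choice here:

```python
import numpy as np

def info_nce(z_ref, z_pos, z_negs, temperature=0.1):
    """Contrastive loss for one reference embedding.

    z_ref: (d,) reference; z_pos: (d,) positive; z_negs: (k, d)
    negatives. The loss is cross-entropy with the positive treated
    as the correct "class", exactly the analogy drawn above.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = np.array([cos(z_ref, z_pos)] +
                      [cos(z_ref, zn) for zn in z_negs]) / temperature
    scores = scores - scores.max()                  # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum())
    return -log_softmax[0]                          # -log p(positive)
```

The loss goes to zero only when the positive's score dominates every negative's, which is the attract-and-repel behavior in one formula.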
[01:03:44] This function is called InfoNCE, the information noise-contrastive estimation loss, which was proposed in this paper, and there are a lot of theoretical discussions in the paper showing that this objective is a lower bound on mutual information. Mutual information between two images basically measures the dependencies, the shared information, between them. What we want is to maximize the shared information between x and x⁺ while minimizing the shared information between x and the x⁻'s. Going through the paper's argument would itself take half an hour.
[01:04:43] So you should definitely take a look at the paper if you're interested. The result is that the negative of this InfoNCE loss is a lower bound on the mutual information between x and x⁺. If its negative is a lower bound on the mutual information, then by minimizing InfoNCE I am maximizing that lower bound, pushing up the mutual information between x and x⁺, which is exactly what I want. That's why we take this as the loss function and minimize it. There is another theoretical result in the InfoNCE paper: the larger the number of negative samples, the tighter the bound. That's why, to train a neural network with this type of loss function, we need a huge batch size.
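For reference, the bound discussed here (from the InfoNCE paper, van den Oord et al. 2018) is usually written as below, where $N$ counts the samples the positive is contrasted against. This is the standard statement quoted from memory, so check the paper for the exact conditions:

$$ I(x;\, x^{+}) \;\ge\; \log N \;-\; \mathcal{L}_{\mathrm{InfoNCE}} $$

This makes both remarks above precise: minimizing the loss raises the lower bound, and increasing $N$ increases the $\log N$ term, which is the batch-size effect just mentioned.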
[01:05:51] If we compute a larger number of negative samples, we'll get better and much faster training convergence. This loss function was then used in a number of different frameworks, and in the next few minutes I'm just going to tell you what those frameworks are. For example, SimCLR is a simple framework for contrastive learning. It basically takes each image, applies two transformations of the same image, transfers them into the representation space, and calculates the cosine similarity between the embeddings, the representations. But before doing that, it does a linear or nonlinear projection into a set of features Z, and it calculates the distance between those in this space.
[01:06:55] And this is the way that they generate the positive samples; for generating positive samples, all sorts of transformations would make sense. The details are basically covered here: generate a positive pair by sampling data augmentation functions. So we sample a few of those, then we calculate the InfoNCE loss on the pairs, and this is what we iterate, because each of the N samples gives us two augmentations, so 2N samples in total. So what happens is we take the list of images in the mini-batch and pass them through the encoder for both variations of the same image. So each of the images will basically have a transformed version of it next to it, and we now have 2N samples in the batch.
[01:08:06] And then this means that for each of the samples, the one next to it is the positive sample and everything else is negative. So for the first one, the second image is positive and everything else is negative; for the second one, the first is positive and everything else is negative. And this repeats for all of the samples there. So this is a high-level definition of SimCLR. Please note that in assignment three we have a question related to SimCLR, where you will be exploring this framework a little bit more, but be careful: the definition there is slightly different from the standard definition that I presented here, so make sure you follow the instructions in the assignment. So SimCLR was actually very successful: without the use of labels, just by training a linear classifier on top of the features.
[01:09:13] It was able to surpass all of the previous works and even generate results comparable to the fully supervised learning frameworks. Although we need a larger neural network, because now we are learning more generic features, in terms of accuracy it was comparable to what we had for supervised learning. So the interesting thing with SimCLR, and some of the results around it, was that there are a few choices; actually, let me spend time on the main choices. You may have this question of why we projected the features into a new variable instead of using the same representations.
[01:10:09] So this was a design choice that they made in SimCLR, because they assumed, rightly so, that when we have an objective function that does this contrast between samples, you often lose some extra information that does not help with the contrastive learning framework. Right? So in order to preserve all of those extra features, the representations are defined as h, but then there is this linear or nonlinear projection (in their paper they use a nonlinear projection) to get the z values that they can calculate the InfoNCE loss on. So that's one important design choice, and the other one is what I mentioned earlier: large batch sizes. You need huge batch sizes, larger batch sizes, to be able to get better SimCLR performance, and we talked about how and why this is the case.
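That design choice can be sketched in a few lines. Shapes and weights here are made up for illustration; the point is that the encoder output h is what you keep for downstream tasks, while a small SimCLR-style MLP head g maps h to z, and the contrastive loss is computed on z only:

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_z = 32, 16                        # sizes are illustrative
W1 = rng.normal(scale=0.1, size=(d_h, d_h))
W2 = rng.normal(scale=0.1, size=(d_h, d_z))

def projection_head(h):
    """g(h): 2-layer MLP with ReLU, L2-normalized so that cosine similarity
    becomes a plain dot product. Only z = g(h) enters the InfoNCE loss."""
    z = np.maximum(h @ W1, 0.0) @ W2
    return z / np.linalg.norm(z)

h = rng.normal(size=d_h)                 # encoder output for one augmented view
z = projection_head(h)
# After pretraining, g is discarded; the linear classifier is trained on h,
# which keeps information the contrastive objective would strip out of z.
```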
[01:11:11] But we can't always do large batch sizes for many of the tasks that we have at hand, because of constraints in memory and so on. And that was why a number of follow-ups were proposed, for example MoCo, momentum contrastive learning. Instead of using all of the negative samples in the batch, what it does is create a queue, and it keeps a history of the negative samples across batches over time in the model. So it doesn't only depend on the negative samples in the batch; it has a separate queue that keeps a number of negative samples and updates it over time, to compute the InfoNCE loss, the contrastive loss, here. But because we have this queue, we cannot backpropagate, because those samples are not in the batch anymore. Right? So we cannot backpropagate for the negative samples.
[01:12:28] And that's why it had to separate the encoder for the positive samples, which are now called the query, and the negative samples, which are now called the keys, in this architecture. So the training only updates the query encoder, and over time the query encoder, using a momentum m, updates the key encoder, the momentum encoder. Right? So this is a framework that has actually been very successful in terms of implementation and follow-up versions, and there are a lot of interesting results. What they then did was basically try hybrid versions: using the nonlinear projection heads and data augmentation from SimCLR, and using this decoupling of the mini-batch and negative samples from MoCo. And they've shown that if you do this together, in MoCo version 2, it improves the performance by a lot.
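A toy sketch of the MoCo bookkeeping described above. The dimensions, queue size, and the linear "encoders" are stand-ins (the real method uses full networks and a far larger queue); what it shows is that keys come from the momentum encoder and go into a FIFO queue, while only the query encoder gets gradient updates:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(2)
d, K = 8, 32                        # embedding dim and queue size (toy values)
queue = deque(maxlen=K)             # FIFO dictionary of negative keys

theta_q = rng.normal(size=(d, d))   # query encoder params (trained by backprop)
theta_k = theta_q.copy()            # key (momentum) encoder params, no gradients
m = 0.999                           # momentum coefficient

for step in range(4):
    batch = rng.normal(size=(8, d))             # current mini-batch
    keys = batch @ theta_k                      # encode keys with momentum encoder
    queue.extend(keys)                          # enqueue; deque evicts the oldest
    # ... InfoNCE of (batch @ theta_q) against `queue` would be computed here ...
    theta_q += 0.01 * rng.normal(size=(d, d))   # stand-in for a gradient step
    theta_k = m * theta_k + (1 - m) * theta_q   # momentum update of key encoder
```

Because the queue decouples the number of negatives from the batch size, the batch can stay small while the dictionary of negatives stays large.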
[01:13:45] So I will stop here, but there were some notions of CPC, contrastive predictive coding, as another example that you can look at in the slides, and then a better version of MoCo, MoCo version 3. DINO is also one of the widely used frameworks, which actually has a similar type of architecture to MoCo, but it's not necessarily contrastive learning, because now we have student and teacher networks. So I'll leave that for a separate discussion, and if you're interested we can maybe discuss it in future lectures. But anyway, this is also one of the widely used frameworks for extracting features from images, and sometimes videos as well.

================================================================================
LECTURE 013
================================================================================
Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 13: Generative Models 1
Source: https://www.youtube.com/watch?v=zbHXQRUNlH0
---
Transcript

[00:00:05] Welcome back to CS231N, lecture 13.
[00:00:10] Today we're going to talk about generative models. Last time we were talking about self-supervised learning, which is this really interesting paradigm where we want to somehow learn structure directly from data, with no supervision, with no labels. And the typical formulation of self-supervised learning, for which we talked about a bunch of examples last time, is that you have your big dataset with no labels. Ideally it's just images; this is great, you can get a lot of images. You're going to feed these through some kind of encoder that's going to extract a feature representation from your images, and then go through some decoder that will predict something from that feature representation.
[00:00:44] And the whole trick in self-supervised learning is coming up with some kind of pretext task that you can train this whole system on without requiring any kind of human annotation or human labels. So we talked about things like rotation, different kinds of tasks that we can use as pretexts to formulate these self-supervised learning objectives. And then typically this is a two-stage procedure, where first you're going to learn this self-supervised encoder-decoder on your self-supervised task, on all the data that you can find. After that, you're going to throw away the decoder, slot in some new, possibly tiny, fully connected network, and actually train this thing, maybe end to end, or maybe just learn the fully connected network at the end, on some small labeled task.
[00:01:25] And the idea here is that via self-supervised learning, this pretext task, you can train on lots and lots of data, millions, hundreds of millions, billions of samples, where we don't have access to high-quality human labels. In the process of self-supervised learning, it's going to learn something about the general structure of images, or of data, and then you can transfer that knowledge to downstream tasks where you have small amounts of human labels. So the typical setup you should keep in your mind, that we want to work towards in self-supervised learning, is that you're going to train on like a billion unlabeled images that we're getting from the internet somewhere.
[00:01:55] And then we're going to transfer those features to tasks where we're willing to sit down and label maybe tens, hundreds, maybe thousands of examples for particular tasks that we really care about. But we want those tasks to be improved by this generic knowledge that we've learned through this self-supervised pretext task. And we talked about a couple of different kinds of pretext tasks last time, including rotation, rearrangement, and reconstruction. All of these basically have this sense that you're making some geometric perturbation, a geometric disturbance, to the input pixels, and then you're asking the model to somehow recover from that perturbation. So in the case of rotation, maybe you rotate the image and ask the model to predict how much it was rotated.
[00:02:35] In the case of rearrangement, or solving jigsaw puzzles, you're going to cut the image up into patches and ask the model to try to predict the relative arrangement of those patches in the original image. Or in reconstruction, maybe you're going to delete some parts of the input image and then ask the model to fill them in, as some kind of inpainting or reconstruction task. And these are fairly successful. We also talked last time about a different formulation of self-supervised learning called contrastive learning, which has been very successful. And here I was told that you ran out of time a little bit to cover a couple of these later methods, so I wanted to just go over those really quickly at the beginning of today's lecture instead.
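The rotation pretext just described fits in a few lines, since the label comes for free from the transformation itself. Array shapes and names here are illustrative:

```python
import numpy as np

def make_rotation_example(img, rng):
    """Rotate by a random multiple of 90 degrees; the rotation index k in
    {0, 1, 2, 3} is the classification label -- no human annotation needed."""
    k = int(rng.integers(0, 4))
    return np.rot90(img, k), k

rng = np.random.default_rng(3)
img = rng.normal(size=(32, 32, 3))      # a stand-in RGB image
rotated, label = make_rotation_example(img, rng)
# A model trained to predict `label` from `rotated` has to pick up on object
# orientation, which is the free learning signal this pretext task exploits.
```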
[00:03:09] So really the idea of contrastive learning is that you're going to get pairs that are similar and pairs that are dissimilar, and you want to pull the similar pairs together and push the dissimilar pairs apart. And the way that you usually do this in the context of self-supervised learning is that you're going to start with your input images, and again these are unlabeled images, you don't have labels for them. Now, for each input image, you're going to apply two random transformations. So in the case of the cat, we sort of took one crop around the cat's face and another crop around the backside of the cat, and for the monkey, we took one around the monkey's face and also dropped it to black and white, etc.
[00:03:46] So basically, for each one of your input images, you're going to apply two (possibly more than two, but two is a nice minimal subset) random perturbations to your input image. Now you're going to feed all of those randomly perturbed versions of your input data to some kind of feature extractor, which could be a ViT, could be a CNN, any kind of neural network that can input an image and output a feature representation. Then you want to apply this notion of contrast: for each of the two augmentations that came from the cat, we want those two feature vectors to be the same, so we color them green. So basically you compute this big similarity matrix, well, I guess it's (2N) squared, so 4N² entries, if you have N images and you put two perturbations on each.
[00:04:32] So we have a giant 2N-by-2N matrix for all these perturbed, augmented samples that we got. And now basically we want to pull together the two augmentations that came from the same original image, and for every pair of augmentations that came from different original images, we want to push them apart. So you run all of these things through your feature extractor, compute this giant 4N²-entry matrix of all of your scalar similarities between those feature vectors, and then pull together the ones that are similar and push apart the ones that ought to be different. And that's the basic idea of contrastive learning.
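A compact NumPy sketch of that 2N-by-2N computation, in the normalized-temperature cross-entropy form SimCLR uses. It assumes views are stacked so that rows i and i+N are the two augmentations of image i; names, temperature, and toy data are mine:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """Contrastive loss over 2N L2-normalized embeddings: each row's 'correct
    class' is its partner augmentation; the diagonal (self) is masked out."""
    two_n = z.shape[0]
    n = two_n // 2
    sim = z @ z.T / tau                                 # (2N, 2N) similarities
    np.fill_diagonal(sim, -np.inf)                      # never contrast with self
    partner = np.concatenate([np.arange(n, two_n), np.arange(n)])
    log_prob = sim - sim.max(axis=1, keepdims=True)
    log_prob -= np.log(np.exp(log_prob).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(two_n), partner].mean()  # cross-entropy vs partner

rng = np.random.default_rng(4)
x = rng.normal(size=(4, 8))                             # 4 "images"
unit = lambda a: a / np.linalg.norm(a, axis=1, keepdims=True)
z1 = unit(x + 0.1 * rng.normal(size=x.shape))           # first augmented view
z2 = unit(x + 0.1 * rng.normal(size=x.shape))           # second augmented view
loss = nt_xent(np.vstack([z1, z2]))
```

Each row of `sim` acts as one softmax classification problem: pull the partner's similarity up, push the other 2N-2 entries down.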
[00:05:15] And one paper that really pulled all this together a couple of years ago was called SimCLR, which applied this very successfully to self-supervised representation learning on images; that's the one I think he walked through last time. But one kind of problem with the SimCLR setup is that it requires a fairly large batch size to actually get good convergence, because otherwise it's sort of too easy a problem for the network: if there aren't that many samples, it's too easy to pick out the two cat ones that looked similar. So to make the problem hard enough for the network, to give it a good enough learning signal, you tend to need quite a large batch size in order to get this model to converge to good features.
[00:05:48] And then once you do that, you need to rope in all the ideas around large-scale distributed training that we talked about a couple of lectures ago, which is totally feasible; it totally works. But you might ask: is there some way you can get away without that? And that leads to a couple of approaches that I don't want to go into in too much detail. I actually don't want to walk through these and tell you exactly how they work; I just want to make you aware of their existence and give you the general flavor of what they're trying to achieve. So in this MoCo, or momentum contrast, approach to self-supervised learning, the setup is very similar to what we just saw in SimCLR: you're taking data, you're getting augmented pairs, you run them through a feature encoder.
You want to pull together the ones that are similar, push [00:06:24] together the ones that are similar, push apart the ones that are dissimilar. But [00:06:26] apart the ones that are dissimilar. But the thing that with the thing that [00:06:27] the thing that with the thing that differs is that we want to get away with [00:06:29] differs is that we want to get away with not having to have a gigantic batch size [00:06:31] not having to have a gigantic batch size at every iteration. So to do that they [00:06:33] at every iteration. So to do that they keep a queue of um negatives. They keep [00:06:36] keep a queue of um negatives. They keep a queue of samples from previous [00:06:39] a queue of samples from previous iterations of training. Um and then at [00:06:41] iterations of training. Um and then at every training iteration I've got my my [00:06:43] every training iteration I've got my my X query is my current new batch of data. [00:06:46] X query is my current new batch of data. And I have this this Q um X0 X1 X2 key [00:06:49] And I have this this Q um X0 X1 X2 key which are previous batches of data that [00:06:51] which are previous batches of data that I've seen on previous iterations of [00:06:53] I've seen on previous iterations of training. Now, my current batch of data [00:06:56] training. 
Now, my current batch of data I'm going to run through my encoder [00:06:57] I'm going to run through my encoder network the same as I always did um and [00:06:59] network the same as I always did um and compute these sort of compute the [00:07:01] compute these sort of compute the contrast of loss the same way that we [00:07:02] contrast of loss the same way that we did with SIM clear the and then these uh [00:07:04] did with SIM clear the and then these uh these this larger Q these like previous [00:07:07] these this larger Q these like previous history of batches we're going to run [00:07:08] history of batches we're going to run through something different the momentum [00:07:10] through something different the momentum encoder um and then still get feature [00:07:12] encoder um and then still get feature representations and compute the same [00:07:14] representations and compute the same kind of similarity that we did through [00:07:16] kind of similarity that we did through the through through the SIM clear uh [00:07:17] the through through the SIM clear uh thing but the problem is that we don't [00:07:19] thing but the problem is that we don't want to back propagate into the momentum [00:07:21] want to back propagate into the momentum encoder because it has too much data it [00:07:23] encoder because it has too much data it too big of a batch. We can't afford to [00:07:25] too big of a batch. We can't afford to fit that in GPU memory. Um, so we want [00:07:27] fit that in GPU memory. Um, so we want to not have to back propagate through [00:07:28] to not have to back propagate through that part. So instead, so that means [00:07:30] that part. So instead, so that means that we're not we cannot upgrade update [00:07:32] that we're not we cannot upgrade update this momentum encoding encoder, this [00:07:34] this momentum encoding encoder, this second encoder via gradient descent [00:07:36] second encoder via gradient descent descent. 
Instead, we're going to do [00:07:37] descent. Instead, we're going to do something kind of wacky. What we're [00:07:39] something kind of wacky. What we're going to do is have this momentum [00:07:40] going to do is have this momentum encoder have its own set of weights. [00:07:42] encoder have its own set of weights. We're going to learn them not via [00:07:43] We're going to learn them not via gradient descent. Instead, what we're [00:07:45] gradient descent. Instead, what we're going to do is have the momentum encoder [00:07:46] going to do is have the momentum encoder be a exponential moving average of the [00:07:48] be a exponential moving average of the weights of the normal encoder. So the [00:07:50] weights of the normal encoder. So the normal encoder, we're going to learn via [00:07:52] normal encoder, we're going to learn via gradient descent. everything is normal. [00:07:53] gradient descent. everything is normal. Um, we'll forward prop, we'll back prop, [00:07:55] Um, we'll forward prop, we'll back prop, we'll get gradients, we'll make a [00:07:56] we'll get gradients, we'll make a gradient update step on the in on the on [00:07:58] gradient update step on the in on the on the typical encoder. That's the normal [00:08:00] the typical encoder. That's the normal thing. But then after we do that, the [00:08:02] thing. But then after we do that, the momentum encoder, we're going to take [00:08:04] momentum encoder, we're going to take we're going to decay the encoder [00:08:05] we're going to decay the encoder weights. We're going to decay the [00:08:07] weights. We're going to decay the current momentum encoder weights by [00:08:09] current momentum encoder weights by like.99 um, and then add in 1% of the [00:08:12] like.99 um, and then add in 1% of the encoder weights. So then the momentum [00:08:14] encoder weights. 
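That update rule is tiny in code. A minimal sketch, where the 0.99 momentum value and the function name `ema_update` are illustrative:

```python
import numpy as np

def ema_update(momentum_weights, encoder_weights, m=0.99):
    """MoCo-style momentum encoder update: no gradients flow here.

    After each gradient step on the regular encoder, the momentum
    encoder's weights are nudged toward it:
        theta_momentum <- m * theta_momentum + (1 - m) * theta_encoder
    """
    return m * momentum_weights + (1.0 - m) * encoder_weights
```

Applied after every gradient step, the momentum encoder becomes a slowly trailing average of the regular encoder's parameter history.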
[00:08:15] So the momentum encoder has this other update rule, where it's a lagging, trailing exponential moving average of the encoder weights. I don't have a great intuition or explanation for why this exactly makes sense, but there's very strong empirical evidence that it works. So that's kind of the state of things, and it's nice because it means you can now get away with learning these self-supervised representations without having to have this gigantic batch of negatives at every iteration. This was fairly successful, and there were a bunch of follow-up papers that pushed this direction. Another one that you should be aware of is called DINO. Again, the idea is very similar. It uses a similar sort of momentum encoder: this dual setup of a normal encoder learned via gradient descent and a momentum encoder, just as in MoCo. But the loss is a little bit different: instead of using softmax, they use some kind of KL divergence loss. The reason I'm mentioning this one is that you should be aware of the existence of DINOv2, even if we don't talk about exactly what it does, because DINOv2 is a really strong model for self-supervised features that's used quite a lot in practice these days. What they basically did is take the recipe from DINOv1, which was kind of similar to MoCo and has a lot of ideas from SimCLR as well, along with a lot of unique details of their own. The big difference in DINOv2 is that they scaled up the training data quite a lot.
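For flavor, here is what a KL divergence between two discrete output distributions looks like. This is just the generic quantity, not DINO's exact loss, which involves centering, sharpening, and other details not shown here.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions p and q (each sums to 1).

    This is the general flavor of objective DINO-style methods use to
    pull the student's output distribution toward the teacher's.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

It is zero when the two distributions match and grows as they diverge, which is what makes it usable as a training signal between the two encoders' outputs.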
[00:09:32] A lot of these previous self-supervised approaches had been trained on the ImageNet dataset, which was 1 million images. DINOv2 was able to successfully scale this approach up to a much larger training set of about 142 million images. You know, in deep learning we like bigger networks, bigger data, more GPUs, more flops, all of those things. DINOv2 found a recipe for self-supervised learning that successfully scaled up to this much larger dataset and gives very strong self-supervised features, and it tends to be used quite a lot in practice today if you want to pick up features and then fine-tune them or supervise them for some of your own downstream tasks. So again, I don't want to walk through all the details of how this works.
[00:10:13] I don't expect you to know how it works, but I want you to know that it exists in case you want to pick it up and use it for some of your own projects in the future. So that's basically all I had to say about self-supervised learning. Any questions about that before we move on to the meat of today's lecture? [00:10:33] Okay, guess not. So today the main topic is generative models. This is really cool. This is an area of deep learning that basically went from not working at all 10 years ago to really, really working in the last couple of years. It has given rise to things like language models, which, as we'll see, can be viewed as generative models, and to all kinds of image generation models and video generation models. These really went from just absolutely not working at all when I was in grad school.
[00:11:02] You would look at these samples and peer into them, and they just looked like low-resolution, complete blurry garbage, but somehow you could see some promise in them. And I'm glad that people kept pushing on that, pushed through the blurry garbage, and scaled it up over the past decade, because now a lot of these techniques really do work, and that's very exciting. So this is an area of deep learning that basically didn't work at all the first time we taught this class, and it's really cool that it now does. But that said, a lot of the fundamental ideas around generative modeling actually remain the same: the ideas about how you think about data, and what the approaches for modeling it are. A lot of those mathematical fundamentals have not changed that much in the past decade. What changed is more compute, more stable training recipes, bigger datasets, distributed training, and the ability to scale all this up to more useful tasks. I think that is really what drove the progress over the past decade. There were some algorithmic tweaks too, and we'll see that especially next lecture when we talk about diffusion models. [00:12:00] But first, before we talk about generative modeling, I wanted to step back a little bit and talk about supervised versus unsupervised learning, right? Because there are a couple of different tasks that we try to approach in deep learning.
[00:12:14] And they can sometimes be sliced along a couple of different orthogonal axes. So I wanted to talk about those a little bit, just so we get our terminology and our nomenclature clear. Supervised learning is what we've mostly been doing all semester, except for last lecture. In supervised learning, we have a dataset of pairs X and Y, and the goal is to learn some function that maps from the input data X to the target or label Y. We've seen a lot of examples of this kind of approach so far. Something like image classification: the input X is an image, the output Y is a label. Or image captioning: the input X is an image, the output Y is some piece of text describing what we see in that image. Object detection: the input is an image, the output is a set of boxes and category labels describing the objects that appear in the image. Or segmentation: maybe you assign a label to every pixel in the input image. These are supervised learning problems because the task you're trying to solve, the thing you want to predict, is exactly what you have in your dataset. All you need to do, in some sense, is learn a function that mimics that X-to-Y mapping on your training dataset and then generalizes that mapping to new samples beyond your training set. Now, unsupervised learning is something a bit more fishy and mysterious and hard to describe. But the idea of unsupervised learning, or sometimes self-supervised learning, is that you don't have any labels. You just have data, just samples X.
[00:13:36] You just have images, and you want to learn some kind of structure from that data. There's no particular task you're necessarily targeting; you're just trying to uncover good representations, good structure, in all of that data. Why? So that, as we talked about in self-supervised learning, you can apply it to downstream tasks later on. But the task itself in unsupervised learning is often somewhat unspecified. Some examples of this are k-means clustering, where maybe we're trying to identify clusters in the data, which is some kind of structure we can extract from the raw pixels even though we didn't have labels. Or dimensionality reduction, PCA, where we're trying to uncover some lower-dimensional subspace or lower-dimensional manifold that explains the structure of our data.
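As a tiny illustration of uncovering a subspace without any labels, here is PCA via the SVD in NumPy; a sketch, not a library-grade implementation:

```python
import numpy as np

def pca(X, k):
    """Project data onto its top-k principal components.

    X: (n_samples, n_features). Returns the k-dimensional projection.
    No labels anywhere: the structure comes from the data itself.
    """
    Xc = X - X.mean(axis=0)                         # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                            # coordinates in the top-k subspace
```

If the data lies near a line or low-dimensional plane, a handful of components captures almost all of the variance, which is exactly the kind of hidden structure unsupervised methods try to surface.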
[00:14:18] Again, this is something we're trying to discover from the data itself; we don't have annotations of what it ought to be. Or density estimation: maybe we're trying to fit a probability distribution to the data, to understand what probabilistic function gave rise to the data samples we're seeing. And again, we don't have explicit labels or an explicit training set for this. So it's some kind of hidden or latent structure that we're trying to uncover through the process of training. This supervised versus unsupervised dichotomy is something you should always keep in mind. And you can do unsupervised learning that is not probabilistic or not necessarily generative. Something like clustering or PCA often has probabilistic interpretations, but these are examples of unsupervised learning that don't necessarily have a generative or probabilistic interpretation, or don't have to be thought of as such. So I often like to think about the supervised-unsupervised dichotomy as one spectrum along which methods or systems can lie. A separate spectrum along which we can classify systems or tasks is that of generative versus discriminative models, and these are inherently probabilistic. When we talk about generative or discriminative models, we're always imagining some kind of probabilistic structure in our data that we're trying to uncover or learn from. The difference is exactly what the probabilistic relationship is between the variables we're trying to model. So in a discriminative model, typically we have some Y and some X.
Um, and usually we think of the X as [00:15:37] x. Um, and usually we think of the X as something large, highdimensional, [00:15:38] something large, highdimensional, usually an image in our case, and the Y [00:15:40] usually an image in our case, and the Y is some kind of label or description or [00:15:42] is some kind of label or description or auxiliary information. Um, and so that [00:15:45] auxiliary information. Um, and so that would be like your text, like your [00:15:46] would be like your text, like your caption, like a category label, [00:15:48] caption, like a category label, something like that. Um, and when you do [00:15:50] something like that. Um, and when you do when you talk about a discriminative [00:15:52] when you talk about a discriminative model, we're trying to learn a [00:15:53] model, we're trying to learn a probability distribution of Y given X. [00:15:56] probability distribution of Y given X. So, we're trying to learn a distribution [00:15:58] So, we're trying to learn a distribution over labels um conditioned on our input [00:16:00] over labels um conditioned on our input image X. Um and to understand to really [00:16:04] image X. 
Um and to understand to really appreciate you know what's going on [00:16:06] appreciate you know what's going on probabilistically you need to remember [00:16:07] probabilistically you need to remember one very important feature of [00:16:09] one very important feature of probability distributions and that's [00:16:10] probability distributions and that's that they are normalized right when you [00:16:12] that they are normalized right when you talk about a probability distribution or [00:16:14] talk about a probability distribution or more generally a density function p of x [00:16:16] more generally a density function p of x um p of x is basically a function that [00:16:18] um p of x is basically a function that applies that that um that that uh that [00:16:21] applies that that um that that uh that assigns a nonzero number um to every [00:16:24] assigns a nonzero number um to every every possible input x with the very [00:16:26] every possible input x with the very important normalization constraint um [00:16:28] important normalization constraint um that if you integrate over the entire [00:16:30] that if you integrate over the entire space of all possible x's It integrates [00:16:32] space of all possible x's It integrates for it integrates to one, right? And [00:16:34] for it integrates to one, right? And this normalization constraint really [00:16:35] this normalization constraint really gives rise to the the power of [00:16:37] gives rise to the the power of probabistic models in some sense because [00:16:39] probabistic models in some sense because the normalization constraint means that [00:16:41] the normalization constraint means that all of your x's need to compete for [00:16:43] all of your x's need to compete for probability mass. There's a fixed unit [00:16:45] probability mass. There's a fixed unit amount of probability mass. Um, and the [00:16:48] amount of probability mass. 
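The normalization constraint is easy to see in code with a softmax, which turns arbitrary scores into a distribution summing to one; pushing one score up necessarily steals mass from the others. A minimal sketch with made-up scores:

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary real-valued scores into a normalized distribution."""
    e = np.exp(scores - np.max(scores))   # shift by the max for numerical stability
    return e / e.sum()

# three candidate outcomes competing for a fixed unit of probability mass
p = softmax(np.array([2.0, 1.0, 0.0]))
# raise the first score: its probability goes up, and the others' must go down
p2 = softmax(np.array([5.0, 1.0, 0.0]))
```

Both `p` and `p2` sum to exactly one; the only way the first entry gains mass is by the other entries losing it.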
[00:16:49] Choosing a probability distribution or density function basically amounts to apportioning out that fixed amount of probability mass, smearing it across all possible values of x that could exist. And all of those x's are in competition, because there's only a fixed unit amount of mass to go around. So if you push up the probability of one x, necessarily the probabilities, or densities, of the other x's have to go down. In these different formulations of probabilistic models, what changes is which variables are competing for probability mass. That means that even though the symbols we write on the page look very similar, the different competitions over probability mass induce very different structure that the model is trying to learn or uncover. In the case of a discriminative model, we're learning a probabilistic model of y conditioned on x, which means that for every x, our model is predicting a probability distribution over all possible labels. So if our labels are discrete and categorical, like cat and dog, then we have a fixed amount of probability, zero to one, and cat and dog must sum to one, and we have a separate probability distribution over the labels for every input x. Crucially, notice that there is no competition among images for probability mass, because every image induces its own distribution over the label space. There's no competition for mass across different images; the only things competing for mass are the different labels for each image. That's very important when you think about discriminative modeling. One other interesting facet of discriminative models is that they have no real way to reject unreasonable inputs.
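To make that concrete: a row-wise softmax gives each input its own independently normalized label distribution, so there is no competition across inputs, and even an out-of-vocabulary input is forced onto the fixed label set. A sketch with made-up logits:

```python
import numpy as np

def row_softmax(logits):
    """Per-input label distributions: each row normalizes independently."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# 3 inputs, 2 labels (cat, dog): each input gets its own distribution
logits = np.array([[3.0, 1.0],    # confident cat
                   [0.5, 2.5],    # confident dog
                   [0.0, 0.0]])   # a monkey photo: still forced onto {cat, dog}
probs = row_softmax(logits)
```

Each row sums to one on its own, and the third row shows the shortcoming discussed above: the model has no way to say "neither", only to spread mass over the vocabulary it was given.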
[00:18:32] So once we've fixed our label space of, say, cat and dog in this example, if we feed in something that's not a cat or a dog at all, like a monkey or a piece of abstract art, the system has no flexibility. It has no freedom to say "this is unreasonable"; it's forced to output a distribution over the fixed vocabulary that we assigned at the beginning. That could be seen as a shortcoming, but it's just important to understand exactly what is happening under the hood when you think about modeling different kinds of data probabilistically. Now, a generative model is something very different. Instead, what we're doing in a generative model is learning a distribution P(X): we want to learn a distribution over all possible images X. And now this is very interesting.
[00:19:09] This means that all possible images that could ever exist in the universe are now competing with each other for probability mass. And this is a really hard problem. It sounds kind of simple on its face, but it requires you to confront some very deep and philosophical questions about the world, right? Because now all images are competing for probability mass, and in order to model that, you're forced to answer questions like: how much probability mass should an image of a three-legged dog get relative to an image of a three-armed monkey? Probably the three-legged dog should get more probability mass, because that can happen by a dog losing a leg. But how are you going to get a three-armed monkey? I don't know.
[00:19:50] That seems much more rare, unless you're modeling sci-fi images or something like that. So once you're in this regime of all possible images competing for probability mass, your model really needs to think very carefully about the kinds of structure that can exist in the data, and it becomes a much, much harder problem to solve. Another interesting thing here is that a generative model does have the capacity to say "no, this is not a reasonable image, this is not a reasonable input," and the way it can do that is by assigning low or even zero probability mass to any one image that it gets.
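A minimal sketch of that rejection mechanism, using a 1-D Gaussian as a stand-in for a learned density p(x) (the model, threshold, and numbers are all illustrative assumptions, not from the lecture):

```python
import math

def gaussian_log_density(x, mu=0.0, sigma=1.0):
    # log p(x) for a 1-D Gaussian, standing in for a learned model of p(x).
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def is_in_scope(x, log_threshold=-5.0):
    # The model "rejects" an input by assigning it very low density.
    return gaussian_log_density(x) > log_threshold

print(is_in_scope(0.3), is_in_scope(5.0))  # near the mode vs. far in the tail
```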
[00:20:22] So maybe we only want our generative model to be a generative model of zoo animals, and if we have a generative model of zoo animals, then if we feed in an image of abstract art, it should get zero probability mass. So now we have a mechanism for rejecting, for saying that this type of image is not within the scope of what we care about. And now a conditional generative model is even more interesting. This is where we're learning a conditional distribution over images x conditioned on some label y. This means that for every possible label, we're inducing a competition among all possible images.
[00:21:01] So in this case, if y is a categorical label of cat or dog, then for each possible categorical label, cat and dog, the model separately induces a competition among all possible images. So in the top distribution, maybe this is the probability of all these images conditioned on the cat label. Then obviously the cat image should be high. Maybe the monkey and dog images should be somewhat higher because they're at least still mammals, but the abstract art should be very low, maybe even zero. And then there's a different distribution over images if we're conditioning on the dog label. This gets even more interesting if you imagine that your conditioning signal Y is something much richer than a single categorical label. That conditioning signal Y might have been a text description.
[00:21:39] It might have been a whole paragraph of written text. It might have been another image plus a piece of text. Once you talk about modeling these very rich output spaces X conditioned on very rich input spaces Y, you're asking the model to solve a very complicated and quite ill-defined problem that requires very deep reasoning about the objects involved. So that's why I think generative modeling is such an interesting topic: it looks kind of simple. All we did was flip the X and the Y; how hard could it be? But all of a sudden, it required us to think really hard about what's going on in the visual world. What's also interesting is that we wrote down discriminative models, generative models, and conditional generative models as three separate categories of things. But actually they're all related.
[00:22:21] They're related through Bayes' rule, which is one of the most amazing relationships in probability. In particular, it says that if we have access to a discriminative model P(Y|X), an unconditional generative model P(X), and some prior distribution over our labels P(Y), we can compose those to build a conditional generative model P(X|Y). Or in general, you can always rearrange Bayes' rule so that if you have any two of these, you can always get the third, which is pretty cool. So in principle you can build a conditional generative model out of the other two components, although in practice this is not really how you do it.
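Here's a toy worked example of that composition (all the probability values are made up for illustration): given a discriminative model p(y|x), an unconditional generative model p(x), and the label prior p(y), Bayes' rule yields the conditional generative model p(x|y):

```python
# Toy discrete world: two "images" and two labels (values are made up).
images = ["cat_img", "dog_img"]
labels = ["cat", "dog"]

p_y_given_x = {"cat_img": {"cat": 0.9, "dog": 0.1},   # discriminative model
               "dog_img": {"cat": 0.2, "dog": 0.8}}
p_x = {"cat_img": 0.6, "dog_img": 0.4}                # unconditional generative model

# Label prior obtained by marginalizing: p(y) = sum_x p(y|x) * p(x).
p_y = {y: sum(p_y_given_x[x][y] * p_x[x] for x in images) for y in labels}

# Bayes' rule: p(x|y) = p(y|x) * p(x) / p(y).
p_x_given_y = {y: {x: p_y_given_x[x][y] * p_x[x] / p_y[y] for x in images}
               for y in labels}

# For each label, the images now compete for probability mass and sum to 1.
print(p_x_given_y)
```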
[00:23:06] You tend to learn conditional generative models from scratch on their own, although, as we'll talk about with diffusion, you do sometimes end up learning conditional and unconditional models jointly, for various reasons. But it's nice to keep in mind that there's a very deep relationship across these different flavors of probabilistic models. So then you might be wondering: okay, what can we do with these different flavors of probabilistic models? With discriminative models, this shouldn't require a lot of creativity; we've seen a lot of examples so far this quarter. With discriminative models, after you train them, you can assign labels to data. You can also do feature learning, right?
[00:23:40] In the case of, say, supervised learning on ImageNet, we've seen that in the process of trying to predict categorical labels of images, those models tend to learn useful feature representations in the middle that can be transferred to downstream tasks. So you tend to use discriminative models either for directly predicting the y's that you care about, or for learning the feature representations that are induced in the process of trying to predict those y's. As for generative models: these unconditional generative models I actually think are kind of useless in general, but what they do let you do is maybe detect outliers. They can look at images and ask: do they really have low probability mass? Are they unreasonable images? You can also sort of use them for feature learning without labels.
[00:24:21] The hope is that in the process of trying to fit an unconditional distribution P(X), the model learns some useful feature representations. Although in general these have not been super successful for self-supervised learning; typically the contrastive methods that we talked about in the previous lecture have in practice been much more successful for self-supervised learning than unconditional density estimation. Or, in principle, you could use this unconditional generative model to sample and produce new samples X. But I think this is actually kind of useless, because it gives you no control over what is being sampled, right? If you have an unconditional generative model of images, you can sample from it to get a new image, but you have no control over what's in that image.
[00:24:58] So I think it's mathematically interesting to think about how to build such models, but I don't think they have as much practical significance. Conditional generative models are where I think things are actually the most useful and the most interesting. These are the kinds of generative models that get trained and used in practice by far the most. You can in principle use them to assign labels while rejecting outliers, right? You could say: if I have a piece of data X, look at P(X|Y) over all of my possible Y's, and then reject if that's too low among all the possible Y's. So in principle, you could use conditional generative models to do some kind of classification while also maintaining the ability to reject outliers.
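A sketch of that classify-with-reject idea, assuming we already have hypothetical density values p(x|y) for one input (the helper name, the uniform-prior assumption, and the threshold are all illustrative, not from the lecture):

```python
def classify_or_reject(density_by_label, threshold=1e-3):
    # density_by_label: hypothetical values of p(x|y) for one input x,
    # keyed by label y. If x is implausible under *every* label, reject
    # it as an outlier; otherwise return the most likely label
    # (implicitly assuming a uniform prior over labels).
    best = max(density_by_label, key=density_by_label.get)
    if density_by_label[best] < threshold:
        return None  # outlier: low density under all labels
    return best

print(classify_or_reject({"cat": 0.02, "dog": 0.3}))   # plausible under "dog"
print(classify_or_reject({"cat": 1e-6, "dog": 5e-5}))  # implausible everywhere
```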
[00:25:38] Although I don't think that's really used too much in practice. What's really useful about conditional generative models, and what is used in practice all the time, everywhere, is sampling to generate new data from labels, where you actually get to control what is generated, right? Because if your Y is now maybe a piece of text, you can write down "I want to see a cat wearing a hot-dog-flavored t-shirt on the moon" or whatever, and then your favorite generative model of images will generate you a brand new image X conditioned on that label Y. So this is where I think all the juice is, where all the magic is, where all the excitement is. Although, somewhat confusingly, in the literature whenever you see the term "generative model," people kind of mush together unconditional and conditional generative modeling.
[00:26:20] And a lot of the papers that you read will even sometimes drop the conditioning signal. Why? Because it makes the math look cleaner; it makes the equations look cleaner. But I don't think unconditional generative modeling is super useful. It's almost always conditional generative modeling that you really want to do in most cases. So just be aware that when you read papers, see equations, or hear people talk about generative modeling, the one they probably care about training more is these conditional generative models, even if the equations or notation don't reflect that. [Student question:] So, for an unconditional generative model, what does this— Ah, so I didn't really tell you that, and I was being sneaky there, because how you parameterize that actually depends a lot.
[00:27:02] There are a lot of different formulations for all of these things, and what exactly the inputs and outputs of the network are will vary quite a lot depending on the formulation. We're going to talk about a whole taxonomy of those in a couple of slides. Okay, so why generative models? The main reason you want to build generative models is whenever there's some ambiguity in the task you're trying to model, right? The beauty of a probabilistic model P(X|Y) is that it's probabilistic: there might be a whole space of possible outputs X conditioned on that input label Y. Sometimes there's just a deterministic mapping, right? I look at an image and ask how many cats are in the image; there are just three cats, there's just one answer. But in a lot of cases it's more subtle.
[00:27:43] If I ask for a picture of a dog wearing a hot dog hat, there are a lot of different images that could exist based on that query. There's uncertainty in the output. And that's exactly what generative models are trying to model: they model a whole distribution of outputs conditioned on their input signal. So anytime there's ambiguity in the kind of output you want the model to produce conditioned on the input, that's when you want to turn to a generative model. We'll see a couple of examples of where this has gotten used a lot in the last couple of years. One example is language modeling; someone asked about ChatGPT a moment ago. In language modeling, what you're often trying to do is predict output text X from input text Y.
[00:28:22] Sorry, the X's and Y's ended up flipped in an awkward way on this example. But here's an example from ChatGPT, where the input is "write me a short rhyming poem about generative models." And wow, it actually works. This is crazy; this didn't work at all when we first taught this class. I'm not going to read it, that would be embarrassing; you can read it yourself. But this is a conditional generative model. You could imagine there are a lot of different possible rhyming poems about generative models that one might write, and we had to pick one of them. The beauty of a generative model is that, in principle, it models that whole distribution over possible outputs conditioned on that input. Or text-to-image.
[00:29:00] You know: make me an image showing a person teaching a class on generative models in front of a whiteboard. You're kind of looking at one example through your eyes; ChatGPT gave you a different example, right? There's a whole space of possible images that might match this input text, and a generative model allows you to model that whole space and sample from it depending on what you want. Or image-to-video: input an image and ask, what happens next? This was me holding my AirPods over a cardboard box. Maybe I'm going to drop them. Maybe I'm going to move my hand. Maybe I'm going to move my hand and the AirPods will morph into a different kind of AirPods. There are all kinds of things that could happen, and a generative model in principle lets you model and sample from these possible futures.
[00:29:40] So this is why we care about generative modeling: anytime there's ambiguity in the output, that's when you want to turn to a generative model to solve it. And someone asked what the inputs and outputs are. It turns out this is a huge field, and it's surprisingly one area of deep learning that is quite mathematical, because it requires thinking about the different ways to model probability distributions, and how we can write down loss functions that cause the right things to happen. So this is one area where, when you read papers, there may be a lot of math and a lot of equations, and you might actually need to think through those equations pretty carefully to understand what's going on.
[00:30:16] So this is one subfield that tends to have more math and more equations, which I think is kind of fun and interesting. There's a whole taxonomy of the different kinds of generative models that people build. On one hand, one part of the family tree is what we call explicit density methods. These are ones where the whole point of the model is to model p(x), or p(x | y), and with these explicit density methods you can actually compute that value p(x) for any sample x. The counterpoint is implicit density methods. These are ones where you can't actually get that density value p(x) out of the model, but you can somehow sample from that probability distribution.
[00:31:00] So the difference is that in an implicit model you can't access the value of the density function, but you can somehow sample from the underlying density; the model has implicitly learned to model the density even if you can't read the value out. On the explicit density side, it's almost the opposite: in many cases you can get that explicit density value out, but then sampling tends to be more complicated with these explicit density methods. Not always, but sometimes, right? And the reason you might turn to implicit models is that in many cases you may not actually care about knowing the exact density value for any input; maybe all you care about is generating good samples, and generating a good diversity of samples.
[00:31:40] So if the thing you really care about is sampling, then maybe you don't actually need to be able to read off the value of the density for any input. And then things break down and cascade and get more fractal-like from here. Inside explicit density methods, there are ones where you really can compute the real p(x) that's being modeled, and autoregressive models are one example of that. Another version of explicit density methods are ones where you can get a density value out, but it's not the real one; it's some kind of approximation to the true density of the data. Variational autoencoders are one example of an explicit but approximate generative method that we'll see. Now, on the other branch of the family tree, we can think about direct methods for implicit density.
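To make the explicit/implicit split concrete, here is a minimal toy sketch (my own example, not from the lecture): the explicit model is a closed-form 1-D Gaussian whose density we can evaluate at any point, while the implicit model is just a generator function g(z) that transforms noise into samples and never exposes a density value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Explicit density model: a 1-D standard Gaussian. We can both evaluate
# the density p(x) at any point AND draw samples from it.
mu, sigma = 0.0, 1.0

def explicit_density(x):
    # closed-form value of p(x) under N(mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Implicit model: a feed-forward "generator" g(z) that maps noise to
# samples. We can sample from it, but there is no density value to read out.
def implicit_sample(n):
    z = rng.standard_normal(n)   # latent noise
    return np.tanh(z) * 2.0      # some deterministic transform g(z)

samples = implicit_sample(10_000)
# We can estimate statistics of the implicit distribution from samples,
# but the model itself never exposes p(x).
print(explicit_density(0.0))     # the density peak, 1/sqrt(2*pi)
print(samples.mean())
```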
[00:32:24] These are ones where it requires a single network evaluation to draw a sample from the underlying distribution that's being modeled. A generative adversarial network is an example of a generative model in this part of the family tree. The other part, I don't know if it has a good name; I called it "indirect," but this is a name I made up yesterday, so please feel free to correct me if there's a better term for it. These indirect ones are ones where you can sample from the underlying density p(x) that's being modeled, but it requires some kind of iterative procedure. There's no feed-forward function where you can put in an input and get the sample directly out.
[00:32:59] There's some kind of iterative method that you need to run in order to draw a sample from the underlying density that's being modeled, and diffusion models are an example of this that we'll see next time. I told you a couple of slides ago that people are sloppy with notation and drop the y, and I did that on purpose on this slide, so that someone would ask me that question and you would always be attentive to that fact. So yes, exactly: every time I've written p(x) on this slide, and actually on all the rest of the slides this lecture, I have been lazy and dropped the y, but you should always imagine an additional condition on y in all of these p(x)'s that you see for the rest of the lecture. So thank you for asking that.
[00:33:33] So the question was: for the indirect method, can you just treat that indirect iterative procedure as a black box and then treat it as a direct sampling method? In principle yes, but in practice no, because your samples end up approximate. It depends on the exact method, but with diffusion models you would need to take an infinite number of steps in order to draw a true sample, so instead we approximate that with a finite number of steps. And that's true of other methods as well. Diffusion models are the most common example of this today, but some kind of Markov chain method, or MCMC method, in years past might have also had this property: there is an iterative procedure, but if you want to draw an exact sample from the distribution that's being modeled, you need an infinite number of steps to converge. So we always approximate that by taking a finite number of steps.
[00:34:19] Okay. And I was pretty proud of this taxonomy because it's very symmetric: there are four leaves and two branches, and we're going to cover half the tree today and half the tree next time. So I thought that was a pretty nice breakdown. The next question is: what's the difference between an approximate density method and directly sampling from an implicit p(x)? The difference is that even in an indirect but implicit method, there's no density value anywhere to be found; you can't compute one at all. But you can still iteratively sample in some way.
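As a toy illustration of iterative sampling, here is a generic Metropolis (MCMC) sampler. This is my own stand-in sketch, not the diffusion models covered next lecture: the chain only yields exact samples in the limit of infinitely many steps, so in practice we stop after a finite number and accept an approximate sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Iterative (Metropolis) sampling from an unnormalized density.
# The chain converges to the target only in the limit of infinitely many
# steps; in practice we run finitely many and accept approximate samples.
def unnormalized_target(x):
    return np.exp(-0.5 * (x - 3.0) ** 2)   # proportional to N(3, 1)

def metropolis_chain(n_steps=5000, step=1.0):
    x = 0.0
    chain = []
    for _ in range(n_steps):
        proposal = x + step * rng.standard_normal()
        # accept with probability min(1, p(proposal) / p(x))
        if rng.random() < unnormalized_target(proposal) / unnormalized_target(x):
            x = proposal
        chain.append(x)
    return np.array(chain)

chain = metropolis_chain()
print(chain[1000:].mean())   # approaches the target mean 3.0 as the chain grows
```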
[00:34:50] With an approximate density method, you can still get a density value out; you can actually get a density value that's going to be some approximation of, or bound on, the true p(x).
[00:35:00] Okay. So the first such generative model that we'll actually talk about in a bit more concrete specificity is autoregressive models. For autoregressive models, we're actually going to take a slight detour and talk about a really general idea behind all of generative modeling, and that's the idea of maximum likelihood estimation. Maximum likelihood estimation is a quite general procedure that we can use to fit probabilistic models given a finite set of samples. The idea is that we're going to write down some explicit function for the density; we said that some methods are going to explicitly model the density.
[00:35:34] Well, let's do it with a neural network. Let's write a neural network that's going to take in the data x and the weights w of the network, and spit out a number that tells us the density. Then, given a dataset of samples x^1, x^2, ..., x^n, we're going to train the model via this objective function: we want to find the weights that make the dataset most likely, because as we vary the weights, we vary the kind of densities being modeled by the network. So we want the network to select the density that maximizes the likelihood of the data. Note that we said likelihood rather than probability; that's a deep philosophical rabbit hole you can fall into. The difference is what we're varying.
[00:36:19] Right? If you think about probability, you imagine that the density is fixed and we're sliding x around, changing the probability of x under a fixed distribution. When you talk about likelihood, instead you're fixing the samples x and varying the distribution itself, and asking how the probability density of those samples changes as we vary different distributions. So you have to think very carefully in these equations about what's being fixed and what's varying. In this process of maximum likelihood estimation, what we're doing is varying the distribution that the neural network is modeling, to try to maximize the probability of the fixed set of samples from that distribution that we have in our training set.
[00:37:00] Right? And I guess the unsaid thing behind all of this is that we assume there is some underlying true probability distribution, p_data, which was used by the universe to generate the data that we are seeing. In some sense, what we always want to do is model that true underlying unknown distribution p_data, and we can never access p_data directly, because we don't have an omniscient view of exactly how the universe works. Instead, we get some samples from p_data that the universe has given to us, and what we're trying to do through our learning procedure is uncover that unknown distribution p_data given a finite number of samples from it. So one procedure you can follow is: select the distribution that makes the data I actually saw most likely. And that's the objective; that's the maximum likelihood objective function.
[00:37:45] Right. And then there's a standard trick that we do here: we assume that the data was i.i.d., independent and identically distributed. So we assume that each of those x's was drawn from that true p_data distribution, and now we want to maximize the joint probability of all the data that we saw. But because the samples are independent, we can factor that joint down into the independent likelihoods of each of the independent samples. And then the common trick that we always use is the log trick. We know that log is a monotonic function, so maximizing something is equivalent to maximizing the log of that something. Log is also very convenient because it swaps sums and products.
[00:38:23] So it's common, instead of maximizing the likelihood of the data, to maximize the log-likelihood of the data, and that's the same as maximizing the likelihood. Once we apply the log, that product splits into a sum, and sums are easier to handle. And now we slot in our neural network, because the network is maybe directly outputting the density. So this gives us a direct objective function, a very concrete loss function, that we can use to train a neural network to solve this kind of generative modeling problem. But we need a little bit more structure here to actually make progress. This idea of maximum likelihood estimation is very general: it doesn't really assume anything about the kind of data, it doesn't really assume any structure in the data, and in general we need to put a little more structure on this to make progress.
[00:39:08] So autoregressive models basically make the assumption that there is some canonical way to take our data and split each data sample x into some sequence of sub-parts x_1, x_2, ..., x_T. You've got to be careful with indices here: these are sub-parts of a single sample, so I use subscripts, whereas on the previous slide we had superscripts, x^1 to x^n, to indicate different samples. So be careful with that: superscript on this slide means different samples x, and subscript on this slide means different parts of the same sample.
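The maximum likelihood recipe from a moment ago can be sketched numerically. This is my own toy instance with an assumed one-parameter model p(x; w) = N(x; w, 1): gradient ascent on the log-likelihood of the samples recovers the sample mean, which is the maximum likelihood estimate for a Gaussian mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from the unknown "p_data" (here a Gaussian with mean 2.5)
data = rng.normal(loc=2.5, scale=1.0, size=1000)

def log_likelihood(w):
    # sum_i log p(x_i; w) for p(x; w) = N(x; w, 1),
    # dropping the constant -0.5 * log(2*pi) per sample
    return -0.5 * np.sum((data - w) ** 2)

# Maximize the log-likelihood by gradient ascent on the single weight w
w, lr = 0.0, 1e-4
for _ in range(200):
    grad = np.sum(data - w)   # d/dw of the log-likelihood
    w += lr * grad

print(w)                      # converges to the sample mean of the data
```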
[00:39:42] So we assume there's some canonical way to break up our data sample x into a sequence of sub-parts, and now we can apply the chain rule of probability. The probability of x is just the joint probability of all of those sub-parts x_1 to x_T, and given any probability distribution, you can always break it apart with the chain rule: the joint probability of all these variables equals the probability of the first, times the probability of the second conditioned on the first, times the probability of the third conditioned on the first and the second, etc. This is the chain rule of probability. It requires no assumptions; it's always true of any joint distribution of random variables.
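The chain-rule factorization can be checked numerically on a small discrete joint distribution. A toy sketch of my own with two variables; the same identity extends term by term to T sub-parts.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary joint distribution p(x1, x2) over 3 x 4 discrete outcomes
joint = rng.random((3, 4))
joint /= joint.sum()

p_x1 = joint.sum(axis=1)                 # marginal p(x1)
p_x2_given_x1 = joint / p_x1[:, None]    # conditional p(x2 | x1)

# Chain rule: p(x1, x2) = p(x1) * p(x2 | x1), exactly, with no assumptions
reconstructed = p_x1[:, None] * p_x2_given_x1
print(np.allclose(reconstructed, joint))   # True: the factorization is exact
```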
[00:40:26] And this sort of gives us our objective function: you could basically train a neural network that inputs the previous part of the sequence and tries to give us a probability distribution over the next part of the sequence. Does that sound familiar? Does that sound like something we've done before? RNNs, yes. That's exactly what an RNN is doing, right? An RNN has this very natural structure where, by passing hidden states forward through time, the hidden state always depends on the beginning of the sequence up to the current point. So there's a very natural way to use RNNs for autoregressive modeling.
Um so you're you have your sequence of hidden [00:41:02] you're you have your sequence of hidden states that are basically summarizing [00:41:04] states that are basically summarizing your your sequence and then from each [00:41:06] your your sequence and then from each hidden state you predict probability of [00:41:08] hidden state you predict probability of the next piece of the sequence. Um [00:41:10] the next piece of the sequence. Um condition on the rest condition on all [00:41:11] condition on the rest condition on all earlier parts of the sequence um and [00:41:13] earlier parts of the sequence um and that basically is an RNN language model [00:41:14] that basically is an RNN language model that we saw some lectures ago. Have we [00:41:17] that we saw some lectures ago. Have we seen anything else that can do this? [00:41:19] seen anything else that can do this? Yes, transformers. Um and particularly [00:41:21] Yes, transformers. Um and particularly mask transformers, right? So we talked [00:41:23] mask transformers, right? So we talked in the transformers lecture um [00:41:25] in the transformers lecture um transformers can also be used to have [00:41:27] transformers can also be used to have this this structure where by masking out [00:41:29] this this structure where by masking out the attention matrix in the right way we [00:41:31] the attention matrix in the right way we can make each output of the transformer [00:41:32] can make each output of the transformer depend on only the prefix of the [00:41:34] depend on only the prefix of the sequence. So we can also use [00:41:36] sequence. So we can also use transformers for autogressive um [00:41:37] transformers for autogressive um autogressive modeling and this and [00:41:39] autogressive modeling and this and they're very commonly used for this. [00:41:40] they're very commonly used for this. 
[00:41:43] Okay. But the problem with autoregressive modeling is that you need to break your data up into a sequence, and this is very natural with text data, right? Because text data is naturally a 1D sequence. And it's even a 1D sequence of discrete things, which is great, because it's very easy to model probabilities of discrete things. We've been doing that all semester with our favorite softmax cross-entropy loss, right? The softmax cross-entropy loss is always a distribution over a fixed, discrete number of categories: the network predicts a score for each one of those, we normalize with a softmax, and we train with a cross-entropy loss. We know how to do that. So that's why these things fit very naturally for language models: language is already discrete, and language is already a 1D sequence.
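The "predict a score per category, normalize with softmax, train with cross-entropy" recipe can be written as a minimal NumPy sketch (illustrative names; real frameworks fuse these steps for stability, as done here via log-softmax):

```python
import numpy as np

def softmax_cross_entropy(scores, target):
    # scores: (V,) unnormalized scores over V discrete categories.
    # target: index of the category that actually occurred (next token).
    # Numerically stable log-softmax, then negative log-likelihood.
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]
```

With uniform scores over V categories the loss is log V, and raising the target's score lowers the loss, which is the behavior training exploits.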
[00:42:25] There's a little bit of fuzziness in that there's a tokenizer in there; we're not going to get into that. But these models are very naturally well suited to language problems because language is already 1D and already discrete. Images are trickier, because images are not naturally 1D, and images are also not naturally discrete. We often think of images as continuous, real-valued things. So these models don't fit quite as nicely onto images. But, you know, you've got a hammer, you're going to whack some nails. So people definitely applied autoregressive models to images in kind of a naive way, at least some years ago.
[00:43:01] One thing you can do to model images with autoregressive models is to treat an image as a sequence of pixels, right? In particular, each pixel is actually just three numbers, and in most displays and most representations of images those numbers are actually discrete, right? Most JPEGs or PNGs, most of the file formats we use to store images, are typically 8 bits per channel, so there's actually only a fixed number of values that each pixel can take. So a pixel is just three single-byte values. A single byte is just an integer from 0 to 255, so a pixel is three integers, each of which can be 0 to 255.
[00:43:45] So what we can do is take our image and rasterize it out into a long sequence, where each element of the sequence is one of the subpixel values of our image. And now we've turned our image into a one-dimensional sequence where each entry is a discrete value. So you can apply autoregressive modeling directly to that sequence, in exactly the way that you might have for a language model, using an RNN or a transformer. Can anyone spot a problem with this approach? Too long. Very expensive. Very, very expensive. So a kind of reasonable image that you might want to model is maybe 1024 by 1024. That's not even that high a resolution, really, but it's a pretty good resolution. But if you have a 1024 x 1024 image, that's going to be a sequence of about 3 million subpixels.
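The rasterization step above amounts to flattening the H x W x 3 array of 8-bit values into one long vector of discrete tokens. A minimal sketch with a random toy image standing in for real data:

```python
import numpy as np

# A toy "image": H x W x 3 array of 8-bit subpixel values (random,
# standing in for a real photo).
H, W = 1024, 1024
image = np.random.randint(0, 256, size=(H, W, 3), dtype=np.uint8)

# Rasterize into a 1D sequence of discrete tokens, each in [0, 255].
sequence = image.reshape(-1)

print(sequence.shape[0])  # 1024 * 1024 * 3 = 3145728 subpixels
```

Each entry is an integer in [0, 255], so it can be modeled with the same 256-way softmax machinery used for language tokens; the catch, as noted, is the sequence length of roughly 3 million.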
[00:44:30] People actually can model sequences in the millions these days, but it gets very, very expensive. There's got to be a more efficient way to do this. So there were some papers a couple of years ago where people applied these sorts of autoregressive models directly to the pixels of images, but they were not super successful, I think, because they're very difficult to scale to high resolution. A spoiler alert that we'll talk about a little more next lecture: this has actually made a resurgence in the last couple of years. But the trick is to not model the sequence as individual pixel values, but instead to use some other kind of process or procedure or model, maybe a neural network, to break the image into a sequence of one-dimensional tokens. That's something we'll talk about a bit more next lecture.
[00:45:10] But this at least gives you a sense of what an autoregressive model is: what's the probabilistic formulation, how do you apply them to language, how do you apply them to images? So from autoregressive models we next turn to variational autoencoders, and variational autoencoders are pretty fun. In these autoregressive models, we talked about how we're trying to do maximum likelihood: we broke our data up into a sequence of parts, and we're trying to maximize the likelihood of the data. Variational autoencoders are going to do something a little bit different. It's still going to be an explicit method; there's still going to be some kind of density that we can compute. But it's going to be intractable; we're only going to be able to approximate it.
[00:45:54] Why are we going to do that? We had a perfectly good method that computed densities exactly. What we're going to get in exchange is the ability to compute reasonable latent vectors over our data. We're going to have vectors that represent our data, that pop out naturally from the learning process, and those vectors are going to be useful in their own right. The ability to get access to those latent vectors is going to be useful enough to us that we're willing to give up computing exact densities, and instead settle for approximate densities that are actually lower bounds on the true density. Oh, and the motivation for breaking stuff up into a sequence in autoregressive models: it's because it factors the problem. It makes each part easier to model, right?
[00:46:32] Imagine you're doing language modeling, right? You have a vocabulary of V words, and I want to model the probability of two words jointly. How many possible two-word sequences are there? There's V squared. How many possible three-word sequences are there? There's V cubed. And in general, how many T-word sequences with a vocabulary of V are there? It's V to the T, right? So that's bad: it grows exponentially. If you wanted to directly model the joint distribution of a sequence of T things, the number of entries in the discrete probability distribution you need to model would grow exponentially with the sequence length, and that quickly becomes completely intractable if we want to go to long sequences. So the reason we break it up is so that we don't have to model it all at once.
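The counting argument above can be made concrete in a couple of lines (illustrative sketch; the vocabulary size is made up):

```python
def num_sequences(V, T):
    # Number of distinct length-T sequences over a vocabulary of size V.
    # A direct joint table would need this many entries, while the
    # chain-rule factorization needs only T conditionals of size V each.
    return V ** T

# e.g. even a tiny 10-word vocabulary blows up quickly:
# num_sequences(10, 2) -> 100, num_sequences(10, 3) -> 1000
```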
[00:47:12] We factor it in this way and predict only one part, conditioned on the previous parts. Good question. Can we apply the log trick to mitigate that? Yeah, exactly. So in practice you'll never actually see these raw probability density values modeled; almost always you're going to work in log probabilities instead. The model is going to output log probabilities, and you're going to compute your loss in log space; for numeric stability you're going to compute almost everything in log space in practice. So then the p of x is being generated because at the top of the transformer it's outputting a probability distribution over the next token conditioned on all the previous tokens, and it does that for every point in the sequence.
[00:47:50] So you could actually recover this exact probability density value by multiplying out the values at all points in the sequence. If I have an input sequence and I pass it to the transformer, the transformer will have predicted, at every point in the sequence, the distribution over all tokens conditioned on the earlier part of the sequence. I can look up what the actual next token was and what its predicted probability was, and then multiply all of those across the entire sequence. That's how we can recover the exact density value out of one of these autoregressive models, and that applies either to an RNN or a transformer. Okay. Good questions. So then, in a variational autoencoder, things get hairy.
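The recovery procedure just described, combined with the log trick from a moment earlier, can be sketched as follows (illustrative names; the per-step distributions would come from an RNN or masked transformer in practice):

```python
import numpy as np

def sequence_log_prob(step_probs, tokens):
    # step_probs: (T, V) array; row t is the model's predicted
    # distribution over the next token given tokens[:t].
    # tokens: (T,) the tokens that actually occurred.
    # log p(x) = sum_t log p(x_t | x_<t): summing logs instead of
    # multiplying raw probabilities, for numerical stability.
    picked = step_probs[np.arange(len(tokens)), tokens]
    return float(np.sum(np.log(picked)))
```

Exponentiating the result recovers the exact density value p(x) that the autoregressive factorization defines.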
[00:48:30] So we're actually going to drop the V and talk about autoencoders for just a couple of slides, because I don't think we've done that yet this course. A non-variational autoencoder is basically an unsupervised method for learning to extract features Z from inputs X without labels. This is actually in the vein of the self-supervised learning that we just talked about. Our notion is that the features ought to extract useful information about the data, right? Maybe they somehow implicitly encode the identity of objects in the image, how many of them there are, what their colors are. We want this feature vector Z to contain useful information about the input X. And the encoder itself could be a neural network of any architecture.
[00:49:12] It could be an MLP, a transformer, a CNN, whatever you want. But it inputs our data X, and it's going to output some vector Z. And then the question is, how do we learn this without labels? We actually saw a lot of examples of this in the previous lecture, but there's a very simple one, which is just to try to reconstruct the input. So we're now going to have a second part of the model called the decoder, which is going to input the Z and then output back an X. And we're going to train this thing so that the output from the model actually matches the input. This is, in some sense, the stupidest loss function ever: we're just training the model to mimic the identity function. Why do we do that? We already know the identity function.
[00:49:48] Why are we expending a lot of FLOPs and training a neural network on a big data set to just learn the identity function that we already know? It's because we're going to bottleneck it in some way. If this model had infinite capacity, for example if that Z vector was very wide, if there were no constraints on the learning, I would expect a neural network to just nail this problem. But we don't want that, because we explicitly don't care about learning this objective; we already know the identity function, and we don't need an expensive neural network to compute it. What we want to do is force the network to try to learn the identity function under some constraint. And the constraint that you often use in a traditional autoencoder is to bottleneck that representation Z.
[00:50:24] In particular, that means the vector Z in the middle is going to be much, much smaller than the input X. So your input X might be a high-resolution image, maybe a 1024 x 1024 image that we said is composed of about 3 million values, but that Z might be, say, a 128-dimensional latent code. So the model is now asked to solve this problem where I want to reconstruct the data X, but squash it through this bottlenecking representation in the middle. And we hope that this is going to force the model to learn some non-trivial structure about the data by squashing it through this representation in the middle of the network.
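The bottleneck structure can be sketched with untrained linear maps, just to make the shapes concrete. This is a minimal sketch with made-up dimensions and random weights; a real autoencoder would use deeper networks and train the weights by gradient descent on the reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

D_in, D_z = 3072, 128   # e.g. a flattened 32x32x3 image, 128-dim code

# Untrained linear encoder/decoder, only to show the bottleneck shapes.
W_enc = rng.normal(0.0, 0.01, size=(D_z, D_in))
W_dec = rng.normal(0.0, 0.01, size=(D_in, D_z))

def encode(x):      # x: (D_in,) -> z: (D_z,)   squash through the bottleneck
    return W_enc @ x

def decode(z):      # z: (D_z,) -> x_hat: (D_in,)   reconstruct from the code
    return W_dec @ z

x = rng.normal(size=D_in)
z = encode(x)
x_hat = decode(z)
recon_loss = np.mean((x - x_hat) ** 2)  # the reconstruction objective
```

Because D_z is much smaller than D_in, the network cannot simply copy its input; minimizing the reconstruction loss forces Z to capture the most useful structure of X.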
[00:51:01] And then after we do this, we can apply our normal self-supervised learning trick: you could throw away the decoder, and then use this Z to initialize some supervised model for some downstream task. The same story as the self-supervised story that we just saw. But what if we actually want to use this to generate data? Then what we'd really like to do is somehow the opposite of the self-supervised story: throw away the encoder, and instead be able to somehow sample Z's that match the kinds of Z's that the model learned to represent data as. If we had some procedure for sampling Z's that matched the data distribution in some way, then we could sample a Z, pass it through our learned decoder, and now generate a new sample, right? And now this is an implicit method, right?
[00:51:44] We said that there are no densities floating around anywhere. But if we had a way to do this, it would be a way to draw samples from the model without explicitly modeling the density in any way. The problem is that we've just kind of kicked the can down the road here, because we said we want to generate images, we want to generate X's, and we have a data set of X's. How do we do that? We said we're going to solve it by training this autoencoder, and now we have a data set of Z's and we need to sample in Z-space. That's not any easier, so we're kind of stuck. And the idea of variational autoencoders is: what if we could force some structure on the Z's?
[00:52:23] With this traditional autoencoder structure, you're not forcing the model to impose any known structure on the Z's; you're just asking it to reconstruct the data given its latent representation. But what if we had some mechanism to force the Z's to come from a Gaussian distribution, or some other known distribution? If that were the case, then at inference time, after the model is trained, we could just draw a sample from that known distribution, pass it through the decoder, and now we would have our sample. So forcing these autoencoders to be probabilistic, and enforcing a probabilistic structure on that latent space, is exactly what a variational autoencoder tries to do. Why "variational"? It's a long story; there's a lot of history around that terminology in the literature.
[00:53:08] But basically, variational autoencoders are a probabilistic spin on our traditional autoencoder. They're going to learn latent features Z from raw data, and then we'll be able to enforce a structure on that learned latent space Z such that we can sample from it at inference time, after the model is trained, and generate new samples. More concretely, we'll assume that our training data is x^i (again, note that the superscript i means these are different independent samples of X).
[00:53:40] We assume that each x^(i) was generated from some underlying latent vector z: there's some z^(i) lurking under the surface associated with every x^(i), and in the universe's procedure for generating data, first it generated the z^(i), then it generated the x^(i) from the z^(i). Everything that the universe needed to know in order to generate the image that we saw was contained in that latent vector z. But we can't see those latent vectors z; we can never observe them. We don't have a dataset of them, right? So the intuition is that x is an image, and z is some kind of latent feature representation that tells you everything you would ever need to know about that image, but you can never observe that latent vector.
[00:54:18] Um, and then after training, we could generate a sample by... oh, and the other constraint is that we're going to force those Z's to come from a known distribution. So then, after the model is trained, we can do exactly what we just said: draw a z from that known distribution, pass it through the decoder, and that's going to give us a sample. And we'll typically assume a simple prior; a unit Gaussian distribution is by far the most common. So then how do we possibly train this? This feels like an impossible problem. We want to basically train this network that's going to find a z for every x, but we can never observe the z's. This seems impossible. What are we going to do? We're going to go back to maximum likelihood, right?
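Once trained, generation is exactly the two steps just described: sample z from the known prior, then decode it. A minimal sketch of that interface, where the "decoder" is a hypothetical stand-in (a fixed random linear map, purely illustrative; a real VAE decoder is a deep network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained decoder: a fixed linear map from
# a 4-dim latent space to a 9-dim data vector (e.g. a flattened 3x3 image).
W = rng.normal(size=(9, 4))
b = rng.normal(size=9)

def decoder(z):
    # A real decoder would be a deep network outputting the mean of p(x|z);
    # this linear map only illustrates the interface.
    return W @ z + b

# Inference-time generation: draw z from the assumed prior N(0, I),
# then pass it through the decoder to get a new sample.
z = rng.standard_normal(4)
x_new = decoder(z)
```

Nothing here depends on how the decoder was trained; the whole point of forcing the latent space toward a known prior is that this sampling step becomes trivial.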
[00:54:58] If we indeed had a dataset of X's and Z's, then we could use maximum likelihood directly: use the same kind of log trick, maximize the log probability, the exact same thing that we previously saw, and train a conditional generative model p(x|z). But we don't know Z. Let's pretend we do for a moment. Because we don't know Z, we could try to marginalize, right? We know that there's some joint distribution of x and z that must exist even though we can't observe it, and in principle you could integrate out the z, marginalizing over it, to get p(x). So maybe we could pretend there's a joint distribution over x and z, marginalize out the z somehow, and still do maximum likelihood. Let's see how this works.
[00:55:43] So for this term, here we've also used the chain rule to break up that joint probability p(x, z) into p(x|z) and p(z). This p(x|z) is okay: we can compute that with our decoder here on the left, the neural network that we're hoping to train. This p(z) term is okay: we're going to assume that it's a unit Gaussian or some other simple distribution that we can compute or reason about. But this integral kills us, right? In general, we have no feasible way to integrate over the full space of a neural network's input. This p(x|z) is going to be some very complicated function modeled by a neural network; there's going to be no way that we can analytically or exactly integrate this.
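Written out, the marginalization being attempted here is (using theta for the decoder's weights):

```latex
p_\theta(x) \;=\; \int p_\theta(x, z)\, dz \;=\; \int p_\theta(x \mid z)\, p(z)\, dz
```

The prior p(z) is simple, and p_theta(x|z) is the decoder network; it's the integral over all of z-space that cannot be computed.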
[00:56:22] You can train neural networks for individual parts here, right? So the whole underlying notion, whenever you're doing this probabilistic modeling, is that we're going to write down some probabilistic terms. Hopefully some of them are simple distributions that we can write down analytically and reason about, and some of them are going to be learned neural network components. So we're kind of assuming that the probability of X given Z is going to be some neural network that we could, in principle, learn via maximum likelihood. But we're trying to write down what objective we could use to learn that neural network via maximum likelihood, and we're out of luck here because you have no way to integrate over Z. You could try to approximate that integral via some finite sampling.
[00:56:59] But in general that's probably not going to work very well, because this Z is a super high-dimensional space, and doing an approximate numerical integral in the inner loop of your training is not going to be a very good idea. So we could try something else: Bayes' rule. That's the other thing we always do in probability. So let's try Bayes' rule. With Bayes' rule we have another formula that we can use to write down p(x), right? So we can write down p(x) using Bayes' rule, in this equation on the screen. Let's see what we can do with these terms. So this p(x|z), again, we can compute that with our decoder. This p(z), okay, we assume this is Gaussian, so we can compute something with it. There are no integrals here; that's good. So we're in good shape.
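The Bayes' rule rewriting of p(x) referred to on the slide is:

```latex
p_\theta(x) \;=\; \frac{p_\theta(x \mid z)\; p(z)}{p_\theta(z \mid x)}
```

The numerator terms are the decoder and the prior; the denominator p_theta(z|x) is the posterior that causes the trouble discussed next.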
[00:57:44] But now we're out of luck: this p(z|x) term, this posterior of Z given X, we have no good way to compute. In order to compute this term you would also need some kind of integral. Out of luck; we can't compute it. What are we going to do? Okay, let's use another neural network. So the variational autoencoder trick is: there's that probabilistic term on the bottom of Bayes' rule that we can't compute, so let's just slot in another neural network to try to compute it for us. We're going to have another neural network Q, with different weights phi, that's going to learn a different conditional distribution, the probability of z given x. And the whole idea is that we want this other neural network to try to approximate the true posterior p(z|x) defined by the first neural network.
[00:58:27] And you can't really enforce this in general, but, you know, let's put a neural network there and see what we can do. [00:58:31] So then, if we could somehow have this other neural network approximating this term on the bottom that we can't compute, then we could go and compute our likelihood, do maximum likelihood, and we would all be set. So that's kind of what we do when training a variational autoencoder. We're basically going to jointly learn two different neural networks. One is the decoder, which inputs the latent code Z and outputs a distribution over the data X. The other is an encoder, which inputs the data X and outputs a distribution over the latent codes Z. And each of these is a separate neural network with its own independent weights.
[00:59:10] There's a question you might have, which is: how can you possibly output a probability distribution from a neural network? That seems confusing and hard and unclear. So the trick here is that we're going to force everything to be a normal distribution, and we're going to have the neural network output the parameters of the normal distribution. So typically for the decoder network, we're going to assume that the output distribution is a diagonal Gaussian, where the entries along the diagonal correspond to the pixels of the image. The model is going to output the mean of that diagonal Gaussian distribution, and typically for the decoder we'd assume a fixed variance sigma squared.
[00:59:49] Now for the encoder network, same idea: the model inputs the data sample x, and then it outputs the parameters of a Gaussian distribution that models the distribution q(z|x). So in this case the encoder network will output one vector, which is the mean of that Gaussian distribution, and another vector, which is the diagonal of the covariance of that Gaussian distribution. And here it's very important that we assume the diagonal structure, because otherwise we would have to model a quadratic number of entries in the full covariance matrix, right? So here, imagine an image that's H by W pixels.
[01:00:27] So you could in principle model the full covariance across every pair of pixels in the image, but that would require H² W² entries; that would be too big. So instead we'll just ignore any kind of correlation structure among the different values, and now the diagonal covariance is just a vector. So this mu of z given x and this sigma of z given x are both vectors of the same shape as z. We basically have the neural network output two vectors of the same shape and then treat them as the parameters of this Gaussian distribution. So that's how we can output a distribution from a neural network.
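A minimal numpy sketch of this interface, with a toy single-layer "encoder" standing in for a real network (all names and sizes here are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D, Z = 8, 2                     # data dim (flattened pixels), latent dim

# Toy single-layer "encoder"; a real q_phi(z|x) would be a deep network.
W = rng.normal(scale=0.1, size=(2 * Z, D))

def encoder(x):
    # Outputs the parameters of a diagonal Gaussian over z: a mean vector
    # and a per-dimension standard deviation, each of shape (Z,).
    h = W @ x
    mu = h[:Z]
    sigma = np.exp(h[Z:])       # exponentiate so the std dev is positive
    return mu, sigma

x = rng.standard_normal(D)
mu, sigma = encoder(x)
# Sampling from the predicted diagonal Gaussian q(z|x):
z = mu + sigma * rng.standard_normal(Z)
```

Note that because the covariance is diagonal, sampling is just elementwise scale-and-shift of standard normal noise; a full covariance would require a matrix square root.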
[01:01:09] If you do maximum likelihood on this thing with a fixed standard deviation, it actually becomes equivalent to L2, and that's a nice trick. And the reason you want to do that is because you could, in principle, try to model the same thing on the decoder, a separate variance for every pixel, but that would be kind of useless. If you're not modeling any covariance structure among the pixels, that would basically be saying that each pixel is allowed to vary a little bit, and the amount that each pixel is allowed to vary depends on the pixel. And then sampling from that distribution would basically amount to fixing the mean and then adding per-pixel independent noise scaled by the per-pixel variances, and that would not be a sensible thing to do.
[01:01:51] So in general, for the decoder, you kind of cheat a little bit: you pretend it's outputting a probability distribution, but in practice we're never going to sample from that distribution; instead we always output the mean. Does that make sense? Yeah. And then it turns out, if you write this down, that that constant sigma squared just comes off as a constant in front. In practice, maximizing the log likelihood of a Gaussian distribution with a fixed variance along the diagonal is equivalent to minimizing the L2 distance between the mean and the X, which is kind of nice. Yeah, good question: is there some kind of weird invariance or non-invariance structure here, with the pixels shifting?
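The equivalence mentioned here can be written down directly: for a D-dimensional diagonal Gaussian with mean mu_theta(z) and fixed variance sigma squared,

```latex
\log \mathcal{N}\!\left(x;\; \mu_\theta(z),\; \sigma^2 I\right)
  \;=\; -\frac{1}{2\sigma^2}\,\lVert x - \mu_\theta(z) \rVert_2^2
        \;-\; \frac{D}{2}\log\!\left(2\pi\sigma^2\right)
```

Since sigma is fixed, the second term is a constant, so maximizing this log likelihood over the network's output is the same as minimizing the L2 distance between x and the predicted mean.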
[01:02:30] That would be more a property of the architecture that you choose to build the neural network. So you could try to build some invariance or equivariance properties into the network architecture that's predicting these. But yeah, you're right that in general that's not accounted for at the loss level here. [01:02:50] Okay, so now we've got this idea. We've got an encoder and a decoder: one inputs X and outputs a distribution over Z; the other inputs Z and outputs a distribution over X. What's our training objective? And here's the one slide where we're going to do some math. But we'll see. So here, the idea is that we want to do maximum likelihood. That's usually the single thing that we want.
[01:03:13] That's the guiding principle behind a lot of objectives in generative modeling. So we want to maximize log p(x), and then we can use Bayes' rule to write that as the log of this Bayes' rule expression. All right, this is an exact equivalence. Now we're going to do something silly: we're going to multiply the top and bottom of this by our q(z|x). Remember, we just introduced another neural network Q, out of nowhere, that was modeling this other distribution q(z|x). Now we're going to multiply that density term on the top and bottom of this Bayes' rule expression. Then we're going to do some logarithms. And if you have some foresight, you'll for some reason decide to rearrange these terms in this particular order.
[01:03:54] And I've color-coded them so you can later go and track which term went where. But we do some logarithms and break this up into three separate terms. Now you need to make another magical observation, which is that this p(x) actually does not depend on z. Right, so far this sequence of three terms is all an exact equivalence; these are all exact equalities. So even though there's a z in this expression, it actually doesn't depend on z, because all the z's would cancel out. And if you have something that doesn't depend on Z, you can always wrap an expectation over Z around that thing. So in this case, we know that this is p(x), and we can always feel free to wrap an expectation, over z sampled according to any distribution that we want, around p(x).
[01:04:38] And because that internal thing does not depend on Z, this is always true for any distribution that we might choose to take this expectation over. Okay. So then, because expectation is linear, we can apply that expectation to each of these three terms upstairs. And now we have these three terms, each of which looks very mysterious. But if you have a lot of intuition about probability, if you memorized all those formulas that you may have seen in an earlier statistics or probability course, maybe you could learn to recognize some of these. So this first one we're going to carry down as it was before, and these second two are actually KL divergence terms.
[01:05:20] So the KL divergence is a kind of measure of dissimilarity between probability distributions, and it just so happens to have exactly the definition of these latter two terms. So we can rewrite this exactly as this first term, which is this expectation (we'll talk about it), plus these two other KL terms. These two KL terms are basically measuring discrepancy or dissimilarity between the different probability distributions that we have floating around on this slide. And now these all look kind of crazy, but if we stare at each of these terms, we can actually recover an interpretable meaning for each of the three. This first one is actually a data reconstruction term.
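One standard way to write out the three-term identity being described, with the conventional signs (the slide's color-coded terms may orient each log-ratio differently, but the content is the same):

```latex
\log p_\theta(x)
  \;=\; \underbrace{\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
  \;-\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\Vert\, p(z)\big)}_{\text{prior}}
  \;+\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\Vert\, p_\theta(z \mid x)\big)}_{\text{posterior}}
```

Here q_phi(z|x) is the encoder, p_theta(x|z) is the decoder, p(z) is the assumed prior, and p_theta(z|x) is the intractable true posterior.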
[01:06:02] If we walk through what this is saying: we're going to sample a Z, and the way we're going to sample the Z is from Q of Z given X, which is our encoder. So we're going to take our X and pass it to the encoder; the encoder is going to predict a distribution Q of Z given X. Then, from that predicted distribution, we're going to sample a Z. Then we're going to take an expectation over all such Z and maximize the log probability of X given Z. So this is basically a data reconstruction term. It's saying that if we take a data point X, run it through the encoder to get a distribution over Z, and then pass any sample of that predicted distribution over Z into the decoder, we're going to recover X. So this is a kind of data reconstruction term. The middle one is a prior term.
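In practice, this expectation is usually estimated with a single Monte Carlo sample of Z per data point, and when the decoder predicts a Gaussian over X, the log probability being maximized reduces (up to constants) to a squared-error term. A minimal NumPy sketch; the function name is my own:

```python
import numpy as np

def gaussian_log_prob(x, x_hat, sigma=1.0):
    """log N(x | x_hat, sigma^2 I), summed over dimensions.

    Maximizing this in x_hat is equivalent, up to additive constants,
    to minimizing the squared reconstruction error ||x - x_hat||^2.
    """
    d = x.size
    return (-0.5 * np.sum((x - x_hat) ** 2) / sigma**2
            - 0.5 * d * np.log(2.0 * np.pi * sigma**2))

# One-sample Monte Carlo estimate of E_{z~q(z|x)}[log p(x|z)]:
# draw one z from the encoder's q(z|x), decode it to x_hat,
# then score gaussian_log_prob(x, x_hat).
```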
[01:06:47] This is measuring the KL divergence between Q of Z given X and P of Z. Remember, Q of Z given X is the encoder: it inputs the data X and outputs a distribution over the latent space Z. So that's the encoder's predicted distribution over the latent space, and this other term, P of Z, is the prior that we assumed for the latent space, usually a diagonal Gaussian. So this term is basically saying: the model is predicting distributions of Z given X, and we want those predicted distributions to match the simple Gaussian prior that we'd previously chosen. It's just measuring how much the latent space learned by our model matches the prior. And this third term gets us in trouble. This third term is Q of Z given X.
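When the encoder outputs a diagonal Gaussian and the prior is a unit Gaussian, this prior term has a well-known closed form, so no sampling is needed to compute it. A NumPy sketch; the function name is my own:

```python
import numpy as np

def kl_to_unit_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over the latent dimensions:
        0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2))."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

The term is zero exactly when the predicted distribution already is the unit Gaussian (mu = 0, sigma = 1), and positive otherwise.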
[01:07:36] So that's the predicted distribution over Z given the input data X to the encoder. And how much does that match P of Z given X? That's the flipped-around distribution of what the decoder is modeling. And with this one, we're out of luck: we cannot compute this term, because remember, what got us into trouble in the first place was this P of Z given X. The whole reason we introduced Q was because we could not compute this P of Z given X. So now what do we do? We're going to throw it away, because we know that KL divergences are always greater than or equal to zero. So we know that this last term, because it's a KL divergence of two distributions, must be greater than or equal to zero even though we cannot compute those distributions in general; that's a well-known property of KL divergences.
[01:08:25] So we can throw it away and get a lower bound on the true probability. If we throw away that last term, then we know that log P of X is greater than or equal to the other two terms: our reconstruction term and our prior term. So this will be the loss that we use to train our variational autoencoder. The idea is that this is an approximation to the true log likelihood; it's a lower bound on the log likelihood. So if we maximize the lower bound, hopefully that will also maximize the true log likelihood, even though we're not doing it exactly. That's our training objective for variational autoencoders. So that's kind of the summary.
[01:08:58] Um, you know, you're going to jointly train an encoder Q and a decoder P to maximize what's called a variational lower bound on the true data log likelihood. This is also sometimes called the evidence lower bound, or ELBO; we're going to maximize the ELBO. And it has this particular form, with this encoder network and this decoder network. So then, to walk through what the training procedure looks like more explicitly: we're going to have this neural network encoder that inputs the X and outputs the distribution over Z. Then we're going to apply this KL term to the predicted distribution, and in particular, this is going to force the predicted distribution to be unit Gaussian.
[01:09:39] So it's basically going to encourage the predicted mean to be zero and the predicted diagonal sigmas to be all ones. Then, once we get the predicted distribution from the encoder, we're going to sample from it using this so-called reparameterization trick that allows you to backprop through this. We draw a sample Z from the predicted distribution, run it through the decoder to get the normal distribution predicted by the decoder, and then apply the reconstruction term of the loss to the output of the decoder. So even though this looked like a large, scary slide of math, it actually led to a not-too-crazy training objective for this thing.
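The reparameterization trick mentioned here fits in a few lines: instead of sampling z directly from N(mu, sigma^2), you sample fixed noise eps ~ N(0, I) and compute z deterministically from mu and sigma, so gradients can flow through both. A NumPy sketch; names are my own:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps, with eps ~ N(0, I).

    The randomness is isolated in eps, which does not depend on the
    network's parameters, so backprop can flow through mu and log_var.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
z = reparameterize(np.zeros(8), np.zeros(8), rng)  # one latent sample
```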
[01:10:25] And I think the variational autoencoder is actually very interesting, because these two losses fight against each other in a very interesting way. We're basically forcing the model to bottleneck through this latent space Z, and the two terms want different things from the latent space. The reconstruction loss kind of wants the sigmas to be zero and the mus to be a different, unique vector for each data point X, because if that were the case, then we could perfectly satisfy the reconstruction objective: we would have a separate, unique vector for every data point, there would be no probability in there, and we could perfectly reconstruct everything. So that's what the reconstruction loss wants.
[01:11:03] But the prior loss actually wants the sigmas to be all one, because it wants the distribution to be unit Gaussian, and it wants all the mus to be zero, which is very different from what the reconstruction loss wants. So in the process of training a VAE, you're asking these two losses to fight against each other, to find some equilibrium between reconstructing your data well and forcing your latent space to be close to your prior. And then once you've trained it, you can sample Z from your prior, run it through the decoder, and get a sample. Another nice thing is that because your latent space was diagonal Gaussian, there's also a notion of statistical independence across the different entries in your latent space Z. So you can vary them separately, and maybe those separate dimensions encode something useful or interpretable or orthogonal about your data.
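Generation after training is just the second half of the model: draw z from the unit-Gaussian prior and push it through the decoder. A toy sketch, where the linear-plus-sigmoid "decoder" is only a stand-in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "decoder": a fixed random linear map from a 2-D latent space
# to a 784-D (28x28) image, squashed to (0, 1) pixel values by a sigmoid.
W = rng.standard_normal((2, 784))

def decode(z):
    return 1.0 / (1.0 + np.exp(-(z @ W)))

# Sampling: z ~ N(0, I) from the prior, then decode.
z = rng.standard_normal(2)
x_sample = decode(z)
```

Varying the entries of z independently just means sweeping each coordinate of z while holding the others fixed, which is how the digit-morphing grids are made.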
[01:11:47] So in this case, we took a VAE and trained it on a dataset of handwritten digits, and you can see that as we vary two dimensions of the latent space, the digits smoothly morph from one category into another; this is a pretty common property of VAEs. So that's basically it for today. To recap: we talked about supervised versus unsupervised learning, we talked about these three different flavors of generative modeling, and then we talked about one branch of this family tree of generative models. So next time we're going to come back and talk about the other half of the family tree of generative models, in particular generative adversarial networks and diffusion models.
================================================================================ LECTURE 014 ================================================================================ Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 14: Generative Models 2 Source: https://www.youtube.com/watch?v=Edr4uZFh4EE --- Transcript [00:00:05] So last time we were talking about generative models, and we started off with some discussion of generative versus discriminative models. Recall that these are both different flavors of probabilistic models, but they differ in what we're trying to predict, what we're conditioning on, and, really critically, what we're normalizing over. So we talked about discriminative models, where you're trying to predict the label Y conditioned on your data X; generative models, where you're trying to learn a probability distribution over your data X; and conditional generative models, where you want to model the data X conditioned on some user input or label Y.
[00:00:39] And recall that these differ in what you're normalizing over, because probability distributions introduce this normalizing effect: different kinds of things need to compete for probability mass, due to the normalization constraint of probability distributions. And last time we also went through a taxonomy of different categories of generative models, because it turns out this area of generative modeling is something people have studied for a very long time, and they have come up with a lot of different categories of methods to try to solve variants of these problems.
[00:01:08] So we went through this family tree of generative models, where last time we talked about these explicit density models, where the model outputs some quantity P of X: either the exact predicted P of X, in the case of tractable density models, or some approximate version of P of X, in the case of approximate density models. In the case of tractable density, we saw autoregressive models as one category, and we saw variational autoencoders as an example of something that gives you an approximate density. [00:01:40] So recall that for autoregressive models, what we did is take our image, or more generally whatever kind of data we're working with, and break it up into a sequence.
[00:01:50] And for the case of image data, we typically treat this as a sequence of pixel values, or even sub-pixel values. We usually want these to be discrete, so you treat those sub-pixel values as 8-bit integers that can each take a value 0 to 255. You string this out into a long sequence of integers and then model it with some discrete autoregressive sequence model, typically an RNN or a transformer. [00:02:13] Then we also saw the variational autoencoders, which were another explicit density model, but they compute not the exact density but some approximation to it, in particular a lower bound to the density.
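The serialization step described here is just flattening the image into raster order so that each 8-bit value becomes one token for the sequence model. A minimal NumPy sketch; the tiny array shape is only illustrative:

```python
import numpy as np

# A tiny 2x2 RGB "image" of 8-bit integers, standing in for a real photo.
image = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)

# Flatten into a 1-D sequence of tokens, each in 0..255, in raster order.
# An RNN or transformer then models p(seq[t] | seq[:t]) over this sequence.
seq = image.reshape(-1)

# The mapping is invertible: reshaping recovers the original image.
restored = seq.reshape(2, 2, 3)
```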
[00:02:26] To do this, we jointly trained an encoder network, which inputs the data X and outputs a distribution over latent codes Z, and a decoder network, which inputs a latent code Z and outputs a predicted piece of data X. We were able to jointly train these two networks, the encoder and the decoder, to maximize this variational lower bound on our likelihood function. Recall that maximum likelihood is one of the key insights behind all generative modeling: often, our objective function for training generative models is somehow to maximize the likelihood of the data that we observe, which comes from our true data distribution. [00:03:05] So today we're going to continue our discussion of generative models and explore the other half of the family tree: these implicit density models.
[00:03:12] In implicit density models, we are no longer going to get access to an actual density value P of X, but these models will implicitly model the probability distribution. Even though we can't compute a density value P of X for any piece of data X, we will be able to sample from the underlying distribution that these models learn. So we'll be able to draw samples from the learned distribution even if we can't output an actual density value. The first such model that we'll explore is the generative adversarial network, usually called a GAN. And it's useful to contrast GANs with the variational autoencoders and autoregressive models that we've seen so far. Like we just said, autoregressive models are a likelihood-based method; their training objective is maximum likelihood.
[00:03:56] So you write down this parameterized function that is your P of X, where X is a piece of data, and then you maximize it over the data that you observe, doing maximum likelihood. Variational autoencoders follow a similar idea, where we write down an approximation to P of X and then maximize that approximation. Now, generative adversarial networks will do something a little bit different. They will give up on directly modeling that P of X, as we just said. But even though they don't explicitly model the P of X or let us get out those density values, they will give us some way to sample from the underlying distribution that the model is fitting. [00:04:32] So the setup here is that we'll start by having some finite samples of data x_i, which are assumed to be drawn from some true data distribution p_data.
[00:04:42] And our goal is to be able to draw samples from p_data. Recall that p_data is something like the true distribution of the universe: this is the distribution that the universe uses to give you samples of your data, and it is likely a very complicated distribution. It involves physics, it involves history, it involves social and political constraints, maybe, right? There's a lot of complication that goes into all the stuff happening in the universe that gives rise to the data that you see. And somehow we want to fit some approximate model that tries to match that true data distribution as well as possible, and then allows us to draw new samples from our fitted distribution that look like the original data samples that we observed. [00:05:21] So the way that we're going to do this is by introducing a latent variable Z.
Um this looks kind of like the latent [00:05:25] Z. Um this looks kind of like the latent variable Z that we saw in um variational [00:05:27] variable Z that we saw in um variational autoenccoders where it's going to give [00:05:30] autoenccoders where it's going to give where the this latent variable Z is [00:05:31] where the this latent variable Z is going to be distributed according to [00:05:33] going to be distributed according to some known prior distribution P of Z [00:05:35] some known prior distribution P of Z that we will write down and control [00:05:37] that we will write down and control ourselves and usually this is going to [00:05:38] ourselves and usually this is going to be a unit gausian or or a uniform [00:05:41] be a unit gausian or or a uniform distribution but typically a unit [00:05:42] distribution but typically a unit gausian something very simple that we [00:05:44] gausian something very simple that we know how to sample from we know the [00:05:45] know how to sample from we know the analytical properties of um and now the [00:05:48] analytical properties of um and now the setup is that we're going to um imagine [00:05:50] setup is that we're going to um imagine some data generating process that our [00:05:52] some data generating process that our network is going to model. So here um [00:05:55] network is going to model. So here um we're going to imagine that we sample a [00:05:57] we're going to imagine that we sample a Z according to our known distribution P [00:05:59] Z according to our known distribution P of Z um to get a sampled Z, pass that [00:06:02] of Z um to get a sampled Z, pass that sampled Z through a generator network [00:06:04] sampled Z through a generator network that G of Z um and then that X is going [00:06:07] that G of Z um and then that X is going to be a sample from some generator [00:06:09] to be a sample from some generator distribution PG. Um and as we vary the [00:06:12] distribution PG. 
[00:06:12] And as we vary the parameters, the architecture, or the training of our generator network, that induces different distributions p_G that we sample from. The whole goal of GAN training is to force this p_G distribution, which is induced by our generator network, to match the true p_data distribution as closely as possible. Because if they match, then we can sample a z, pass it through our generator, and get a sampled piece of data that looks a lot like p_data. [00:06:43] So the picture is something like the following: we sample z from p(z) to get a concrete latent vector, pass it through G, and that gives us a generated image. The generator network is basically trained to convert a sample from a known distribution over z into a sample from our data distribution.
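The sampling pipeline just described — draw z from a simple known prior and push it through G to induce a distribution p_G over outputs — can be sketched in a few lines. Everything concrete below (the 2-D latent, the fixed affine map standing in for a trained generator network) is a made-up illustration, not the lecture's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(n, dim=2):
    # Draw latents from the known prior p(z): here a unit Gaussian,
    # which is easy to sample from and analytically simple.
    return rng.standard_normal((n, dim))

def generator(z):
    # Stand-in "generator": a fixed affine map. A real G is a learned deep
    # network; the point is only that pushing p(z) through G induces a new
    # distribution p_G over the outputs.
    W = np.array([[2.0, 0.0], [0.0, 0.5]])
    b = np.array([1.0, -1.0])
    return z @ W + b

z = sample_z(10_000)
x = generator(z)           # samples from the induced p_G
print(x.mean(axis=0))      # close to b = [1, -1] for this toy G
```

Changing W and b changes the induced p_G, which is the sense in which varying the generator's parameters varies the distribution you end up sampling from.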
[00:07:02] But now the question is: how can we force these outputs, how can we force the induced generator distribution p_G, to match the data distribution p_data? The trick in generative adversarial networks is to introduce another neural network to do that task for us. In the previous flavors of generative modeling, VAEs and autoregressive models, we tried to write down some objective function that we could minimize to force our fitted distribution to match the data distribution. Here we relinquish that control and ask another neural network to solve that task for us. In particular, we're going to train another neural network called the discriminator D.
[00:07:42] And this discriminator is tasked with inputting an image, sometimes a real image and sometimes a fake image, and classifying whether that image is fake or real. The idea is that these two networks are going to fight: we train the generator to try to fool the discriminator, and we train the discriminator as a classification model to correctly discriminate, or classify, between real data and fake data. The intuition is that as these two networks fight, the discriminator gets better and better at picking out the features that separate real data from fake data.
[00:08:17] And once the discriminator gets really good, then in order to fool it into classifying generated samples as real, the generator will need to get closer and closer to producing samples that look like true data. So that's the intuition behind generative adversarial networks. [00:08:32] Question from the audience: does the generator network get feedback from the discriminator on whether it's classifying correctly? Yes, and that's crucial for the whole process to work. The feedback it gets is gradients: this whole composite system of generator and discriminator is just neural networks, we know how to compute gradients through them, and they communicate through the generated image. So we back-propagate from the discriminator all the way through the generated image into the generator.
[00:08:58] So that's how the generator is going to learn from the discriminator. [00:09:01] More concretely, we need to write down some actual equations, some actual math, to concretize this intuition. In particular, we're going to jointly train the generator G and the discriminator D with a minimax game. The equation may look a little daunting, so we'll walk through the terms one by one. We'll color-code it: the generator in blue, the discriminator in red. The discriminator is a function that inputs a piece of data x and outputs the probability that the data is real. In particular, D(x) = 0 means the discriminator has classified that piece of data x as fake.
[00:09:42] D(x) = 1 means the discriminator has classified that piece of data as real. Of course those are the extreme cases; in practice the discriminator will output some probability in between, a soft version of those two decisions. [00:09:53] Now imagine that we fix the generator G and consider this problem purely from the perspective of the discriminator. From the discriminator's perspective there are two terms. The first term says the discriminator wants D(x) = 1 for real data; remember, D(x) = 1 means the discriminator says the input is real. The expectation says we draw data samples x from the true p_data distribution.
[00:10:22] We pass those through the discriminator and then take a log, because we almost always work in log space with probabilities. Remember that log is a monotonic function, so maximizing log D(x) is the same as maximizing D(x). So this term says we want to maximize log D(x) for real data, which is equivalent to wanting D(x) = 1 for real data. [00:10:46] On the other side, we take an expectation by sampling latents z according to our known prior p(z), pass those z's through the generator to get generated data samples, and pass those generated samples through the discriminator. Now the discriminator wants to classify these as fake, because they are fake samples.
[00:11:08] Since the discriminator wants to classify these as fake, we need to somehow invert the expression on the left. Here we want D(x) = 0 for fake data, and one way to express that is to maximize log(1 - D(G(z))). So the term on the right says the discriminator wants D(x) = 0 for fake data, and the term on the left says the discriminator wants D(x) = 1 for real data. That's all the discriminator is trying to do: correctly solve the classification task between generated samples and real samples from our dataset, labeling them as real or fake. [00:11:44] Now look at this from the perspective of the generator: imagine fixing the discriminator and considering this setup only from the generator's side, with a fixed discriminator.
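The two discriminator-side terms just walked through add up to the minimax value V(G, D) = E_{x~p_data}[log D(x)] + E_{z~p(z)}[log(1 - D(G(z)))]. As a sanity check, here is a tiny Monte Carlo estimate of V given batches of discriminator outputs; the batch values are invented for illustration:

```python
import numpy as np

def V(d_real, d_fake):
    # Monte Carlo estimate of the minimax value:
    #   V(G, D) = E_{x~p_data}[log D(x)] + E_{z~p(z)}[log(1 - D(G(z)))]
    # d_real: D's outputs on real samples; d_fake: D's outputs on G(z).
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A near-perfect discriminator (D ~ 1 on real, D ~ 0 on fake) pushes V
# toward its maximum of 0:
print(V(np.full(4, 0.999), np.full(4, 0.001)))   # ~ -0.002

# A maximally confused discriminator (D = 0.5 everywhere) gives
# V = 2 * log(0.5) ~ -1.386, i.e. -log 4, which is the value the original
# GAN analysis derives at the optimum where p_G = p_data:
print(V(np.full(4, 0.5), np.full(4, 0.5)))
```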
[00:11:54] Now in this case, the first term doesn't depend on the generator at all, because that term was just about getting the discriminator to correctly classify the real data samples. So the generator only cares about the term on the right. Intuitively, we want the generator to fool the discriminator into thinking its samples are real, which means the generator wants D(x) = 1 for fake data. The term is the same: we draw a sample z according to p(z), pass it through the generator to get a generated sample, and pass that through the discriminator to get the discriminator's predicted probability on that sample. [00:12:30] And now recall that the generator wants D(x) = 1.
[00:12:36] So rather than maximizing this term like the discriminator wanted to, we're instead going to minimize it from the perspective of the generator. [00:12:43] And that gives us this minimax game. In particular, we can abstract away all this math by writing it as some scalar function V of G and D. Then the discriminator wants to maximize V, the generator wants to minimize V, and they fight against each other in that way. To optimize this, we run a gradient-based loop, taking alternating steps to maximize and minimize this value with respect to the parameters of the discriminator and the generator.
[00:13:14] So: loop forever. First update D: take the derivative of V with respect to D's weights and step in the plus direction of that gradient, because remember the discriminator is trying to maximize this, so we do gradient ascent on this term. Then, after updating the discriminator weights, update the generator weights: take the derivative of V with respect to the generator weights and take a gradient descent step on V, because the generator wants to minimize that objective. [00:13:44] So that's basically how we train generative adversarial networks: we've got this thing V, the value of our minimax game, and we take alternating gradient ascent and gradient descent steps on that objective V,
[00:14:00] in order to alternately update the generator and the discriminator. [00:14:05] One thing that's really important to realize when training generative adversarial networks is that this V is not a loss function. The value of V basically does not tell us anything about how well the generator and discriminator are solving this problem, or really about the thing we care about: how well the induced p_G distribution matches the data distribution. Just looking at the value of V doesn't tell us anything about that, because the value of V depends on how good the discriminator is. If the discriminator is really bad, then it's really easy for the generator to fool it and get good numbers; and if the discriminator is really good, then the
generator has to be really good. So there can be different settings of D and G that lead to the exact same value of V. [00:14:52] That means generative adversarial networks are often really hard to train, and it's even hard to tell when they are doing a good job at training. Normally when you train a neural network you have a loss: you minimize the loss with respect to the parameters of your network, and you want to see that loss go down over the course of training. You don't have that with generative adversarial networks. You have the generator loss and the discriminator loss, and you can try to plot them, but in general they're pretty meaningless. So generative adversarial networks are really hard to train. For one, this objective is fundamentally unstable.
[00:15:26] You're trying to jointly maximize and minimize the same quantity with respect to different sets of parameters of the network, so that's inherently a difficult optimization problem. And even worse, you don't have any value to look at that tells you whether you're making progress towards a good solution. So generative adversarial networks are pretty effective, but they're really hard to train, really hard to tune, and really hard to make progress on. [00:15:48] So that's the main takeaway for generative adversarial networks. There is one little trick that's useful to think about when training GANs: imagining the training dynamics of these things.
[00:16:02] Imagine the very beginning of training: your generator is randomly initialized and your discriminator is randomly initialized. What's going to happen? At the very start of training, the generator is producing completely random noise, and that noise looks very different from real images. So at the very beginning of training, when the generator is terrible, the discriminator has a very easy problem. Typically, within a couple of iterations, the discriminator can almost immediately learn to tell real images apart from the totally garbage random fake images that the untrained generator is giving you.
So that means that at the very start of training, the discriminator quickly learns to classify real versus fake with pretty high probability. [00:16:44] So it's interesting to plot this generator term as a function of D(G(z)), because that's basically the loss function from the perspective of the generator. From the generator's perspective, at the beginning of training we are somewhere all the way over here: the discriminator is doing a really good job of classifying generated samples as fake. That means the generator is trying to optimize a loss function that looks something like this.
[00:17:17] And if you notice something, this loss function is flat, or very close to flat, in the region where the generator is trying to optimize its parameters. So in practice, when you use this naive objective for training GANs, the generator has a really hard time learning at the beginning of training. [00:17:34] Good question from the audience: how do you assemble the dataset? How do we generate a photo of a unicorn if no unicorn exists? Well, this p_data is your choice: whatever dataset you happen to assemble as your training set, that's you choosing the p_data you're trying to model. So in general, if you want to generate a sample of something that looks nothing like anything you've ever seen before, you're out of luck. It's not going to happen.
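To put a number on the flatness mentioned a moment ago: near D(G(z)) = 0 the generator's curve log(1 - D(G(z))) has slope only about -1, while near D(G(z)) = 1 it is steep, so most of the gradient lives where the generator is already winning. The comparison below also shows the "non-saturating" heuristic from the original GAN paper (maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))), which the lecture has not introduced at this point, so treat it as an aside:

```python
def grad_saturating(d):
    # d/dd of log(1 - d): the generator's term in the minimax objective.
    return -1.0 / (1.0 - d)

def grad_non_saturating(d):
    # d/dd of -log(d): minimizing -log(d) is the same as maximizing
    # log D(G(z)), the "non-saturating" alternative.
    return -1.0 / d

# Early in training the discriminator easily spots fakes, so d ~ 0:
d = 1e-3
print(grad_saturating(d))       # about -1.001: nearly flat, weak signal
print(grad_non_saturating(d))   # about -1000: strong signal where G is losing
```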
[00:17:59] So the only way you're generally going to draw such samples is if you have something in your training dataset that looks kind of like them. These networks, like all generative models, and really all neural networks, do generalize a little bit. So the hope is that maybe you've never seen a photorealistic image of a unicorn wearing a Santa Claus hat, but you've seen photorealistic images of horses, you've seen photorealistic images of Santa Claus hats, you've seen drawings of unicorns, and you've seen drawings of horses. Even if you've never seen the exact composition of attributes you want to generate, you've seen enough things close enough to it that the model can generalize and give you something new. That's always the hope here. What does this look like from the discriminator's perspective?
[00:18:39] Well, if I generated this image of a photorealistic unicorn wearing a Santa Claus hat, then maybe all the textures are really perfect: the lighting is perfect, the shadows are perfect, the leaves are perfect. So there's no real evidence in that sample itself to say that this is obviously wrong. Maybe if the discriminator was really, really smart, then it could somehow know that unicorns don't really exist, and that a perfectly photorealistic image of one isn't likely to happen. But that's a pretty hard semantic problem to solve. So in practice, discriminators tend not to really be that smart. Yeah, good question. The idea is: why don't we look at two curves? Why don't we look at one curve saying how good the discriminator is,
[00:19:18] and another curve saying how good the generator is? Feel free to plot them. They tend to look really useless. And there are probably literally hundreds of research papers of people trying to solve this problem and figure out: how do we tweak the GAN objective? How do we not use a log? How do we use a Wasserstein something-or-other? People put all kinds of crazy stuff into this to make those curves more interpretable. Hundreds of papers written about it, like five years of thousands of people's time, and I don't think anybody came up with a good solution. So still, even after hundreds or thousands of papers on training GANs, a lot of people still end up using this vanilla formulation, which tends not to give you very interpretable curves even when you split it up that way.
[00:19:57] The question is: is what happens to the discriminator early in training really important? And the answer is no, because this is unlike any other classification problem we've ever seen before: this is a non-stationary distribution, right? When you train an image classifier on ImageNet or CIFAR or something like that, the data set is fixed, and the model is just trying to classify that static data set. But in the case of GAN training, the data set that it's trying to fit is changing under it during the course of training. Even at the beginning of training, maybe the generated images look really bad and it's easy to solve the problem, but then the generator gets better, and now the data set that the discriminator is trying to discriminate changes under it during the course of training.
[00:20:35] So that means it's a non-stationary problem, with very complicated learning dynamics. Yeah, good question: do these get caught in local minima? Are there ways to kick them out of local minima, train for a while, kick them out? I think, again: hundreds of papers, thousands of papers, lots of heuristics, nothing really stuck. Correct, so you have to train this end to end. Gradients from the discriminator always propagate into the generator, in particular through this term on the right. So the only way that you're ever getting gradients onto the generator's parameters is actually through the discriminator, right? I mean, unless you have some regularizer in here, there's no auxiliary term telling the generator what to do other than the gradients that get through the discriminator.
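To make concrete the point that the generator's only learning signal arrives through the discriminator, here is a toy one-dimensional GAN with hand-derived gradients. This is an illustration I'm adding, not the lecture's code: the linear generator, the logistic discriminator, and all the constants are assumptions. Notice that both generator updates are multiplied by the discriminator weight `w`, so if the discriminator carried no information, the generator would learn nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Toy 1-D GAN: real data ~ N(2, 0.5), generator G(z) = g1*z + g0,
# discriminator D(x) = sigmoid(w*x + b).
g1, g0 = 1.0, 0.0   # generator parameters
w, b = 0.1, 0.0     # discriminator parameters
lr = 0.05

for step in range(2000):
    x_real = 2.0 + 0.5 * rng.standard_normal()
    z = rng.standard_normal()
    x_fake = g1 * z + g0

    # discriminator: gradient *ascent* on log D(x_real) + log(1 - D(x_fake))
    d_real = sigmoid(w * x_real + b)
    d_fake = sigmoid(w * x_fake + b)
    w += lr * ((1 - d_real) * x_real - d_fake * x_fake)
    b += lr * ((1 - d_real) - d_fake)

    # generator: gradient *descent* on -log D(G(z)) (the modified loss)
    d_fake = sigmoid(w * x_fake + b)
    g1 -= lr * (d_fake - 1.0) * w * z   # gradient reaches g1 only through w
    g0 -= lr * (d_fake - 1.0) * w       # likewise for g0

print(g0, g1)  # g0 tends to drift toward the data mean of 2
```

Even on this toy, the two-player dynamics oscillate rather than settle cleanly, which previews the instability discussed throughout this section.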
[00:21:13] And that's again leading to part of the unstable learning problem. Correct, that P data distribution is going to stay fixed over the course of training. All right. So we said there's this problem that the generator gets low gradients early in training. There's a little hack here where, rather than trying to maximize log of 1 minus D of G of z, you can instead minimize minus log of D of G of z. You can convince yourself offline that those are roughly equivalent, but the TL;DR is that it gives you a better curve, so the generator gets better gradients at the start of training. So that's really important.
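A quick numeric sanity check (my own illustration, not from the lecture) of why the swap helps: differentiate both generator losses with respect to d = D(G(z)) and evaluate where the discriminator confidently rejects a fake, i.e. d close to 0.

```python
def grad_saturating(d):
    # derivative of log(1 - d) with respect to d (the original generator objective)
    return -1.0 / (1.0 - d)

def grad_nonsaturating(d):
    # derivative of -log(d) with respect to d (the modified generator objective)
    return -1.0 / d

d = 1e-3  # early in training: D easily spots fakes, so D(G(z)) is tiny
print(abs(grad_saturating(d)))     # ~1.001: almost no learning signal
print(abs(grad_nonsaturating(d)))  # 1000.0: strong learning signal
```

So the modified loss is steep exactly where the original one is flat, which is why the generator can start learning even while the discriminator is winning.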
[00:21:49] And whenever you're training GANs from scratch using this sort of log objective, this trick of using the modified loss for the generator is really important in practice. So that means there actually is one V that you're computing for the generator and a different V that you're computing for the discriminator, and they aren't quite the same. Okay, there's another question of why this might be a good objective. I used to have slides that walked through this proof step by step, but I don't think we have time for that today, so I'll just give you the TL;DR and refer you to something offline to check. The TL;DR is that this objective is good because you can write down the optimal discriminator, right?
[00:22:28] So this is a nested optimization problem, where there's an inner maximization over D and an outer minimization over G. So if you do a little bit of math, you can actually solve this inner maximization problem and write down what the optimal discriminator is. This is the optimal discriminator with respect to a particular generator G. And you can just write this down, although even though you can write it down, you can never compute it, because it depends on P data, and you can never compute P data: if you had access to the P data density, you'd be done. So you can write this down as an equation on a slide or on a piece of paper, but you can never compute it.
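Written out, following the standard GAN analysis (same objects as in the lecture, with p_data the data density and p_g the density induced by the generator), the optimal discriminator for a fixed G, and the value of the objective once it is substituted back in, are:

```latex
D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)},
\qquad
\max_D V(G, D) = -\log 4 \;+\; 2\,\mathrm{JSD}\!\left(p_{\text{data}} \,\|\, p_g\right)
```

Since the Jensen-Shannon divergence is nonnegative and zero only when the two distributions are identical, the outer minimization over G attains its unique optimum exactly when p_g = p_data, which is the result being summarized here.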
[00:23:06] And then once you have maximized this inner objective by writing down the optimal discriminator, you can show that the outer objective is minimized if and only if PG of x is equal to P data. So at least theoretically, the optimum state of both the discriminator and the generator occurs uniquely when PG is equal to P data. So that kind of makes us feel good, but there are a lot of caveats to that theoretical result. One is that it assumes infinite capacity for both G and D: it assumes that your generator and discriminator can in principle represent any function, which of course they can't, because they are neural networks of a fixed size and capacity.
[00:23:48] This also tells us absolutely nothing about whether or not we will converge to that solution. So even though there is this optimum point in the objective landscape, this tells us absolutely nothing about whether we can ever reach it via this gradient descent / gradient ascent, especially with a finite number of data samples. So there is this sort of comforting theoretical result, some theoretical justification for GANs, but in practice it doesn't really hold or give us very strong guarantees. So for these GANs: in practice, your generator G and your discriminator D are both going to be parameterized as neural networks. And they used to be CNNs; GANs kind of fell out of favor before ViTs became popular, but I'm sure they would work with ViTs as well.
[00:24:33] And the first GAN that really gave non-trivial results was called DCGAN, which had this five-layer convnet architecture that gave what were, at the time, pretty exciting samples. And I mention DCGAN because of the first author, Alec Radford. You know, for most people, doing DCGAN would have been a highlight of their career. But for Alec Radford, it wasn't nearly enough, because the next project he worked on right after DCGAN, does anybody know? GPT? GPT. So for Alec Radford, DCGAN was kind of a lowlight in his career. He went on to do GPT-1 and GPT-2, as well as some other amazing work at OpenAI.
[00:25:08] So I think there's this really cool connection: people that were working on generative modeling of images actually jumped over to do generative modeling of discrete text data and did some of the really important work there. And the only other GAN paper that I'm going to highlight is called StyleGAN. I'm not really going to walk you through the details of this one, other than to point you at it as a good one to read if you want to know the best practices of GANs. They use a much more complicated architecture, but they get pretty good results in practice. And one really nice thing about GANs is that they actually tend to learn something smooth in the latent space.
[00:25:44] So what I mean by that is: if we have two latent vectors Z0 and Z1 and we interpolate between them, that is, you draw a sample Z0 from your Gaussian, you draw a sample Z1 from your Gaussian, then you interpolate some kind of curve between Z0 and Z1, and for every point along the curve we generate a sample using our generator. Then if we do that, we tend to get smooth interpolations in this latent space, which is something really cool with GANs. And here's an example of this latent space interpolation from the StyleGAN 3 paper. So you can see that these are all generated samples, made by smoothly varying that latent Z and then passing it through the generator, and you can see that these animals sort of smoothly morph into each other.
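The interpolation procedure just described can be sketched in a few lines. This is my illustration, not the lecture's code; the `generator` here is a stand-in identity map so the snippet runs, where a real setup would call a trained generator network.

```python
import numpy as np

def lerp(z0, z1, t):
    # linear interpolation between two latent vectors
    return (1.0 - t) * z0 + t * z1

rng = np.random.default_rng(0)
latent_dim = 128
z0 = rng.standard_normal(latent_dim)  # sample two latents from the N(0, I) prior
z1 = rng.standard_normal(latent_dim)

generator = lambda z: z  # stand-in for a trained GAN generator
# one generated sample per point along the curve between z0 and z1
frames = [generator(lerp(z0, z1, t)) for t in np.linspace(0.0, 1.0, 8)]
print(len(frames))  # 8
```

One detail practitioners often change: for a Gaussian prior, spherical interpolation is commonly preferred over the straight line, because the midpoint of a lerp between two high-dimensional Gaussian samples has an atypically small norm.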
[00:26:26] So that means that the model has somehow uncovered some useful structure and stuffed it into the latent space. So that's pretty cool. So I used to talk a lot more about generative adversarial networks. The pros, basically: they have a fairly simple formulation, and if you tune them right, like we saw with StyleGAN 3, they can actually give you very nice results. Very beautiful images, very high resolution, very good stuff. But the cons: like we talked about, they're fairly unstable to train. You have no loss curve to look at. You have very unstable training. They tend to blow up at the drop of a hat. So you end up with what's called mode collapse.
[00:27:03] All of a sudden you might get NaNs, you might get Infs; your discriminator starts going crazy, your generator starts producing complete random garbage, and all the while you have no loss curves to look at to diagnose this. They're kind of a mess. So even though GANs can give you really nice results if you very, very carefully tune them, and very, very carefully control the normalization, the sampling, everything about them, in practice they've been fairly hard to scale up to really big models and really big data. So GANs were basically the go-to category of generative models from around 2016 to maybe around 2020 or 2021, something around there. And in those five years there were literally thousands and thousands of papers: people both trying different GAN formulations, different loss functions, different mathematical
[00:27:49] formalisms, as well as applying GANs to all kinds of different generative modeling tasks that you can imagine. So this was basically the go-to generative modeling framework for about five or six years. Question: shouldn't we just expect these smooth latents? I think not necessarily, because one thing that can happen with GANs is the generator might just memorize a fixed number of data samples, right? So what if your generator ignores the latent Z and just memorizes 10 samples from the training data set somehow, and then no matter what Z you give it, it always gives you one of those 10 samples from the training data set and never gives you anything else?
[00:28:25] Then in that case, you're going to fool the discriminator, because the generator is always giving you something which is maybe bitwise identical to one of your real samples. And in that case the generator would have basically piled up Dirac delta density in the immediate vicinity of a couple of finite samples, but put no probability mass anywhere else. So that actually is kind of a legitimate solution for the generator, and it would definitely not give you smooth latents at all. So that's just one example of how these things can collapse into unintuitive solutions that are not what you want. Oh, good question: what is the relationship between your training data set and your latents? This is actually something very fundamental about GANs; it's a great question. So you can map one way.
[00:29:04] So the generator gives you a mapping from latent space into data space: it maps from a Z to an X. But with GANs, you in general have no way to map back from an X to a Z. And that's something very different between GANs and VAEs. So VAEs will learn an explicit mapping from X back to Z, but with GANs you have no such thing. You can try to compute an inverse, analytically or numerically via gradient descent; there are papers that do that. But there's actually no explicitly enforced relationship between X and Z. Instead, you can think of the discriminator as trying to just enforce a distributional alignment between the distribution of all the outputs coming from the generator and the distribution of all the data samples, without any kind of explicit supervision between them.
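The numerical-inversion idea mentioned here can be sketched briefly. This is my own illustration under stated assumptions, not any specific paper's method: `G` is a toy linear generator standing in for a trained network, and we recover a latent for a given sample x by gradient descent on the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))   # toy "generator" weights: G(z) = A @ z
G = lambda z: A @ z

z_true = rng.standard_normal(4)
x = G(z_true)                     # the sample we want to invert

z = np.zeros(4)                   # arbitrary starting latent
lr = 0.02
for _ in range(5000):
    grad = 2.0 * A.T @ (G(z) - x)  # gradient of ||G(z) - x||^2 with respect to z
    z -= lr * grad

print(float(np.linalg.norm(G(z) - x)))  # reconstruction error, driven toward 0
```

For a real nonlinear generator there is no closed form, so the same loop is run with automatic differentiation, and it may land in a local optimum; that is one reason these inversions are only approximate.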
[00:29:45] Of course, when it comes to GANs, anything you can think about, there are probably at least a dozen papers about it. So there are also a lot of papers about GAN variants that try to learn bidirectional mappings, both ways, but those never really took off. Oh, good question: what have we gained? So when we went to VAEs, we gained latent vectors, but we gave up density. And now with GANs, it seems like we've got latent vectors that we can't control. What you gained was much better samples. So when it comes to VAEs, they tend not to give you very good samples: VAE samples are sort of characteristically always kind of blurry. They never really look good. VAEs on their own just never tend to give you very clean, crisp samples. But with GANs, as you saw with some of the examples, you can get very crisp, very clean, very good samples.
[00:30:28] But what you lost was your sanity in trying to tune these systems. Yeah. At inference time you throw away the discriminator and just use the generator. So at inference you just draw a sample Z from the prior, pass it through the generator, and get your sample from your data distribution. So it's very, very efficient at inference time. All right. So I mentioned that GANs used to be the go-to category of generative modeling for about five or six years. So what displaced them? What displaced them was a very different category of models called diffusion models. Now, I need to put some caveats here. The diffusion model literature is crazy, right? You read these papers, and they go through like five pages of math before they tell you at all what's going on.
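Backing up to the GAN inference procedure just described (throw away the discriminator, one forward pass of the generator), here is a minimal sketch. The two-layer generator with random weights is a hypothetical stand-in for a trained network; only the inference pattern itself comes from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "trained" generator weights: a tiny two-layer MLP mapping a
# 128-dim latent z to a flattened 32x32x3 image. The weights are random here,
# purely for illustration; a real GAN would have learned them.
W1 = rng.standard_normal((128, 256)) * 0.05
W2 = rng.standard_normal((256, 3072)) * 0.05

def generator(z):
    """Map latent samples z to data-space samples x in one forward pass."""
    h = np.maximum(z @ W1, 0.0)  # ReLU hidden layer
    return h @ W2

# GAN inference: sample z from the Gaussian prior, run the generator once.
# The discriminator is not used at all at inference time.
z = rng.standard_normal((16, 128))
x = generator(z)
print(x.shape)  # (16, 3072)
```

Note there is no iteration here: one draw of z, one forward pass, one sample, which is why GAN inference is so efficient.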
[00:31:10] And there are like three different mathematical formalisms that lead to diffusion models, which are all very different mathematically, and there's very different notation, very different terminology, very different mathematical formalism between papers. So this is a sub-area that's crazy. So I need to put a big caveat here that I'm not going to fully cover all the different variants of diffusion models with all of their proper mathematical formalism. Instead, what I'm going to try to do is give you an intuitive overview of diffusion models, as well as an intuitive geometric understanding of the most common form of diffusion models today, which are called rectified flow models.
[00:31:44] You could really teach many, many lectures about diffusion models and get into all the interesting mathematical nuance of all these flavors, but we just won't have time for that in two-thirds of one lecture, unfortunately. So, with that caveat aside, the intuition behind diffusion models is actually kind of easy. Like with all generative models, we want to draw samples, and kind of like GANs, we want to convert samples from a noise distribution PZ into a data distribution PX. But the way that we're going to do that in diffusion models is totally different. GANs learn a deterministic mapping through the generator to map a Z directly to an X. With a diffusion model, we're going to do something more implicit, more indirect.
[00:32:26] So, first off, the first constraint in diffusion models is that the noise distribution always has to have the same shape as our data. So if you have an image that's H x W x 3, then your noise distribution always has to be H x W x 3 as well; they have to be exactly the same shape. Now what we're going to do is consider different versions of our data that are corrupted by increasing levels of noise. So here, if we have a data sample, which is this picture of a cat, then t is going to be our noise level, which ranges from 0 to 1. So t = 0 means no noise; that means a totally clean data sample. t = 0.3 is a little bit of noise: we mix some of our noise z into our data x.
[00:33:14] And if we go all the way to t = 1, we get full noise, and those are going to be samples directly from our noise distribution. So somehow this t parameter is going to interpolate smoothly between our data distribution and our noise distribution. And the noise distribution, again, is going to be Gaussian, almost always Gaussian, something simple that we understand and can sample from. And now what we're going to do is train a neural network to do a little bit of incremental denoising. So the neural network is going to receive some sample, a piece of data which has been corrupted with some intermediate amount of noise, and the neural network is going to be trained to try to clean it up a little bit, to remove just a little bit of the noise.
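To make "mixing noise z into data x at level t" concrete, here is one common choice of corruption: the linear schedule that the rectified flow models later in the lecture use. Other diffusion formalisms use different schedules, so treat this as one illustrative instance, not the definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, z, t):
    """Mix noise z into data x at noise level t in [0, 1].
    t = 0 gives clean data; t = 1 gives a pure noise sample. This linear
    mixing is the rectified-flow choice; other diffusion variants differ."""
    return (1.0 - t) * x + t * z

x = rng.standard_normal((32, 32, 3))  # stand-in for a data sample (the cat)
z = rng.standard_normal(x.shape)      # Gaussian noise, same shape as the data

x_clean = corrupt(x, z, 0.0)  # t = 0: totally clean data sample
x_mid   = corrupt(x, z, 0.3)  # t = 0.3: a little bit of noise mixed in
x_noise = corrupt(x, z, 1.0)  # t = 1: a sample from the noise distribution
```

Notice the shape constraint from above in action: `z` must have exactly the same shape as `x` for the mixing to be defined.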
[00:33:55] So the training objective here is going to be: the neural network inputs an image with some amount of noise, and it tries to remove some of the noise. Then at inference time, what we're going to do is an iterative procedure where we first draw a noise sample directly from our noise distribution PZ, and then iteratively apply the neural network to remove noise from that sample, a little bit at a time. So the very first time we do this, we're going to draw a sample that's complete noise, and then on the very first application of the neural network, the network will be trying to remove noise from full noise. So it'll basically be forced to hallucinate just a tiny whiff of data structure in that noise.
[00:34:35] And then once we get to some slightly less noisy example, we're going to pass it back to the neural network and again ask it to remove just a little bit of noise from this now slightly denoised, slightly generated image, and then it'll get a little bit less noisy, and a little bit less noisy, and a little bit less noisy. And eventually, if we set up all of this stuff correctly, then we want to get to a situation where we can draw a complete noise sample and ask the network to remove noise from that complete sample Z of random noise until eventually we've removed all the noise and come up with a generated sample from the system. So that's kind of a weird setting; it's kind of a weird thing. But that's kind of the intuition behind diffusion models. Is the number of steps a fixed hyperparameter? It depends.
[00:35:19] So on this slide, I was intentionally, I was sort of forced to be, very vague about all these things. What is the noise? What does it mean to corrupt the data with respect to the noise? What does it mean to remove a little bit of the noise? What does it mean to apply the network iteratively at inference? Because, like I said, there are so many different formalisms of diffusion, and there are a lot of different variants of exactly what these mean in different situations. So this slide is intended to be a fairly high-level overview of diffusion, and then different specific implementations of diffusion models will have different concrete choices for what all these terms specifically mean. So does this high-level picture of diffusion kind of make sense? Okay. So then let's make this more concrete.
[00:35:57] So now we're going to jump from general diffusion models to a particular category of diffusion models called rectified flow models. Some people may argue with me and say that rectified flow is not diffusion; some people might say that they're different things. I don't really care. To me, rectified flow is a kind of diffusion model. Fight me. So with rectified flow, the intuition is basically this. We have the same setup: we have our P noise, we have our P data. And we're going to draw this geometrically, because I think that's a nice way to gain intuition, geometrically, in two dimensions in particular, because that's all that fits on the slide. But of course, these are going to be super, super high-dimensional images and Gaussians, which is an easy way to get led astray, right?
[00:36:36] Because intuitions that hold in two and three dimensions go like totally out the window when you go to a lot of dimensions. It's really sad. It's sort of sad that we live in such a low-dimensional universe, because the intuitions that we build in this universe just don't really transfer to 100-dimensional spaces, thousand-dimensional spaces. So always be aware, but it is what it is; we're stuck with the universe we got. So the setup in rectified flow is that we've got our distribution P noise and our distribution P data. P noise is something simple that we understand, that we can sample from, that we can compute integrals of. It's a very friendly distribution. P data, again, is something crazy. That's what the universe is using to give us images.
[00:37:14] Now, at every training iteration, we're going to sample a Z from our prior distribution and sample an X from our data distribution. We can draw a sample z analytically because PZ is something simple that we control, and drawing a sample from the data distribution just means picking an example from your finite training set. You're also going to choose a t uniformly on 0 to 1. Remember, t is our noise level, where t = 0 means no noise and t = 1 means all the noise. So now we're going to draw a line that points from our data sample x directly to our noise sample z. And this line, this vector pointing from X to Z, we're going to call V. This is going to be the velocity of a flow field.
[00:38:01] And then we set XT to be a point along this line, which is a linear interpolation between X and Z. So now we've got our noise sample Z, our data sample X, we've got the velocity vector V between them, and we've picked a noised version of our data, XT. And this is what "get noisy data" from the previous slide means in the case of rectified flow models: it's a linear interpolation between a data sample and a noise sample. And now the training objective is very, very simple. We're going to train a neural network f theta, so that's f with learnable parameters theta. That neural network is going to input the noised sample xt as well as the noise level t, and it's going to try to predict the green vector v. So that's it. That's all we need to do in rectified flow.
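Written out with the lecture's symbols, the quantities just described are the following. Note this uses the lecture's convention that v points from the data sample x toward the noise sample z; some papers flip the sign.

```latex
x_t = (1 - t)\,x + t\,z, \qquad v = z - x,
\qquad
\mathcal{L}(\theta) =
\mathbb{E}_{\,x \sim p_{\mathrm{data}},\; z \sim p_{\mathrm{noise}},\; t \sim \mathcal{U}(0,1)}
\left\| f_\theta(x_t, t) - v \right\|_2^2 .
```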
[00:38:46] The code for this is very simple. You would be shocked at how much obscurity there is when you read papers here, and it boils down to this very simple code. It drives me crazy that this is not made more clear in a lot of presentations of this. So the training loop for rectified flow is extremely simple. You loop over your data set, and at every iteration you get Z, which is unit Gaussian of the same shape as X. You choose a noise level T, which is uniform on 0 to 1. You compute XT, which is a linear interpolation between X and Z. You give XT and T to your model, and then your loss is just the mean squared error between this ground-truth V and the model prediction. And that's it. That's your training objective for rectified flow models. Contrast this with GANs.
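The training loop just described can be sketched as follows. Hedged assumptions: `f_theta` here is a hypothetical stand-in (a single linear layer over the flattened input plus t) where a real implementation would use a U-Net or transformer, and the parameter update itself is left to whatever autodiff framework you use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the network f_theta: one linear layer that sees
# the flattened noised input x_t together with the noise level t. Only the
# surrounding training loop is the point of this sketch.
D = 3072  # e.g. a flattened 32x32x3 image
W = np.zeros((D + 1, D))

def f_theta(x_t, t):
    inp = np.concatenate([x_t, np.full((x_t.shape[0], 1), t)], axis=1)
    return inp @ W

def train_step(x):
    """One rectified-flow training iteration, as described in the lecture."""
    z = rng.standard_normal(x.shape)   # unit Gaussian, same shape as x
    t = rng.uniform(0.0, 1.0)          # noise level, uniform on [0, 1]
    x_t = (1.0 - t) * x + t * z        # linear interpolation between x and z
    v = z - x                          # ground-truth velocity vector
    loss = np.mean((f_theta(x_t, t) - v) ** 2)  # mean squared error
    # (the gradient update of W from this loss is omitted in this sketch)
    return loss

batch = rng.standard_normal((8, D))    # stands in for a minibatch of data
loss = train_step(batch)
```

Every iteration draws fresh z and t, so the same data example is seen at many noise levels over the course of training.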
[00:39:31] When you train rectified flow models, or really any kind of diffusion model, you have a loss that you can look at during training. When the loss goes down, the model is generally better. So for those of us that went through half a decade of GAN madness, the first time you train a diffusion model and there's a loss to look at, it's like, oh my god, this is an amazing thing. Like, how many hours have we spent looking at GAN plots that look like this, where you have no idea; it's like reading tea leaves to tell whether or not the model is working well. You train a diffusion model, you get this beautiful, smooth, exponential loss curve, and it just makes you so happy. So that's great. So that's training for diffusion models. Now, what do we do at inference? Right? Because GANs are kind of easy at inference.
[00:40:12] With GANs you just take a Z, pass it through your generator, and you get a data sample; very straightforward. But now with a diffusion model, or a rectified flow model in this case, the model output itself is kind of useless on its own. We give it an XT, we get a V. What are we going to do with this? Not super clear. So inference time is where diffusion models get a little bit more complicated compared to GANs. So at inference, we first will choose, up front, a number of steps T, which is usually a fixed constant, and in the case of rectified flow models T = 50 is usually a good number to start with. Sometimes you can get down to T = 30 and that works okay. Then what you're going to do is sample an x directly from your noise distribution. This is going to be pure noise that's sampled from your known noise distribution.
[00:40:54] Then you're going to loop from t = 1, marching backwards to t = 0. This is your noise level. In this simple version, we're just sort of marching linearly from full noise at one back to noise level zero, perfectly clean. Then at every iteration, we're going to take our XT, which was at first full noise, pass it to the network along with the current noise level, and get the network's predicted VT. And remember what this VT is supposed to be in the case of rectified flow: this V was supposed to point from a data sample all the way to a noise sample. So then it's kind of geometrically obvious what you should do in the case of rectified flow: you should take a little step along that predicted V vector.
[00:41:40] But, right, the problem is that this rectified flow model's V is not going to point you all the way to a clean sample. It's just going to get you started; it's going to set you on a trajectory towards a clean sample. So we take a little step along that predicted V from the flow model to get a new X, which is now a version of the data that has had a little bit of the noise removed from it by the model. And now we iterate this. So once we have this X at noise level 2/3, we pass it back to the model and get another predicted V vector from the model. And remember, the V is supposed to point from a clean sample all the way to a noise sample. So then again we can take a little step along this predicted V to get another X at noise level 1/3. Repeat this thing again.
[00:42:21] Re-evaluate the model again to get another predicted V, and then take a step, in this case all the way to no noise, all the way to the end of that vector, to get our predicted X0, and then that is our sample from our diffusion model. So the inference procedure you see here got a little bit more complicated compared to GANs. But what we gained here was sanity when you're training; you've regained that. And they tend to give you much better samples, and they tend to scale really well to large data sets and large models. And the code here is really simple, right? So we start off by taking a random sample, make it be perfectly random, then march backwards for t from one back to zero. At every noise level, you get a predicted V from the model, given your current sample as well as your T.
[00:43:06] Then you take what looks kind of like a gradient descent step on the model's predicted V, update the sample, and just repeat this whole thing in a loop. So then you can see these diffusion models aren't so scary after all. You can actually fit a complete implementation of training and sampling from a rectified flow model in just a couple of lines, on one slide, which I think is very nice. Okay, so this is pretty nice.
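The sampling loop just described can be sketched in a few lines of Python. This is a hedged illustration, not the slide's actual code: `toy_velocity_model` is a made-up stand-in for a trained network, chosen so the loop is runnable (it assumes the data distribution is a point mass at zero, in which case the exact velocity at (x, t) is x / t).

```python
import numpy as np

def toy_velocity_model(x, t):
    # Hypothetical stand-in for a trained velocity network. Under the
    # linear path x_t = (1 - t) * x0 + t * z, the target velocity is
    # v = z - x0. If the data distribution is a point mass at 0, then
    # x_t = t * z, so the exact velocity at (x, t) is x / t.
    return x / max(t, 1e-8)

def sample_rectified_flow(model, shape, num_steps=3, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)          # pure noise at t = 1
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        v = model(x, t)                     # predicted velocity at (x, t)
        x = x + (t_next - t) * v            # Euler step toward t = 0
    return x                                # predicted clean sample x0

x0 = sample_rectified_flow(toy_velocity_model, shape=(4,))
```

With three steps, the trajectory passes through x at t = 2/3 and t = 1/3, matching the walkthrough above, and the toy model drives the sample exactly to the data mean (here, zero).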
[00:43:30] I'm pretty happy that we're able to get to a full implementation. And this will actually work, right? If you take this code and plug it into a reasonable model architecture, this will actually converge to something kind of reasonable in a lot of cases. You're kind of hitting on the core problem in generative modeling that I've been thinking about a lot the last couple of days while reviewing these slides. The core problem in generative modeling is: somehow you have a prior distribution, the Z's, that you know how to sample from; you have a data distribution, the X's, that you want to generate; and the core problem is figuring out how to associate Z's and X's. And all your different categories of generative modeling kind of do it in different ways. Right?
[00:44:07] In a VAE, you say: I'm going to have the model predict a Z and then predict an X, and then try to force that Z to be something I know how to sample from. Which doesn't work that well. In a GAN, you're not supervising that relationship; the generator is kind of figuring out its own mapping from Z to X in a feed-forward way, through this distribution-matching objective that the discriminator is giving it. In diffusion, it ends up having to integrate these curves. And there are actually several different mathematical formalisms as to why objectives that look like this end up matching probability distributions in a reasonable way.
[00:44:46] But again, the whole core problem is that we have no way ahead of time to pair up samples Z from our prior with samples X from our data. If we knew how to make that pairing, and also knew how to sample from the prior, you'd be done. And in some sense, all these different forms of generative modeling are different ways to square that circle: to come up with a way to learn an association from Z to X, and to be able to sample from Z, even though we don't have that association at training time. There are a lot of different interpretations of this that can get very, very heavy very quickly, so I've tried to avoid them. Right? But we said last lecture that unconditional generative modeling is kind of pointless.
[00:45:27] So what we almost always care about is conditional generative modeling, and that's easy to accommodate in rectified flow. To do conditional rectified flow, we imagine that there are different sub-parts of our data distribution. Here I'm saying it's categorical: maybe our data is actually squares and triangles. Then we have our whole data distribution P_data, as well as our two sub-distributions: P_data(x | y = square) and P_data(x | y = triangle). This is the picture you should have in mind when you think about conditional generative modeling. In the case of rectified flow, this is very easy to accommodate: your dataset now has pairs (x, y), and your model now takes y as an additional auxiliary input somehow.
Um and then during sampling same thing you [00:46:07] and then during sampling same thing you get your predicted V's uh according and [00:46:09] get your predicted V's uh according and you you get your predicted V's you know [00:46:11] you you get your predicted V's you know the model takes as input this extra Y uh [00:46:14] the model takes as input this extra Y uh and you use that. So this all kind of [00:46:15] and you use that. So this all kind of goes through. Um the difference is that [00:46:17] goes through. Um the difference is that now Y is actually hopefully some [00:46:19] now Y is actually hopefully some conditional signal that the user can [00:46:21] conditional signal that the user can control. Maybe this is a text prompt. [00:46:22] control. Maybe this is a text prompt. Maybe this is an input image. Maybe this [00:46:24] Maybe this is an input image. Maybe this is this is some kind of user input that [00:46:26] is this is some kind of user input that you're expecting at inference time. Um [00:46:27] you're expecting at inference time. Um which actually make these models [00:46:28] which actually make these models controllable and useful in practice. [00:46:31] controllable and useful in practice. Um, but then there's another really [00:46:32] Um, but then there's another really interesting question is um, is there any [00:46:35] interesting question is um, is there any knob you can tune to control how much [00:46:37] knob you can tune to control how much the model pays attention to the [00:46:39] the model pays attention to the conditioning signal, right? It turns out [00:46:40] conditioning signal, right? It turns out if you train these things naively, a lot [00:46:42] if you train these things naively, a lot of times they don't often follow the [00:46:44] of times they don't often follow the conditioning signal quite as much as you [00:46:45] conditioning signal quite as much as you would like. Um, so there's a trick [00:46:47] would like. 
[00:46:50] So there's a trick called classifier-free guidance, or CFG, that changes our diffusion training loop just a little bit. We're still going to train this conditional diffusion model that inputs your x_t and inputs your y, but on every training iteration we're going to flip a coin. If that coin is heads, we're going to delete the conditioning information: set it equal to some kind of zero value or null value. Basically, destroy the conditioning information 50% of the time. That fraction could be a hyperparameter, but 50% is a pretty good one that most people use in practice. So we flip a coin, and if the coin is heads, we delete the conditioning information.
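The coin-flip conditioning dropout can be sketched like this. It is a minimal illustration with made-up names: `NULL_LABEL` is a hypothetical sentinel standing in for whatever null conditioning value a real model uses.

```python
import numpy as np

NULL_LABEL = -1  # hypothetical sentinel meaning "conditioning deleted"

def drop_conditioning(y, p_drop=0.5, rng=None):
    # The CFG training trick: with probability p_drop (the coin flip),
    # replace each label in the batch with the null value, so the model
    # is forced to also learn the unconditional velocity field.
    rng = rng or np.random.default_rng()
    coin = rng.random(len(y)) < p_drop
    y = np.asarray(y).copy()
    y[coin] = NULL_LABEL
    return y

labels = drop_conditioning([0, 1, 2, 3], p_drop=0.5,
                           rng=np.random.default_rng(0))
```

Each surviving entry is either the original label or `NULL_LABEL`; everything else about the training step stays the same.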
[00:47:25] That means the model is conceptually now forced to learn two different kinds of velocity vectors. In the case where we pass it this null value for y, the one that has destroyed the conditioning information, this is basically an unconditional generative model: that predicted velocity vector V has to point back towards the meat of the whole data distribution P_data. But when we pass a real conditioning input y that's non-destroyed, non-null, non-zero, then we're getting a conditional velocity vector that points back not towards the full data distribution, but towards the conditional data distribution, conditioned on that conditioning signal we cared about.
[00:48:14] And then the dumb trick is: we're going to take a linear combination of these two vectors to push it more towards the conditional velocity vector. In particular, we'll have a scalar hyperparameter w and take the linear combination (1 + w)·v_y − w·v_null. That's a vector that now points even more towards the conditional distribution than it does towards the data distribution.
[00:48:42] Then the idea is that during sampling, we're now going to step according to this v_CFG vector rather than the raw vectors predicted by the model. Setting w equal to zero here will recover exactly the conditional one, and the higher your w is, the more you're overemphasizing the conditioning signal. And this is pretty easy to implement, right? Your inference code doesn't really change too much, but now you evaluate the model twice at every iteration to get your v_y and your v_null, and then you take this linear combination and step according to that. And this is called classifier-free for a stupid reason: there was an earlier paper called classifier guidance that I don't want to get into.
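The guidance combination itself is one line. A minimal sketch, assuming the two model evaluations have already produced a conditional velocity `v_y` and an unconditional velocity `v_null` (both toy values here):

```python
import numpy as np

def cfg_velocity(v_y, v_null, w):
    # Classifier-free guidance: v_cfg = (1 + w) * v_y - w * v_null.
    # w = 0 recovers the purely conditional velocity; larger w pushes
    # the step further toward the conditional distribution.
    return (1.0 + w) * v_y - w * v_null

v_y = np.array([1.0, 0.0])      # toy conditional prediction
v_null = np.array([0.5, 0.5])   # toy unconditional prediction
print(cfg_velocity(v_y, v_null, 0.0))  # w = 0: exactly v_y
print(cfg_velocity(v_y, v_null, 2.0))  # w = 2: [2., -1.], pushed past v_y
```

The sampler then takes its Euler step along `v_cfg` instead of the raw prediction, at the cost of two model evaluations per step.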
[00:49:31] Then they removed the classifier, and even though there were only about 9 months between those two papers, and it's now been 4 years since the second one, we're still stuck with the name classifier-free guidance. So it is what it is. Okay. So that's actually really important in practice for getting high-quality outputs. That's CFG; it's used everywhere in diffusion models. It does double the cost of sampling, though, because now you need to hit the model twice on every iteration, which is kind of a problem. Okay. There's this thing on optimal prediction; I think I'll skip that. I mean, it is interesting, but I'm worried about time. But one thing that we sometimes need to do is tweak this t distribution.
[00:50:11] We saw in particular that we were sampling t according to a uniform distribution in a raw rectified flow model. The thing about that is it's going to put uniform emphasis on all noise levels. And intuitively, when you're at full noise, the problem is very easy: the optimal prediction from the model is basically to point towards the mean of the data distribution. Similarly, when you're at zero noise, the optimal prediction is actually to point towards the mean of the noise distribution. So the optimal predictions from the model at full noise and at no noise are actually relatively easy problems; it just needs to learn the means of those two distributions.
[00:50:55] But when you're somewhere in the middle, it's actually really, really hard, right? Because when you're somewhere in the middle and you sample that x_t, there might have been multiple pairs (x, z) that could have given rise to that same x_t. The network basically needs to solve this expectation problem and figure out the optimal direction to predict, one that integrates over all possible x's and z's that might intersect at this point x_t. So those points in the middle are intuitively much harder for the network to solve. But when we sample t uniformly from 0 to 1, we're putting equal importance on all levels of noise, which doesn't really match this intuition. So in practice you'll often sample from different noise schedules.
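The colliding-pairs intuition above can be made concrete with a tiny, purely illustrative example. Assume a dataset of just two points, +1 and −1, and the linear path x_t = (1 − t)·x0 + t·z: two different (x0, z) pairs can land on the same x_t at t = 1/2, so the loss-optimal prediction there is the average of their conflicting velocity targets.

```python
# Two different (clean sample, noise sample) pairs that collide at the
# same midpoint x_t under linear interpolation x_t = (1 - t)*x0 + t*z.
t = 0.5
pairs = [(1.0, -1.0), (-1.0, 1.0)]            # (x0, z)
xts = [(1 - t) * x0 + t * z for x0, z in pairs]
assert xts[0] == xts[1] == 0.0                # both pairs give x_t = 0

targets = [z - x0 for x0, z in pairs]         # per-pair velocity targets
optimal_v = sum(targets) / len(targets)       # best single answer: the mean
print(targets, optimal_v)                     # [-2.0, 2.0] 0.0
```

The two targets point in opposite directions, so the network's best prediction at that x_t is their expectation, which is exactly the hard averaging problem mid-noise points create.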
[00:51:37] One very popular one is called logit-normal sampling, which looks kind of like a Gaussian: it puts relatively little weight on zero and one, with a lot more weight in the middle. Another thing you'll sometimes see are these so-called shifted noise schedules, which are asymmetric and shift more towards one direction or the other. Those are important as we scale to high-resolution data. The intuition is that when you have a very high-resolution image, there can be very strong correlations across neighboring pixels; when you have a low-resolution image, the correlations across neighboring pixels tend to be smaller. So depending on how strong the correlations in your data are, you may actually need different amounts of noise to properly destroy information in a nice way.
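A logit-normal t is just a Gaussian sample pushed through the sigmoid, which is one common way this schedule is implemented. A sketch, with the mean and standard deviation as tunable assumptions (shifting the mean gives an asymmetric, "shifted" schedule):

```python
import numpy as np

def sample_t_logit_normal(n, mean=0.0, std=1.0, rng=None):
    # Logit-normal sampling: draw a Gaussian, then squash through the
    # sigmoid so t lands in (0, 1), with most mass in the middle and
    # little weight near the endpoints. A nonzero `mean` shifts the
    # schedule toward higher or lower noise levels.
    rng = rng or np.random.default_rng()
    u = rng.normal(mean, std, size=n)
    return 1.0 / (1.0 + np.exp(-u))

t = sample_t_logit_normal(10_000, rng=np.random.default_rng(0))
```

Compared to uniform sampling, most of the drawn t values concentrate around the hard mid-noise region rather than the easy endpoints.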
[00:52:17] So these things don't naively scale to different resolutions, right? And that's actually a big problem with these diffusion models: they're a beautiful formulation, but it's hard to get them to work naively on high-resolution data. So that leads to... you know, I said diffusion models are the most popular form of generative modeling. That was a little bit of a lie, because what's actually most popular are these so-called latent diffusion models, a variant that actually gets used everywhere. So here it's going to be a multi-stage procedure. What we're going to do first is train an encoder network and a decoder network.
[00:52:48] The encoder network is going to map from our image into some latent space, which I've colored in purple. Ideally that latent is going to spatially downsample the image by a factor of D, as well as convert from three channels up into C channels. A pretty common setting is to get 8x8 spatial downsampling and to increase to 16 channels; that's what some of the most common encoder-decoders do. These encoder-decoders tend to be CNNs with attention, but some more recent papers have explored ViTs for these. Then what we do is train a diffusion model not on the raw pixel space of our images, but instead on the latent space which is discovered by this encoder-decoder model. So then the training procedure for the diffusion model looks like this.
[00:53:30] We're going to sample an image, pass it through the encoder that we learned in the first stage to get a latent, then add noise to the latent and train the diffusion model to denoise the noised latent. And really importantly, you freeze the encoder. You do not propagate gradients back into the encoder; we're only using it to extract these latents, and then training a diffusion model directly on the latent space which is learned by the encoder. Then at inference time, once you're all done training, we'll sample a random latent, pass it through the diffusion model many, many times to remove all the noise, and get a clean sample. But that clean sample is now a clean sample in latent space. So then we need to run the decoder to convert that clean latent into a clean image. And this is actually the most common form of diffusion model these days.
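One latent-diffusion training step can be sketched as below. Everything here is a toy stand-in: a real pipeline would use a pretrained VAE encoder/decoder and a learned velocity network, while this version fakes the 8x spatial downsampling so the data flow (encode with a frozen encoder, noise the latent, form the regression target, decode at the end) is runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a frozen VAE: subsampling for "encode",
# nearest-neighbor upsampling for "decode". No gradients ever flow here.
encode = lambda img: img[::8, ::8]
decode = lambda lat: np.repeat(np.repeat(lat, 8, axis=0), 8, axis=1)

def latent_training_step(image, t):
    latent = encode(image)                  # frozen encoder, no backprop
    noise = rng.standard_normal(latent.shape)
    noised = (1 - t) * latent + t * noise   # noise the LATENT, not pixels
    target_v = noise - latent               # velocity regression target
    return noised, target_v

image = rng.standard_normal((64, 64))       # toy single-channel "image"
noised, target_v = latent_training_step(image, t=0.5)
# At inference, the fully denoised latent goes back through decode(...)
reconstruction = decode(encode(image))
```

The diffusion model itself never sees pixels: it is trained and sampled entirely in the 8x-smaller latent space, and the decoder is only applied once at the very end.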
[00:54:14] So you might be asking: okay, we've got this encoder-decoder; how do we train an encoder-decoder? Any ideas? Have we seen encoder-decoders? How about a variational autoencoder? So in practice, whenever you're training these latent diffusion models, this encoder-decoder tends to be a variational autoencoder. But we just said there was a big problem with variational autoencoders: they give you blurry outputs, right? And if this encoder-decoder is going to give you blurry outputs, the quality of the reconstructions you get out of the decoder is going to bottleneck the quality of the generations you get out of the downstream diffusion model. So if your encoder-decoder is giving you blurry, ugly reconstructions, that's not going to fly; that's not going to get us good clean samples.
[00:54:59] So, anyone have an idea for cleaning up the sample quality of a VAE? Put something after the decoder; in particular, we can make it a GAN. So what we tend to do is actually train this encoder that encodes from an image into latent space, a decoder that goes from latent space back to images, a discriminator that tries to tell the fake images from the real images, and then a diffusion model that generates these things in latent space. So this is basically why we had to walk through all of these different formulations of generative models, in order for you to understand the modern pipeline. Basically, is the state of the art in generative modeling a VAE? Is it a GAN? Is it diffusion? It's all of them, right? It's all of them.
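Schematically, the autoencoder stage optimizes a reconstruction-plus-KL (VAE) objective together with an adversarial term from the discriminator. The softplus/logistic forms and the `adv_weight` below are common choices, not necessarily the exact ones used in any particular latent-diffusion codebase:

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + e^x).
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def autoencoder_loss(recon_err, kl, d_fake_logit, adv_weight=0.5):
    # VAE part (reconstruction + KL) plus a non-saturating GAN term that
    # rewards the decoder for producing images the discriminator calls real.
    return recon_err + kl + adv_weight * softplus(-d_fake_logit)

def discriminator_loss(d_real_logit, d_fake_logit):
    # Standard logistic discriminator loss: real -> high logit, fake -> low.
    return softplus(-d_real_logit) + softplus(d_fake_logit)
```

Once this autoencoder is trained, the diffusion model is trained purely in its latent space, as in the previous step.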
[00:55:42] The modern generative modeling pipeline involves training a VAE and a GAN and a diffusion model. I'm sorry, it's a mess. Okay, so then you might ask: what do the neural networks actually look like under the hood here? Thankfully there is some sanity here over the last couple of years. It turns out that relatively straightforward transformers can actually be applied to these diffusion models, and they work really well. These are typically called diffusion transformers, or DiTs, but basically these are just standard transformer blocks that don't really have much special sauce in them. There are a couple of questions; the main question you need to solve on the architecture side is: how do you inject the conditioning information?
[00:56:23] In particular, the diffusion model now needs to take three things as input. It needs to take your noisy image. It needs to take your timestep t. It also needs to take your conditioning signal, which might be your text or something like that. And then you have a couple of different mechanisms for injecting that conditioning signal into your transformer blocks. The first is to predict a scale and shift that are used to modulate some of the intermediate activations of your diffusion block, and that's typically the way that we inject the timestep information into diffusion models. Another thing you can do is exploit the fact that transformers are just models of sequences, so you can jam everything into the sequence.
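Here is a toy version of that scale-and-shift ("adaLN"-style) modulation just described. The projection matrix `W`, bias `b`, and the sinusoidal timestep embedding are illustrative choices, not a specific model's parameters:

```python
import numpy as np

def timestep_embedding(t, dim=8):
    # Sinusoidal embedding of the scalar diffusion timestep (common choice).
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    return np.concatenate([np.cos(t * freqs), np.sin(t * freqs)])

def modulate(x, t, W, b):
    # Map the timestep embedding to a per-channel (scale, shift) pair,
    # then apply it to layer-normalized activations inside the block.
    scale, shift = np.split(timestep_embedding(t) @ W + b, 2)
    x_norm = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
    return x_norm * (1.0 + scale) + shift

rng = np.random.default_rng(0)
channels = 4
W = rng.standard_normal((8, 2 * channels)) * 0.01  # hypothetical learned weights
b = np.zeros(2 * channels)
x = rng.standard_normal((16, channels))            # 16 tokens, 4 channels
out = modulate(x, t=0.3, W=W, b=b)                 # same shape as x
```

In a real DiT, `W` and `b` are learned, and each transformer block gets its own modulation parameters.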
[00:56:57] You can jam the timestep into the sequence, you can jam your text into the sequence, you can jam whatever you want into the sequence, and have the transformer just model that sequence of data altogether. And you can do that either via cross attention or joint attention, and different models do both. Typically, in modern DiTs, you inject the timestep through this scale-shift mechanism, and you inject the text or other conditioning signal through sequence concatenation, usually cross attention but sometimes joint attention as well. So then how can you actually apply this to different problems? One task that people care about a lot is the task of text-to-image generation. Here we're going to input a text prompt. This is one that I wrote yesterday.
[00:57:36] "A professional documentary photograph of a monkey shaking hands with a tiger in front of the Eiffel Tower. Monkey is wearing a hat made out of bananas. Tiger is standing on two legs and wearing a suit." And this is a real sample; it's crazy that this stuff works now. But I'm sure you've all seen these kinds of things before. The way that this works is you'll take your text prompt and pass it through, usually, a pre-trained text encoder. So actually, I lied: there are more models you have to train, in addition to an encoder and a decoder and a VAE and a discriminator. You also need to train a language model, secretly, to get these things to work. So you'll typically pick up a pre-trained text encoder, usually T5, CLIP, something like that, to give text embeddings, and usually the text encoder will be frozen.
[00:58:12] Then you pass your text embeddings together with your noisy latents into your diffusion transformer, which also gets your diffusion timestep. That outputs clean latents, and this thing will run iteratively, and the result goes through your VAE decoder to give you your final image. And just to put some numbers on this to make it concrete: one pretty powerful open-source model right now is called FLUX.1-dev. They use the T5 and CLIP encoders. Their encoder uses 8x downsampling. They train a 12-billion-parameter transformer model on this, and that transformer has an additional layer of downsampling on top of the VAE, which is kind of messy, so it ends up having a sequence length of 1024 image tokens. Another task that people care about a lot is text-to-video.
[00:58:55] So we can input a text prompt and then output the pixels of a video that follows that text prompt. And the pipeline basically looks the same. You're going to input text through your pre-trained text encoder and get noisy latents. Importantly, the only difference is that your latents now have an extra dimension to accommodate time. So in addition to the two spatial dimensions, H and W, you'll also have a time dimension in your latent, and that will give you clean latents. Your decoder is now typically going to be a spatio-temporal autoencoder, so it downsamples both spatially and temporally, and it will take your latents and upsample them into pixels, which gives you a video. And this is actually a generated video from Meta's Movie Gen paper that came out last year.
[00:59:39] And that's putting some particular numbers on this thing. The key takeaway of these video generation models is that they get very expensive to train due to the sequence length, right? Because if you want to generate high-resolution, high-frame-rate video, it just ends up with a lot of tokens. We said that with a fairly state-of-the-art text-to-image diffusion model, the transformer ended up working on a sequence of 1024 image tokens. For this text-to-video diffusion model, even though the overall architecture looks pretty similar, the biggest difference is in the sequence length: now they actually need to process 76,000 video tokens to create this high-resolution video with a lot of frames.
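The token counts quoted here follow from simple arithmetic. The downsampling factors below are assumptions chosen to land near the lecture's numbers (an 8x VAE plus a 2x patchify for images; an assumed spatio-temporal autoencoder for video, where 73,728 is in the ballpark of the 76,000 quoted), not any model's published configuration:

```python
def image_tokens(h, w, vae_down=8, patch=2):
    # Each token covers a (vae_down * patch)-pixel square of the image.
    d = vae_down * patch
    return (h // d) * (w // d)

def video_tokens(frames, h, w, t_down=4, s_down=16):
    # A spatio-temporal autoencoder compresses time by t_down, space by s_down.
    return (frames // t_down) * (h // s_down) * (w // s_down)

print(image_tokens(512, 512))        # 32 * 32 = 1024 image tokens
print(video_tokens(128, 768, 768))   # 32 * 48 * 48 = 73728 video tokens
```

The point is the scaling: adding a time dimension multiplies the token count, which is exactly where the training cost of video diffusion comes from.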
[01:00:19] So that's where the expense happens in these video diffusion models: actually processing these really long sequences. I think the last year has pretty much been the era of video diffusion models. It seems like almost every week for the past year there's been a new, interesting video diffusion model coming out. And these have been a mix of open-source models; models that have technical reports, so they give you some details about the model architecture and the training; and purely industrial models, where they don't tell you anything, but they'll take your credit card number and let you generate samples.
So, um, I I'm not going to go through all [01:00:53] um, I I'm not going to go through all these in all of these all of these uh, [01:00:55] these in all of these all of these uh, one by one, but I just wanted to give [01:00:56] one by one, but I just wanted to give the sense of like this has been a really [01:00:58] the sense of like this has been a really hot topic the last like really the past [01:01:00] hot topic the last like really the past 18 months. Um and in particular this uh [01:01:03] 18 months. Um and in particular this uh there was this really influential blog [01:01:04] there was this really influential blog post from OpenAI called Sora that came [01:01:06] post from OpenAI called Sora that came out in March 2024 which was not the [01:01:09] out in March 2024 which was not the first um diffusion model on videos but [01:01:11] first um diffusion model on videos but it was the first one that gave really [01:01:12] it was the first one that gave really really really good results. Um and they [01:01:14] really really good results. Um and they kind of adopted this modern sort of [01:01:16] kind of adopted this modern sort of diffusion transformer plus rectified [01:01:18] diffusion transformer plus rectified flow. Um actually I don't know if they [01:01:20] flow. Um actually I don't know if they were using rectified flow in Sora. I [01:01:21] were using rectified flow in Sora. 
[01:01:23] I don't know if they said. But they were one of the first to really scale up these diffusion transformers and get this thing to work really well, and that was kind of the four-minute-mile moment in video diffusion models: all the other big companies took notice and quickly tried to replicate Sora. So, as I said, it's felt like for the past year and a half that almost every week there's been a brand new state-of-the-art video diffusion model. And today is no exception, because an hour and a half ago, at 11:00 a.m. this morning, Google announced Veo 3, which is almost certainly the best generative model of video out there right now. I literally read the blog post while I was in the car on the way here, but it seems cool. Here are some samples from Veo 3.
[01:02:06] So these are actually generated videos from a text prompt in Google's new model. Kind of crazy. Also, this model models sound jointly, so it can output audio along with the video frames. This is another generated one. You can tell it what you want to happen in text: it'll fly over here and, like, look crazy. Okay, so I thought that would just be fun, to incorporate new stuff.
[01:02:34] So one big problem with diffusion is that sampling is really slow, right? We said that sampling is this iterative procedure, and these models can be really big: models with tens of billions of parameters, potentially operating on sequence lengths of tens of thousands or more. So these things get really slow at inference time, because even with rectified flow you need tens of iterations of the model at inference time. The solution is a category of algorithms called distillation, which we don't have time to get into; I just wanted to put a couple of references here to make you aware that this exists as a set of techniques.
[01:03:05] Distillation algorithms are basically ways that you can take a diffusion model that normally would take, you know, 30, 50, 100 iterations at inference time to get good samples, and then modify the model in some way such that you can take many, many fewer steps at inference and still get good samples. They tend to sacrifice sample quality, so the whole trick in distillation methods is trying to maintain the sample quality as well as you can while still taking fewer steps at inference time. Some distillation methods let you get all the way down to single-step sampling, which is really cool, although they tend to take quite a hit on generation quality when you do that.
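As a cartoon of what distillation is doing: a teacher that needs many Euler steps to integrate the learned velocity field produces targets that a student is trained to match in a single call. This is a simplified, progressive/consistency-style sketch of the idea (the target-generation half only), not a specific published method:

```python
import numpy as np

def teacher_rollout(v_teacher, x_t, t, n_steps):
    # Integrate the teacher's velocity field from time t down to 0
    # with n_steps Euler steps; the result is the student's target.
    x, dt = x_t, t / n_steps
    for i in range(n_steps):
        x = x - dt * v_teacher(x, t - i * dt)
    return x

def distillation_loss(v_teacher, student_sample, x_t, t, n_steps=32):
    target = teacher_rollout(v_teacher, x_t, t, n_steps)  # many slow steps
    pred = student_sample(x_t, t)                         # one fast step
    return np.mean((pred - target) ** 2)
```

With the rectified-flow convention v = z - x, stepping x ← x - dt·v moves from noise back toward data; the student learns to jump there directly.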
[01:03:39] I'm not going to go through these, but I just put some references here to different papers on distillation if you want to take a look. This is a really active and evolving area of research: if you look at these references, they're from 2024 and 2025. This is stuff that people are working on right now: how do we get better distillation, and how do we make diffusion models more efficient at inference time? Another thing: I mentioned that diffusion has this black hole of math that you can get sucked into, and we intentionally sidestepped that by walking very intuitively through rectified flow models, giving you a geometric intuition for the problem without really diving into any math that proves anything.
[01:04:20] So I wanted to give you just a brief sense of what some of these formalisms are, but we're not going to be able to go through them in detail. Here's a restatement of the rectified flow objective. We said that during training we're going to sample our x's and our z's according to our data distribution and our noise distribution. We're going to sample t according to some distribution p_t that we choose, either uniform, logit-normal, shifted, something like that. And then we'll set x_t equal to the linear interpolation between x and z. Now we've written this a little bit differently on this slide: we're writing down a ground-truth velocity v_gt that we want the network to predict, which is z minus x.
[01:04:58] We compute a predicted v from the network by passing it our noisy x_t and our t, then compute an L2 loss between v_gt and the predicted v from the network. So when I said that there are a lot of different formalisms, a lot of different flavors of diffusion, what a lot of these come down to is different functional hyperparameters in this general setup. In more generalized flavors of diffusion, you might vary what this p_t distribution is. Usually you don't vary the noise distribution; this is almost always Gaussian, at least for continuous models. But what you will vary is how you compute that noisy x_t, and in general that will be some linear combination of x and z, where the linear combination weights will in general be some function of t.
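The general setup just described can be written as one training step with pluggable pieces. The function names `a`, `b`, `c`, `d` for the four "slots" are our labels for clarity, not the lecture's notation, and the placeholder model stands in for a real network:

```python
import numpy as np

rng = np.random.default_rng(0)

def general_diffusion_loss(x, model, sample_t, a, b, c, d):
    """One generic training step: feed the model x_t = a(t) x + b(t) z
    and regress it onto the target c(t) x + d(t) z with an L2 loss.
    Different diffusion formulations fill these four slots differently."""
    z = rng.standard_normal(x.shape)   # noise distribution: almost always Gaussian
    t = sample_t()                     # t ~ p_t, a design choice
    x_t = a(t) * x + b(t) * z
    target = c(t) * x + d(t) * z
    return np.mean((model(x_t, t) - target) ** 2)

# Rectified flow fills the slots with the simplest possible choices:
# a(t) = 1 - t, b(t) = t, and a constant-coefficient target z - x.
rf_loss = general_diffusion_loss(
    x=rng.standard_normal(16),
    model=lambda x_t, t: np.zeros_like(x_t),  # placeholder network
    sample_t=rng.uniform,                     # p_t uniform here
    a=lambda t: 1 - t, b=lambda t: t,
    c=lambda t: -1.0,  d=lambda t: 1.0,
)
```

Swapping in other choices for the four slots recovers the other formulations discussed next.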
[01:05:44] But what exactly that function is depends on the diffusion formulation. What also varies is the ground-truth target that we ask the model to predict. It's always going to be some linear combination of our data sample x and our latent z, and again the linear combination weights might be functions of t in some formulations. Then we give the model that noisy x_t and the t, get a prediction, and always compute an L2 loss between the two; I mean, not always, but usually. And then what varies is basically these functional forms: what are the functions that we slot into these four different spots in this setup? In the case of rectified flow, it's fairly simple.
[01:06:28] These all take really simple forms, and c_t and d_t are actually just constants. There's another flavor of this called variance preserving, where you collapse these two into one scalar hyperparameter called sigma of t. Now you have these linear combinations in this particular way, and you choose this because if x and z are independent and have unit variance, then your output is also guaranteed to have unit variance. So that collapses these two functional hyperparameters into just one noise schedule, and then you still need to choose that somehow. Alongside variance preserving there's also variance exploding, another one where you set a_t equal to 1 and b_t equal to, again, some sigma of t, and you need to choose that somehow.
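The variance-preserving choice can be checked numerically. A small sketch (toy code of my own, assuming the combination x_t = sqrt(1 - sigma(t)^2)·x + sigma(t)·z): if x and z are independent with unit variance, then Var(x_t) = (1 - sigma^2) + sigma^2 = 1 for any sigma(t).

```python
import math
import random

def vp_sample(x, z, sigma_t):
    # Variance-preserving combination of data x and noise z.
    return math.sqrt(1.0 - sigma_t ** 2) * x + sigma_t * z

# Empirical check: draw many unit-variance x and z, confirm Var(x_t) stays near 1
# no matter what sigma_t is.
random.seed(1)
sigma_t = 0.6
samples = [vp_sample(random.gauss(0, 1), random.gauss(0, 1), sigma_t)
           for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The same check with a different sigma_t gives the same unit variance, which is exactly why this parameterization reduces two functional hyperparameters to one noise schedule.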
[01:07:08] There are a lot of different targets that people will choose. Sometimes they'll ask the network to predict the clean data. Sometimes they'll ask the model to predict the noise that was added. Sometimes they'll ask the network to predict some linear combination of the two. In the case of rectified flow, you are just predicting that velocity vector that points from data directly to noise. But in different flavors of diffusion, all of these can change. Then you might be wondering: choosing hyperparameters is bad enough, and now we need to choose hyperparameters which are themselves functions of t. This is crazy. You're never going to set these intuitively, so you have to be guided by some kind of math.
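One way to see that these target choices (clean data, noise, or velocity) carry the same information: under the rectified-flow interpolation x_t = (1-t)·x + t·z with velocity v = z - x, any one prediction can be converted to the others given x_t and t. A sketch (my own helper name, not from the lecture):

```python
def targets_from_v(x_t, t, v):
    """Recover the clean-data and noise estimates from a velocity prediction,
    assuming x_t = (1-t)*x + t*z and v = z - x."""
    x_hat = x_t - t * v            # (1-t)x + tz - t(z - x) = x
    z_hat = x_t + (1.0 - t) * v    # (1-t)x + tz + (1-t)(z - x) = z
    return x_hat, z_hat
```

So the choice of target changes the loss weighting across t more than the information the network must learn.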
Um and there's basically three different [01:07:45] there's basically three different mathematical formalisms that people um [01:07:47] mathematical formalisms that people um you that people think about when [01:07:49] you that people think about when training diffusion models that again we [01:07:50] training diffusion models that again we will not walk through in practice. I [01:07:51] will not walk through in practice. I just want to get you the to know the [01:07:53] just want to get you the to know the existence of. Um the first is that [01:07:55] existence of. Um the first is that diffusion is a latent variable model, [01:07:56] diffusion is a latent variable model, right? that we're going to we have our [01:07:58] right? that we're going to we have our clean data samples X0 but then [01:08:00] clean data samples X0 but then associated to every clean data sample. [01:08:02] associated to every clean data sample. There exists some sequence of corrupted [01:08:03] There exists some sequence of corrupted or noisy samples like that that [01:08:06] or noisy samples like that that correspond to that clean sample and we [01:08:08] correspond to that clean sample and we can't observe them. We don't know what [01:08:09] can't observe them. We don't know what they are but we need to figure them out [01:08:10] they are but we need to figure them out somehow. So that's a latent variable [01:08:12] somehow. So that's a latent variable model that ends up looking a lot like a [01:08:14] model that ends up looking a lot like a variational autoenccoder. Remember in a [01:08:15] variational autoenccoder. Remember in a variational autoenccoder we had a Z and [01:08:17] variational autoenccoder we had a Z and an X. We didn't observe the Z. We wanted [01:08:19] an X. We didn't observe the Z. We wanted to train this thing somehow. Um then it [01:08:21] to train this thing somehow. 
Um then it turns out you can turn a very similar ma [01:08:23] turns out you can turn a very similar ma use a very similar mathematical trick um [01:08:25] use a very similar mathematical trick um as we did in variational autoenccoders [01:08:27] as we did in variational autoenccoders and maximize some variational lower [01:08:28] and maximize some variational lower bound of the likelihood of the data and [01:08:30] bound of the likelihood of the data and that gives rise to this um latent [01:08:32] that gives rise to this um latent variable model interpretation of [01:08:33] variable model interpretation of diffusion. [01:08:35] diffusion. Another another a totally different [01:08:36] Another another a totally different interpretation of diffusion is that it [01:08:37] interpretation of diffusion is that it models something called the score [01:08:38] models something called the score function. Um so given a data given a [01:08:41] function. Um so given a data given a data data given a distribution p data of [01:08:44] data data given a distribution p data of x um there's this nice thing called the [01:08:47] x um there's this nice thing called the score function of the distribution which [01:08:48] score function of the distribution which is the derivative with respect to x of [01:08:50] is the derivative with respect to x of the log of p data of x and intuitively [01:08:52] the log of p data of x and intuitively this given a distribution the score [01:08:54] this given a distribution the score function is a vector field that points [01:08:56] function is a vector field that points towards areas of high of high [01:08:59] towards areas of high of high probability density. So um you know for [01:09:01] probability density. 
[01:09:03] So for any point in the data space, the score function is going to be a vector that points you towards areas of high data density. And now another interpretation of diffusion is that diffusion is learning the score function of the data distribution, and in fact learning a set of score functions corresponding to different levels of noise on the data distribution. So there's an interpretation of diffusion which is that it's trying to learn a family of score functions corresponding to a family of noised distributions that corrupt the true data distribution with increasing amounts of known noise. That's a totally different mathematical formalism that gives rise to a very similar-looking algorithm at the end. And then the third one, which has come onto the scene a little more recently, is this notion of diffusion as solving stochastic differential equations.
[01:09:44] And I've got to admit, I don't fully understand this one myself, so don't ask me too many questions. But the idea is that you want to write down some differential equation, some infinitesimal way to transport samples from a noise distribution into samples from a data distribution. Then at inference, the neural network is basically learning some kind of numeric integrator for this stochastic differential equation that we can write down. And this opens up a whole new way of thinking about it, right? Because under the lens of stochastic differential equations, we get access to whole different categories of methods to sample from these things at inference time.
[01:10:24] And from this perspective, the kind of naive gradient-descent-type approach that we saw in rectified flow basically corresponds to a forward Euler type of integrator on top of a stochastic differential equation. Under this interpretation, you can imagine using all kinds of more complicated integrators to maybe do a better job of marching along this score function. So again, these are deep waters; there are papers that go into great detail on all these things. And a blog post that I really liked is this one by Sander Dieleman, "Perspectives on Diffusion," which gives eight different perspectives on ways to think about or view diffusion models. This is an excellent post.
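Forward Euler, the simplest of the integrators mentioned above, is just repeated small steps along a velocity field. A toy sketch (my own code; in diffusion sampling the function f would be the network's predicted velocity, here it is a known ODE so the scheme itself is visible):

```python
def euler_integrate(f, x0, t0, t1, n_steps):
    """Integrate dx/dt = f(x, t) from t0 to t1 with forward Euler steps."""
    x, t = x0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        x = x + dt * f(x, t)   # one Euler step: follow the local velocity
        t = t + dt
    return x
```

On dx/dt = -x starting from x = 1, a thousand steps land close to the exact solution exp(-1); higher-order integrators would get closer with far fewer steps, which is exactly the appeal of the fancier samplers.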
[01:11:00] I would actually highly recommend everything he's written about diffusion models; all his blog posts are amazing. Autoregressive models actually come back too: we can do the same thing with an encoder-decoder and put an autoregressive model on there as well. So, just sneaking this in at the end: in addition to diffusion models, the other modern recipe for generative modeling is to train an autoregressive model on discrete latents that are computed by a discrete variational autoencoder. That's why we covered the four generative models that we did, GANs, VAEs, autoregressive models, and diffusion, because it turns out they all get used in modern machine learning pipelines. So that's basically the summary of today. Today we did a whirlwind tour of two different categories of generative models.
[01:11:44] We talked about generative adversarial networks as well as diffusion models. And we saw their modern full pipeline instantiated in latent diffusion models, which is a nice way to wrap up this generative modeling section, because all the generative models that we saw basically come back and come together to form these big modern pipelines. So thanks, and next time we'll talk about vision and language.

================================================================================ LECTURE 015 ================================================================================ Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 15: 3D Vision Source: https://www.youtube.com/watch?v=7lxrKDKtykM --- Transcript

[00:00:05] I'm really happy to announce our next guest speaker for the course, Professor Jiajun Wu. Jiajun is an assistant professor here at Stanford in the Department of Computer Science, and he's a faculty member of the Stanford Vision and Learning Lab.
[00:00:25] His research focuses on scene understanding, with an emphasis on multimodal perception, robotics and embodied AI, visual generation and reasoning, and 3D understanding, which is the topic of today's lecture. So I'll now turn it over to Jiajun to begin today's lecture.

[00:00:42] Okay. Yeah. So I'm Jiajun. I'm an assistant professor here, and a few years ago I used to co-teach this class. I heard this year is the 10th-year anniversary, right? So we have guest speakers from different places. Okay. So today we're going to talk about 3D vision. It might be kind of different from a lot of things you learned before, because in the past few weeks we talked about convolutional neural networks and transformers, and maybe vision-language models and generative models as well, right? Okay. Yeah.
[00:01:14] So here for 3D, I'm going to first introduce a little bit about what the 3D representations are. This is pretty distant from all the deep learning stuff, but then we're going to talk about how deep learning, or AI, has changed 3D vision and how they can be integrated in different ways, and we'll look into a few different applications around 3D generation, reconstruction, and things like that. Okay. So let's begin by looking at the possible ways to represent objects in 3D, because in 2D it's so straightforward. I just have pixels, right? I load a PNG file or a JPEG file, and it's like 200 by 200 pixels. But how can we represent 3D objects? I think that's the first thing you want to look into. And 3D objects can be diverse. They can be at different scales.
[00:02:02] They can be huge: large buildings and trees, complex structures. And if you zoom in, you can also see all the fine details. So what are the best 3D representations to represent all these different types of 3D objects, at different scales, with different features? Unlike images, where everyone just uses pixels (200 by 200, 500 by 500), 3D objects also have geometry, textures, and materials, but let's just start by looking at geometry. Even just for 3D object geometry, there are so many different ways to represent it. We can basically categorize them into two categories. One is called explicit representations, where, in some sense, you are directly, I would say explicitly, representing part of the objects.
[00:02:49] This includes things like point clouds, where you have a cloud of 3D points, or a polygon mesh, or subdivisions, which we're going to talk about, and others. And there's a different category of object shape representations which are often called implicit. We're going to talk about them as well; I'm going to explain them in a little more detail later, including level sets, algebraic surfaces, and distance functions. They basically represent 3D objects, or their geometries, as functions, which is not as intuitive as "oh, it's just a collection of points," but as we'll see later, implicit representations also have their own advantages and weaknesses.
[00:03:26] So every choice has its suitable tasks and types of geometry; in particular, in the context of deep learning, each may also have its own strengths and weaknesses when you want to apply a deep learning method on top of it. So how do we choose a representation? We have to store them: pixels are easy to store because an image is just a matrix, but 3D point clouds are more irregular, and especially if you use an implicit representation, representing an object as a function, how would you store that in a computer? How does it support creating new shapes, especially when, say, the input is a picture or a language description? And different types of operations: if you have a 3D object, how can you edit it, simplify it, smooth it, filter it,
[00:04:11] repair it, right? You have to do a lot more. For images, sometimes you want to do that too: you want to edit them using language, you want to edit them using strokes. So how can you edit or perform any type of operation on 3D objects? And rendering: how can you turn 3D objects into 2D pixels? In some sense, 3D vision is inverting that process, right? How can you go from 2D images to reconstruct the 3D objects? So how does a representation support all these different things, including animations, especially if you are modeling, say, 3D humans or animals and you want to animate them? All these factors need to be considered, and something that sort of connects all of these is their integration with different deep learning methods for, say, shape editing, rendering, inverse rendering, and animation as well.
[00:04:58] So very quickly, we can go through some of these representations, like point clouds. A point cloud is probably the simplest representation: it only has 3D points. It doesn't have connectivity, so it doesn't capture how these points are connected. Instead of having an N-by-N matrix of the pixel values of all the pixels in a picture, you now have a 3-by-N matrix, where three is the x, y, z coordinates of the individual points and N is the number of points. Sometimes you can represent the surface normals of the points as well, so that you have not only where each point is in 3D space but also which direction it is facing.
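The 3-by-N layout can be sketched in a few lines. This is a toy example of my own (not from the lecture): sampling N points on a unit sphere, where the surface normal at each point conveniently equals the normalized position, so the cloud comes with orientations for free.

```python
import math
import random

def sample_sphere_point_cloud(n, seed=0):
    """Return a point cloud on the unit sphere as a 3-by-N structure, plus normals."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        # Normalizing a 3-D Gaussian gives a direction uniform on the sphere.
        gx, gy, gz = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        r = math.sqrt(gx * gx + gy * gy + gz * gz)
        xs.append(gx / r); ys.append(gy / r); zs.append(gz / r)
    points = [xs, ys, zs]                  # 3 rows (x, y, z), N columns
    normals = [row[:] for row in points]   # on a unit sphere, normal == position
    return points, normals
```

Note there is no connectivity anywhere in this structure: just coordinates, and optionally a normal per point.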
[00:05:38] So you have the surface normals, which give you a bit more information, and sometimes people call these surfels: points with orientations. And why do you need surface normals? Because if you want to render them, you want to see how the object looks, and that means you often have to specify a lighting source, right? Where is the lighting coming from? To make the rendering look realistic, you have to consider how lighting coming from a certain direction is going to interact with the point, and this is where the surface normals are used to help make the rendering look realistic, like you can see here. So how can you get points? A benefit of the point cloud is that it is often the raw format you will get from a lot of 3D sensors.
[00:06:22] This includes these kinds of depth sensors and some 3D scanners, and nowadays you can even use your iPhone; ARKit and similar software let you scan 3D objects. But the raw output of those sensors is still basically 3D point clouds. Of course, after that you have to process them and fuse them to make, say, objects with textures. Since they often come from scanners, they can potentially be very noisy, and you want to fuse them, merge them, repair them, and in this part you have to consider how these different captures can be registered to give you the shared point cloud. And they're very flexible: because you can move points here and there, you can use them to represent basically any type of object geometry; you're not
[00:07:10] constrained by the topology or things like that. It's also useful for large datasets, because sometimes you have to consider a very diverse set of objects. But because the points are, in some sense, already sampled, if you're representing objects and your points are sampled in an uneven way, in the sense that you have a lot of points on, say, the head of the rabbit but very few points on the tail of the rabbit, then it will actually be hard to draw samples from these under-sampled regions.
[00:07:40] So sometimes when people consider sampling points, you have to design algorithms to make sure you sample them roughly evenly across different parts of the objects. There are other limitations too: it's not obvious how we can directly perform some very useful operations, like simplification or subdivision, on these objects. It doesn't directly allow you to do smooth rendering. There's no topological information. So for example here, if I give you a collection of points, you can't even tell if this is a torus or one of these other ring-like shapes, because it doesn't tell you how the points are connected. So it's kind of partial information about what the object is if you just have the point cloud.
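The even-sampling algorithms mentioned above are not specified in the lecture; one common choice is greedy farthest point sampling, sketched here on a made-up two-cluster cloud (a dense "head" and a sparse "tail").

```python
import numpy as np

def farthest_point_sampling(pts, k, seed=0):
    """Greedy farthest point sampling: pick k points that cover the
    cloud roughly evenly, regardless of the input sampling density.

    pts: (n, 3) array of points; returns indices of the k chosen points.
    """
    n = pts.shape[0]
    rng = np.random.default_rng(seed)
    chosen = [rng.integers(n)]
    # dist[i] = distance from point i to its nearest already-chosen point
    dist = np.linalg.norm(pts - pts[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))  # farthest from the current set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(pts - pts[nxt], axis=1))
    return np.array(chosen)

# A cloud that is 10x denser on the "head" than on the "tail":
rng = np.random.default_rng(1)
head = rng.normal([0, 0, 0], 0.1, size=(1000, 3))
tail = rng.normal([5, 0, 0], 0.1, size=(100, 3))
cloud = np.vstack([head, tail])
idx = farthest_point_sampling(cloud, 16)
# FPS reaches the sparse tail even though the head dominates the input.
print((idx >= 1000).sum() >= 1)  # True
```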
[00:08:22] So naturally people will say, okay, how can I actually capture more information, so that I can distinguish between these two different objects? That naturally leads to polygonal meshes. A mesh represents the object still as a collection of points, but also how these points are connected. So now you have not only the points but also the faces, the surfaces. This is arguably the most widely used representation for 3D objects in all these graphics engines, in computer games and so on; basically it is all represented as polygon meshes.
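A minimal mesh is just the two arrays described here: vertex positions plus faces that index them. The tetrahedron below is a made-up toy example; the connectivity is exactly what the point cloud lacks, and from it you can recover edges and check Euler's formula.

```python
import numpy as np

# A minimal triangle mesh: vertex positions plus faces indexing them.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
faces = np.array([[0, 1, 2],   # each row lists 3 vertex indices
                  [0, 1, 3],
                  [0, 2, 3],
                  [1, 2, 3]])

# From the faces we can recover the edges -- the connectivity a raw
# point cloud does not have.
edges = set()
for f in faces:
    for a, b in [(f[0], f[1]), (f[1], f[2]), (f[2], f[0])]:
        edges.add((min(a, b), max(a, b)))
print(len(vertices), len(faces), len(edges))  # 4 4 6
# Euler's formula for a closed genus-0 mesh: V - E + F = 2
print(len(vertices) - len(edges) + len(faces))  # 2
```

This is why a mesh can distinguish a torus from other ring-like point sets: the connectivity (and hence V − E + F) changes with the topology, while the bare points may not.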
[00:08:54] But you can see that representing the faces is more complex, because, especially if you're looking at raw meshes, every face may have a different number of points: some have three points, some have four, some have five. How can you represent them, especially given their irregularity? How can you integrate them with neural networks? Especially in the early stage, when people started with convolutional neural networks, those always assume a fixed resolution, but here you have a variable dimension for this raw information. How does that integrate with deep learning? That's been a big challenge, and that's why deep learning with 3D vision started kind of late: people were thinking about how we can adapt all these deep learning methods to deal with all these
[00:09:35] complex representations for objects, which are not as unified as images. But meshes are really widely used, and they can be very complex meshes that capture all the details. For example, you have scanners, you get points, then you fuse them and apply some algorithm, and you can get a very detailed mesh. This one has about 56 million triangles and 28 million vertices to represent a sculpture. And you can have even larger ones: say, Google Earth has trillions of triangles trying to represent basically all the buildings on Earth. The nice thing about meshes is that they support a lot of operations, like subdivision: I want more details, so how can I use more faces to capture more details of the shape? And you can do simplification as well. Sometimes you want to process things very fast.
[00:10:27] So I don't need that many faces; I just want to simplify the mesh, and there are existing algorithms that allow you to do that as well. And regularization: if you get an irregular mesh, sometimes you want to regularize it so that every face is a triangle, always connecting three vertices, and the faces have roughly the same size, so that it's easier for processing and has good properties that support further processing by different graphics algorithms. For meshes, people have developed these algorithms as well, so you can ensure that points in different regions are roughly evenly sampled, so it won't be the case that, say, the head of the rabbit is much more densely sampled than the tail, and these kinds of things.
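The subdivision operation mentioned above can be sketched in its simplest form: midpoint subdivision, which splits every triangle into four by inserting a vertex at each edge midpoint. (Smooth schemes such as Loop subdivision also reposition vertices; this sketch only refines connectivity.)

```python
import numpy as np

def midpoint_subdivide(vertices, faces):
    """One round of midpoint subdivision: split every triangle into 4
    by inserting a new vertex at each edge midpoint."""
    verts = [np.asarray(v, dtype=float) for v in vertices]
    midpoint_of = {}
    def mid(a, b):
        key = (min(a, b), max(a, b))
        if key not in midpoint_of:         # create each edge midpoint once
            midpoint_of[key] = len(verts)
            verts.append((verts[a] + verts[b]) / 2.0)
        return midpoint_of[key]
    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        new_faces += [[a, ab, ca], [b, bc, ab], [c, ca, bc], [ab, bc, ca]]
    return np.array(verts), np.array(new_faces)

# Subdividing a tetrahedron (4 vertices, 6 edges, 4 faces):
vertices = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
v2, f2 = midpoint_subdivide(vertices, faces)
print(len(v2), len(f2))  # 10 16 -- one new vertex per edge, 4x the faces
```

Simplification is the inverse direction (e.g. edge-collapse decimation); existing libraries implement both.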
[00:11:12] Okay, so this is one type of shape representation, and there are other types. For example, parametric representations, because objects are not just totally irregular. Points and meshes are very general, but sometimes you lose a lot of information: if you look at, say, your chairs or tables, you have all these straight lines, right? So how can you represent these straight lines? When people design objects, they often use some of these parametric representations, where you can represent shapes as a function. If I want to represent a surface or a curve, the underlying degree of freedom is actually lower.
[00:11:52] Often if I have a curve, there's only one underlying degree of freedom. That's why I can represent a curve using a function f(x): just vary x and get a value of y. So you can use all these different types of functions in 2D, but also, more often, in 3D, to map a certain number of variables, the underlying intrinsic dimensionality of the object, which is often, say, two or even one, into 3D space. This allows you to represent a 3D object in a parametric representation using basically a set of functions. You can do that for curves, say for circles: if you want to represent a circle, one way is to sample a number of points, or you can even connect them like a mesh using lines.
[00:12:39] Another way is to represent the curve, the circle, as a function: basically there's a sine function and a cosine function, and you just vary one variable, t, which you can think of as the degrees or the angle, and it maps to all the points on the circle. So now you can use a function as a parametric representation for a curve in 2D, and of course you can do that in 3D as well. If you want to represent a sphere, all you need is two degrees of freedom, u and v, and then you go through these functions to map them to every point in 3D space on the sphere. People have designed, I'm not going to detail it here, more complex parametric representations like Bézier curves and Bézier surfaces, which allow you to represent these pretty flexible and smooth surfaces in 3D using basically a few
[00:13:27] control points. So you basically use these Bézier basis functions to capture the underlying lower dimensionality of these surfaces, and then you can map that underlying low dimensionality into these flexible shapes. They also allow you to do things like subdivision, so you can get more detail into the surfaces and make them more fine-grained, and things like that. Okay, so that would be the second category of shape representations: you can represent 3D objects in a non-parametric way, like a collection of unordered points, or with their connections as meshes, or you can represent them in a parametric way, where you have a function, and by varying a few parameters that are the underlying true degrees of freedom of the object geometry, you can map them into more complex shapes.
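The control-point idea can be sketched with de Casteljau's algorithm, which evaluates a Bézier curve by repeated linear interpolation of the control points. The four 3D control points below are made up for illustration.

```python
import numpy as np

def bezier(control_points, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] using
    de Casteljau's algorithm: repeatedly interpolate adjacent control
    points until one point remains. Works for any number of control
    points and any dimension (2D or 3D)."""
    pts = np.asarray(control_points, dtype=float)
    while len(pts) > 1:
        pts = (1.0 - t) * pts[:-1] + t * pts[1:]
    return pts[0]

# Four made-up 3D control points -> a smooth cubic curve.
ctrl = [[0, 0, 0], [1, 2, 0], [3, 2, 1], [4, 0, 1]]
print(bezier(ctrl, 0.0))  # [0. 0. 0.]  (curve starts at the first control point)
print(bezier(ctrl, 1.0))  # [4. 0. 1.]  (and ends at the last one)
print(bezier(ctrl, 0.5))  # [2.  1.5 0.5] -- a point on the smooth interior
```

A Bézier surface extends the same idea to a grid of control points with two parameters (u, v) instead of one t.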
[00:14:16] So basically, everything here, as I said: if you remember, at the very beginning we said there are two types of ways to represent object geometry, one is explicit and the other is implicit, and all of these fall into the category of being quite explicit. I have points, and the points are just directly points on the object; and the surfaces, or the parametric curves as well, map directly to the points on the object. So these explicit representations have a lot of benefits. First, you map all the points directly, so you can get all these points. In general, say I have a Bézier surface representation: I can sample two values, u and v, in this underlying low-dimensional space, and then, going through that function, I
[00:15:04] map it to a point in 3D space. So I directly get a point on the surface in 3D space; all points are given, in some sense, directly. So it's very easy for us to sample points. Let's say I have this torus, represented using this function f, and now my question is: can you just sample some points on the surface of the object for me? This is so easy, because I will just randomly sample some u, v values, let them go through this function, and it will compute and give me 3D points which are guaranteed to be on the surface of the object. So sampling is much easier. What is hard about these explicit representations?
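The sampling step just described can be sketched directly. The lecture doesn't give the torus's formula, so the standard parameterization is assumed here, with made-up radii R and r; every randomly sampled (u, v) pair is guaranteed to land on the surface.

```python
import numpy as np

def torus_point(u, v, R=2.0, r=0.5):
    """Standard parametric torus (assumed example surface): major radius
    R, tube radius r; (u, v) in [0, 2*pi)^2 map onto the surface."""
    x = (R + r * np.cos(v)) * np.cos(u)
    y = (R + r * np.cos(v)) * np.sin(u)
    z = r * np.sin(v)
    return np.stack([x, y, z], axis=-1)

# "Sample some points on the surface for me": draw random (u, v) and
# push them through the function -- every output lies on the torus.
rng = np.random.default_rng(0)
u, v = rng.uniform(0.0, 2.0 * np.pi, size=(2, 500))
pts = torus_point(u, v)  # (500, 3) surface points

# Sanity check with the implicit torus equation:
# (sqrt(x^2 + y^2) - R)^2 + z^2 = r^2 for every sampled point.
lhs = (np.hypot(pts[:, 0], pts[:, 1]) - 2.0) ** 2 + pts[:, 2] ** 2
print(np.allclose(lhs, 0.5 ** 2))  # True
```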
[00:15:48] The hard thing is that it's very hard, in some sense, to test whether a point is inside or outside the object. Similarly, if I represent a sphere as this function, it's easy for me to sample points on the sphere, but it is hard if I have a query: say I have this point (3/4, 1/2, 1/4) in 3D space, is it inside the object or is it outside the object? You know, I think we can maybe... actually, I'm not even sure about that. So it is actually kind of hard to test whether a certain point is inside or outside the object.
[00:16:28] So you can see that all these representations have their own strengths and weaknesses. For explicit representations, it's actually pretty easy to sample points, which is very useful, because sometimes you want to convert them into, say, a collection of points, and then apply whatever point-based neural network on it. But it's hard to test if a certain point is inside or outside the object, which may cause some issues. For example, if you want to use neural rendering methods: nowadays a lot of these neural rendering methods require a lot of these kinds of queries about whether a point is inside or outside the object, what the geometry or density of the object is at a particular point, what the material, or the radiance or color, of the object is at a particular point.
[00:17:07] So explicit representations are not very supportive of this; it's not easy to run these operations on explicit representations. So naturally people thought, okay, maybe we can come up with a different way to represent geometry, and here I say implicit representations of geometry, but as you will see later, a lot of these neural rendering or deep learning methods just extend these implicit representations to not only geometry but also the colors and appearance of objects in 3D. The idea of these implicit representations is that I want to classify these points: I assume that if the points are on the object, on the surface of the object, then they satisfy some certain relationship.
[00:17:47] For example, for a unit sphere, what are the points on the sphere? The constraint they satisfy is that the square of x, the square of y, and the square of z, when you sum them up, equal one. So this is the constraint satisfied by all the points on the sphere. More generally, you can write it down as: the constraint is some function f(x, y, z) = 0. In this case the function would be x^2 + y^2 + z^2 - 1, so that would be the function here. But more generally, even for complex shapes, sometimes these functions can be so complex that you don't even have a closed form. So how can I represent f? I just write it as a neural network.
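The sphere constraint just written down can be coded in one line; checking it on the query point (3/4, 1/2, 1/4) from a moment ago also previews why the sign of f is useful (for this f, interior points give negative values, exterior points positive).

```python
import numpy as np

def f_sphere(p):
    """Implicit unit sphere: f(x, y, z) = x^2 + y^2 + z^2 - 1.
    f == 0 on the surface, f < 0 inside, f > 0 outside."""
    p = np.asarray(p, dtype=float)
    return float((p ** 2).sum()) - 1.0

# The inside/outside query from the lecture: the point (3/4, 1/2, 1/4).
print(f_sphere([0.75, 0.5, 0.25]))  # -0.125, i.e. -1/8 -> negative, so inside
print(f_sphere([2.0, 0.0, 0.0]))    # 3.0 -> positive, so outside
```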
[00:18:31] My hope is that a neural network will be able to represent it. But in general, the idea is that you have some function, or some constraint, that the points on a certain object will satisfy, and this is the way you represent the object. This is called an implicit representation, which started with geometry, but as I said, it is now used in all these different ways: representing textures, materials, appearance, and all these things.
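Writing f as a neural network just means replacing the closed-form expression with a learned function of (x, y, z). Here is a minimal, untrained sketch of that interface (in DeepSDF-style work the weights would be trained so that f is zero on the surface; the architecture and sizes below are assumptions, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny hypothetical MLP f(x, y, z) -> scalar, standing in for a
# learned implicit function. Weights are random, i.e. untrained.
W1, b1 = rng.normal(size=(64, 3)), np.zeros(64)
W2, b2 = rng.normal(size=(1, 64)), np.zeros(1)

def f_neural(p):
    """One hidden ReLU layer; returns one scalar per 3D query point."""
    p = np.asarray(p, dtype=float)
    h = np.maximum(0.0, W1 @ p + b1)
    return float(W2 @ h + b2)

# The interface is the whole point: any 3D query in, one value out,
# exactly like the closed-form implicit function.
print(f_neural([0.75, 0.5, 0.25]))
```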
[00:18:54] So the good thing about implicit representations... oh sorry, let's start with the bad thing. The bad thing about implicit representations is that it's now actually much harder to sample points, right? I tell you: okay, this is the constraint that, let's say, this torus satisfies. For every x, y, and z, if I put it into this function and the output is zero, then yes, it must be on the surface of this object. But then how can I get a few of these (x, y, z) tuples? That would be very hard, because you're required to solve this function. Maybe this function is not too hard to solve; maybe you can still solve it using some high school math. But when the function gets really complex, for arbitrary shapes, it becomes much harder to solve these functions.
[00:19:34] So it's not easy to actually sample points on the surface of the object if you are representing it implicitly. But the benefit, the strength of that, is that it's now actually pretty easy to test whether a point is inside or outside the object. If I want to do that test, I just have a query, and this is so easy: is it inside or outside? I just send it into that function and get a value: is it below zero or above zero? Because I assume the object is represented by this function, and all the surface points on the object satisfy the function equals zero, anything whose output value is lower than zero, that is, negative, must be inside the object; and if the output value is positive, then the point must be outside the object. So now it becomes much easier to
test whether a certain point is inside or outside the object, although it becomes much harder to sample a number of points on the surface of the object. So you can see there's a clear trade-off between these implicit and explicit representations. Here, again, we are talking about geometry, but this distinction, the contrast between explicit and implicit representations, is, I think, very important and fundamental; it is behind the development of deep neural networks when they are applied to 3D data in general, as we'll see later. Okay. So we're at 25 minutes; I promise I'll spend no more than another five minutes, and then we're going to talk about deep learning. So before we talk about how deep learning can be applied to 3D representations in general, a little bit more on implicit representations: some other features of implicit representations.
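The trade-off described above (easy membership test, hard surface sampling) can be made concrete. Testing a point is one function evaluation, while sampling a surface point means solving f = 0; the bisection below is one generic strategy for that, my illustration rather than anything from the lecture:

```python
import random

def f(x, y, z):
    """Unit sphere as a stand-in for an arbitrary implicit shape."""
    return x**2 + y**2 + z**2 - 1.0

def is_inside(p):
    """Easy direction: one evaluation answers inside vs. outside."""
    return f(*p) < 0.0

def sample_surface_point(n_steps=60):
    """Hard direction: find a root of f by bisecting along the segment
    between a known inside point and a random outside point."""
    lo = (0.0, 0.0, 0.0)                                   # f(lo) < 0
    hi = tuple(random.uniform(-2.0, 2.0) for _ in range(3))
    while f(*hi) <= 0.0:                                   # retry until outside
        hi = tuple(random.uniform(-2.0, 2.0) for _ in range(3))
    for _ in range(n_steps):
        mid = tuple((a + b) / 2.0 for a, b in zip(lo, hi))
        if f(*mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return mid

print(is_inside((0.5, 0.0, 0.0)))    # True
p = sample_surface_point()
print(abs(f(*p)) < 1e-9)             # True: numerically on the surface
```

For the sphere this is overkill (high-school math suffices, as the lecture says), but the bisection keeps working even when f has no closed form.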
[00:21:06] The good thing about them is that it's easy to compose them, right? Sometimes you feel like: if I have to represent everything with a function, that seems great if I have a closed form, but it also seems very constrained, because for every closed form I can write out, the geometry looks very, very regular. So if I want to represent the shape of a cow, how would I represent that? What would be the function I could write for the shape of a cow? It's just not obvious. But the nice thing about implicit representations is that you don't have to write everything in one shot, because it's so easy to compose them, right? You can actually perform logical operations on these implicit functions. Let's say you have two objects and you want to find their union, or intersection, or difference: again, they're just values, right?
[00:21:46] So you put (x, y, z) into this function, you get a value. You put (x, y, z) into that function, you get a value. You can just do arithmetic operations on top of these values, and that allows you to compute the union, intersection, or difference between these objects, and eventually you can compose them to develop pretty complex shapes. This actually supports a lot of industrial design: when people are doing manufacturing and have to fabricate some complex shape, a lot of these designs are done with CAD models, computer-aided designs, which compose these implicit functions using simple logical operations. And you can also do things beyond just logical operations; you can even add things up, especially if you have
a distance function, where every point's positive or negative value actually has meaning, because it indicates how far you are from the surface of the object. So you can even add them up, and this allows you to smoothly blend shapes, right? You can see that here: if I have a distance function and I just want to represent a vertical line, okay, this is here; then anything below zero is to the left of the line, and anything positive is to the right of the line. And then you have another line, represented using a different function. So what happens if you add them up? If you add them up, it naturally becomes an interpolation between these two shapes, right? This is an example of doing things in 1D, but you can imagine doing similar things in 3D, in a sense.
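Both ideas fit in a few lines. With the sign convention "negative means inside", union and intersection of two implicit functions are commonly taken as pointwise min and max, and difference as max(f, −g); blending really is just adding values. The shapes and names here are illustrative, not from the slides:

```python
def sphere(cx, cy, cz, r):
    """Implicit function of a sphere centered at (cx, cy, cz)."""
    return lambda x, y, z: (x - cx)**2 + (y - cy)**2 + (z - cz)**2 - r**2

# Logical composition of implicit functions (negative means inside):
def union(f, g):        return lambda x, y, z: min(f(x, y, z), g(x, y, z))
def intersection(f, g): return lambda x, y, z: max(f(x, y, z), g(x, y, z))
def difference(f, g):   return lambda x, y, z: max(f(x, y, z), -g(x, y, z))

a = sphere(0.0, 0.0, 0.0, 1.0)
b = sphere(1.0, 0.0, 0.0, 1.0)
print(union(a, b)(1.5, 0.0, 0.0) < 0)          # True: inside b, so inside the union
print(intersection(a, b)(1.5, 0.0, 0.0) < 0)   # False: not inside both

# Additive blending of two 1D distance functions (two vertical lines):
d1 = lambda x: x - 1.0         # zero set at x = 1
d2 = lambda x: x - 3.0         # zero set at x = 3
blend = lambda x: d1(x) + d2(x)
print(blend(2.0))              # 0.0: the blended "line" sits in between
```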
[00:23:13] Okay, now you can actually blend these different shapes, and these distance functions can be arbitrarily composed, allowing you to create actually pretty complex worlds like this. And this is not easy, but you can construct really complex worlds, with all the details, just by composing these different functions; it's not trivial, but they are actually very expressive if you're very good at it. Okay. So we said: we have parametric representations that are explicit, which directly give you points on a 3D surface, or we can have parametric representations like these functions, but they're implicit, right?
[00:23:53] With those, you can only try to verify whether a point is inside or outside an object, but you can also compose them to build more complex shapes. Is it possible for us to also have representations that are implicit and nonparametric, point-cloud style, but where you can still query functions? Sometimes we actually do have things like that, and this eventually leads to methods like level-set methods. So implicit surfaces are very nice because, as we said, it's easy to merge them and it's easy to split them, but sometimes, as we said, it's hard to describe complex shapes in closed form, right? You have a cow: how would you represent it? Okay, you can compose them.
[00:24:26] But you know, if every time I have to query whether a certain point is inside the cow, I need hundreds of functions and have to perform all these and/or, plus/minus operations, then it takes a long time. So what if I just pre-query, right? I have a 3D space, and I just sample, let's say, a 100 by 100 by 100 grid. So now I have a million points pre-sampled, and for these one million points I just precompute whether they're inside the object or outside the object, and what the distance of each of these one million points is to the surface, for complex shapes. So you can precompute them and then store all the values in a matrix. This is shown in 2D for visualization, but in practice it's in 3D, right? So you have a 3D matrix that stores all these precomputed values of the distance functions.
[00:25:12] So now, in some sense, you still have an implicit representation, but because you have pre-queried it, you have turned it into a nonparametric representation. And even if you just look at this matrix in 2D, you can still find where the boundaries are. So where are the boundaries? They're basically where you have two adjacent values, one positive and one negative, right? That means there must be some point in between that satisfies the function f(x) = 0, which means the point must be on the surface. Right? So in that sense you're turning a parametric representation, which is implicit, into a nonparametric representation by pre-querying a lot of these points using the functions, and this actually gives you more explicit control, because you can now visualize them. You can say: I
have this matrix, and I can visualize it based on the values. And this is used a lot in things like CT and MRI and all this medical data. And a related thing is, people may say: okay, what if I don't care about all these distance values? I can pre-query what's going on at all these points and compute all the values, let's say plus five, minus five, but all I care about is whether a point is inside the object or outside the object, right? So if it's positive, I'll just treat it as one; if it's negative, which means it's inside the object, treat it as zero.
[00:26:32] Let's say you binarize them: then this gives you a final representation, which is arguably the easiest to understand. This is called voxels, right? So you pre-query where the implicit function is, and you have this kind of densely sampled grid; but now, instead of storing the distance functions, how far the points are from the surface, by going through the functions and getting plus five or minus five, you just binarize it. You only care about whether a certain point is inside the object or outside the object. Then you have a voxel representation, which is again like a 3D matrix, maybe 100 by 100 by 100, but for every point you have gone through the function and queried whether it's inside or outside the object, so you have a one or a zero, and you can represent objects in a binarized way. So this gives you the final representation I'm going to talk about for objects in 3D.
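The whole pipeline just described (pre-query a grid, find the boundary from sign changes, binarize into voxels) is short to sketch; a 32³ grid and a unit sphere stand in here for the 100³ grid and an arbitrary complex shape:

```python
import numpy as np

n = 32
xs = np.linspace(-1.5, 1.5, n)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")

# Pre-query the implicit function at every grid point and store the
# values in a 3D matrix (a discretized signed-distance-like field).
F = X**2 + Y**2 + Z**2 - 1.0

# The surface lies between adjacent samples of opposite sign:
crossings_x = np.sign(F[:-1]) != np.sign(F[1:])
print(crossings_x.any())            # True: the boundary was located

# Binarize: keep only inside (1) vs. outside (0) -> a voxel grid.
voxels = (F < 0).astype(np.uint8)
print(voxels.shape)                 # (32, 32, 32)
print(voxels.mean())                # occupied fraction, roughly
                                    # sphere volume / cube volume
```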
[00:27:19] So I have introduced voxels in a kind of complex way, but from a different perspective, people may say this is actually very easy to understand, because in some sense voxels have a lot of analogy to pixels: pixels are like 2D matrices, and now you have 3D matrices, and voxels are basically just a 3D matrix, right? Although you can see that they have connections with all the other ways we can represent shapes, the reason I'm introducing it this way has to do with what happened when deep learning came in. So first, when did deep learning start? Deep learning has been around for a long time, but the modern deep learning era started around 2010, when Geoff Hinton and colleagues started doing it for speech recognition, and then in 2012 there was AlexNet, which ran on ImageNet.
[00:28:03] So you've learned all of these, and they're all in 2D. Okay. Now people say: okay, what if I want to do this in 3D, right? This is a very natural thought. So I want to go from 2D convolutional networks (in 2012 there were no transformers, right?): how can I apply a 2D convolutional network to 3D data? And everyone knows we have all these different 3D representations. But which one to begin with, right? And it turns out that the people who started doing deep learning on 3D data were the computer vision people, not the graphics people. They were like: I've been working with pixels, and maybe the easiest thing I can do is just to scale up; instead of working on 2D matrices, I just make it work on 3D matrices. So that would be the simplest thing I can do.
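In a modern framework that really is close to a one-line change (e.g. swapping a 2D convolution layer such as PyTorch's `nn.Conv2d` for its `nn.Conv3d` counterpart). Written out by hand, the volumetric convolution is the same sliding-window sum with one extra loop; a naive sketch, not an efficient implementation:

```python
import numpy as np

def conv3d(volume, kernel):
    """Naive 'valid' volumetric convolution: the direct 3D analogue of
    sliding a 2D filter over pixels, applied to a voxel grid."""
    D, H, W = volume.shape
    d, h, w = kernel.shape
    out = np.zeros((D - d + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+d, j:j+h, k:k+w] * kernel)
    return out

vox = np.random.rand(16, 16, 16)        # a toy voxel grid
kernel = np.ones((3, 3, 3)) / 27.0      # 3x3x3 averaging filter
out = conv3d(vox, kernel)
print(out.shape)                        # (14, 14, 14)
```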
[00:28:43] Instead of having a 2D convolution in the network, I have a volumetric convolution in the network. Then which of these representations allows you, or supports, a volumetric convolution? It turned out to be this voxel representation. This is basically the easiest thing you can imagine, right? But the graphics people did not agree with that, because the graphics people were like: oh, this voxel representation is really bad, because it's very slow to compute, as we talked about; we have to pre-sample all these values, and you can look at the quality, it's so bad compared with meshes or point clouds. So people were like: why do you even want to start with that?
But the reason people started doing deep learning on 3D data with voxels is, I think, that it's just so easy to draw an analogy between pixels and voxels, and you only have to change one kind of code: instead of doing a 2D convolution, you now do a 3D convolution, right? So that's, in some sense, how things got started. Okay, but before I talk about the different methods for 3D data, another aspect that's very important is the data for 3D. Yeah, so beyond methods, datasets are also very important; ImageNet really prompted AlexNet and things like that. So for 3D, similarly, we have to collect a lot of data as well. So pre-deep learning, the common, popular dataset people often used is this thing called the Princeton Shape Benchmark, which has 1,800 models in 180 categories.
[00:29:57] So you can see they actually have quite a lot of categories, 180 categories, but there are only 1,800 models, which means there are basically 10 models per category, which is so small. But back then it was considered pretty large, and people felt like: oh, this is already enough, because we can't really make any of these things work well on them anyway, and there was very little machine learning there. So prior to 2014, all these datasets were more or less small. They may have a certain number of models, even up to 9,000 or 10,000, but they're also divided into so many different classes, so each class only has like 10 models, or fewer than 100, I would say. So after that, people started by saying: okay, if we have ImageNet, we should also have the 3D datasets for shapes. So this is behind the efforts
So this is uh behind efforts of a few concurrent work but really I [00:30:42] of a few concurrent work but really I think eventually they sort of [00:30:43] think eventually they sort of consolidated into this thing called [00:30:44] consolidated into this thing called shapenet which is a lot of them are [00:30:46] shapenet which is a lot of them are actually led by Stanford you know [00:30:48] actually led by Stanford you know there's Leo Gibbus and Sylvio Sarasi um [00:30:52] there's Leo Gibbus and Sylvio Sarasi um so they led this kind of large data sets [00:30:54] so they led this kind of large data sets called shapenet which has three million [00:30:56] called shapenet which has three million models and so but in practice just like [00:30:59] models and so but in practice just like image net you have this large image and [00:31:00] image net you have this large image and there's a smaller data set that people [00:31:02] there's a smaller data set that people often use. So shapeet similarly you have [00:31:04] often use. 
[00:31:05] For ShapeNet, similarly, you have the ShapeNetCore dataset, which is what people typically use: basically 50,000 models in 55 categories. You can see that for every category you have 1,000 models on average, but in practice it's not that balanced; for chairs you actually have a lot more. That's why people said: now I finally have thousands of models of chairs, I can train some deep networks on them. Before, with just 10 models, you couldn't do anything. So that is how things started. [00:31:28] And there were a few years where a lot of these advances, and all the results, were presented only on chairs and cars, because those are the largest categories in ShapeNet. People felt that was great, but it's not enough, so we should move to something even bigger. [00:31:44] So in the past few years, and this is work at AI2, the Allen Institute, in Seattle, what they did is they
collected [00:31:49] much larger datasets called Objaverse and Objaverse-XL, where you have roughly 1 million or 10 million models of different 3D assets. You can see they have many more categories, and the models on average also have higher quality, with textures. [00:32:05] So those are synthetic datasets, but there are also real datasets being produced, including some from 3D scans: you just take 3D scanners. Back in 2016 people were already working on this; there is a dataset called, I think, the Redwood dataset or something, where you have about 10,000 scans of real-world objects. [00:32:24] And more recently, people have been building larger datasets where they also encourage people (I think this is a collaboration between Meta and Oxford) to capture data for them.
[00:32:38] They also pay people to capture data for them. So people just use an iPhone: you have an object, you put it on a table, you take a 360-degree video around the object, and then you get a dollar or something like that. This is the first version: they have 19,000 videos of objects. [00:32:53] Now, these are real objects, right? Capturing real objects is much harder; Objaverse and all the things I talked about before were synthetic objects, but these are real objects. [00:33:01] And then, because of a lot of the development in 3D vision algorithms, you can actually take these 360-degree videos and try to reconstruct the 3D objects. So now you have paired data: the videos or images of the objects, as well as their 3D geometries and textures. [00:33:27] This is their first version.
[00:33:30] I think they have a more recent version, V2 or maybe even V3 by now, which is supposed to be a little larger, but it's still kind of hard to scale up. Think about it: right now you have something like 19,000 videos, or basically 19,000 objects, and I think they're scaling it up, but I don't think it's over 100,000. [00:33:45] So basically you can think of it this way: for real objects, you have on the order of 100,000 models. But if you look at the dataset sizes for images, like LAION-5B or whatever, that's 5 billion images, and Google and OpenAI must have much larger datasets. So there's still a huge gap between the number of data points you can have for 2D images or videos and what you can have for 3D objects. [00:34:07] So I think that's a big challenge for how we can move forward with 3D vision, and people have different ideas. But still, this is much larger than what
[00:34:14] we had before; at least it's now possible to more or less train some deep learning models on these datasets. [00:34:22] And quickly: there are also other datasets that people have built around parts. This is also from Stanford, where they tried to annotate object parts and their correspondences and hierarchies. And there's this dataset called PartNet, where they wanted to annotate not only the parts and their semantics but also how they may move, a little bit of mobility information for the different parts; a laptop, for example, you can open and close. [00:34:47] And there are also datasets for 3D scenes, so not only objects and parts but also rooms. So there are things like the ScanNet datasets, where people actually just go inside your home, or go inside
[00:35:01] our office as well: they come with a 3D scanner, they scan the place, and then they add some annotations. [00:35:11] More recently, you can even do that with your iPhone. But still, these kinds of datasets are much smaller. The first version of ScanNet has 1,500-plus scans, I think, and the second version is roughly the same size, maybe 2,000 or 3,000 rooms. So the amount of data you have for 3D scenes in particular is even much smaller than the amount you have for 3D objects. [00:35:34] So I think it's not obvious how we can go beyond that constraint, because if you have to scan everything yourself, you're always bounded by how much time you have and how many people you have. [00:35:45] Um, anyway, there are attempts being made in
trying to collect data. [00:35:52] Okay, and finally: if we want to apply deep learning to 3D vision, what are the tasks we care about? There is generative modeling: just as Justin said you can generate 2D images or videos, you can also generate 3D shapes, and you can generate 3D scenes. You can make them conditional, and the condition can be language or an image: you have an input image, and how can you reconstruct the 3D object? You have to learn shape priors. [00:36:19] You have to do shape generation and completion: sometimes you have a partial object and you want to repair it, you want to fix it. So there is geometry data processing as well. [00:36:29] Other tasks include discriminative models: for example, you have a 3D shape, and how can you classify which category of object it belongs to? Is it a chair or a table?
[00:36:38] And a lot of these are now actually done by rendering into pixels, right? Because you have very good image recognition models, like GPT or something, you can just take a 3D object, render it into a picture, upload the picture to GPT, and it can do it for you. So that's in some sense one way of solving these discriminative problems. [00:36:56] But there are also more specific problems that are not so easy to solve: for example, you have a particular type of cell and you have its 3D scans, and how can you classify the cell? In these more specialized domains, where you don't have that much data, how can you solve these discriminative problems? [00:37:12] And then there's joint modeling of 2D and 3D data, which is becoming more and more important, because for 2D data we have so much more: we have so many images and videos, and we have
very good foundation [00:37:20] models that were trained on them. So how can we leverage the priors in our 2D foundation models, like what an image looks like, how to make an image look realistic, how to make a video look realistic? How can we use that information to help our 3D reconstructions be more realistic? [00:37:34] Right. So: joint modeling of 2D and 3D data, because there are so many large-scale 2D datasets and very good pretrained models. There have also been a lot of advances in neural rendering, or differentiable rendering, methods that basically connect the 3D world and the 2D world: you have a 3D model, you can render it into 2D, and the rendering process can be made differentiable, or can even be approximated with neural networks. [00:37:54] Then you can connect all these data in different modalities through differentiable neural networks, which allows you
to bridge the priors you have in 2D data, or in 2D foundation models, into the 3D world. [00:38:06] Yeah. And then sometimes you even want to do joint multimodal modeling beyond visual data, including textual data, and sometimes you have other data too. Say, in robotics, you often have tactile data; how do you fuse that in as well? And for autonomous driving, you may have lidar data or depth data; how can you fuse those as well? So we want to use deep learning on 3D data to solve all these different problems. [00:38:28] Now, we spent all this time talking about representations, so where do we begin? As I suggested, the people who initially did this were computer vision people, and what do they do? They work on pixels, they work on images.
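As an aside, the occupancy-to-image direction of the differentiable rendering mentioned a moment ago can be sketched in a few lines. This is a toy example of my own, not any particular paper's renderer: a soft occupancy grid is "rendered" to a silhouette by compositing along each ray, and because every pixel is a smooth function of every voxel on its ray, a 2D image loss can push gradients back into the 3D representation.

```python
import numpy as np

def soft_silhouette(vox):
    """Differentiably 'render' a soft occupancy grid (D, H, W) to a
    silhouette image (H, W): a pixel is covered unless every voxel
    along its ray is empty."""
    return 1.0 - np.prod(1.0 - vox, axis=0)

def d_sil_d_voxel(vox, d, i, j):
    """Analytic d silhouette[i, j] / d vox[d, i, j]: the product of the
    transmittances (1 - occupancy) of the other voxels on the ray."""
    return np.prod(np.delete(1.0 - vox[:, i, j], d))

rng = np.random.default_rng(0)
vox = rng.uniform(0.0, 1.0, size=(4, 8, 8))   # 4 voxels deep, 8x8 image
sil = soft_silhouette(vox)

# Finite-difference check of the gradient: this differentiability is
# what lets an image-space loss update a 3D model.
eps = 1e-6
bumped = vox.copy()
bumped[1, 3, 3] += eps
fd = (soft_silhouette(bumped)[3, 3] - sil[3, 3]) / eps
print(abs(fd - d_sil_d_voxel(vox, 1, 3, 3)))   # tiny: analytic matches numeric
```

Real systems composite color and opacity along camera rays rather than straight grid columns, but the gradient-through-rendering principle is the same.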
[00:38:38] So naturally they said: why don't we start with voxels? But even before that, there is an older idea, the very first idea that people tried in applying deep learning to 3D vision, and in some sense it's now coming back. The idea is: let's not even worry about voxels. Say you have a 3D shape, whether it's a mesh, a voxel grid, whatever, and I want to learn to recognize what the object is; the object here is a chair. But if the input is 3D data, how can we process it before we have 3D deep learning methods? [00:39:06] What if I just render it into images, since I have very good image models? I take the 3D object, put cameras at different places, and render images of the object from different views, and now this becomes a 2D problem.
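A minimal sketch of that render-to-views recipe, with made-up pieces: `featurize` stands in for a shared, pretrained 2D backbone (here just a random linear map plus ReLU), twelve random arrays stand in for the renderings, and the ten "classes" are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W_feat = rng.normal(size=(32 * 32, 64))   # stand-in 2D backbone weights
W_cls = rng.normal(size=(64, 10))         # stand-in 10-way classifier head

def featurize(view):
    """One rendered view (32, 32) -> a 64-d feature (shared weights)."""
    return np.maximum(view.reshape(-1) @ W_feat, 0.0)

def classify_shape(views):
    """Featurize every view with the SAME network, max-pool across
    views into one object descriptor, then classify the descriptor."""
    pooled = np.stack([featurize(v) for v in views]).max(axis=0)
    return int(np.argmax(pooled @ W_cls))

views = [rng.uniform(size=(32, 32)) for _ in range(12)]  # fake renders
label = classify_shape(views)
# Max-pooling makes the prediction independent of view order.
print(label == classify_shape(views[::-1]))   # True
```

The pooling step is the design choice that turns a bag of per-view predictions into a single shape-level one, and it is what makes the pipeline indifferent to how many views you render or in what order.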
[00:39:22] I would just apply a convolutional neural network to each of these views, and I have some way of fusing them, right? Some pooling or whatever, and then I just do image classification. So this becomes an image classification problem, the only difference being that now you have multiple views. [00:39:38] This was one of the very first ideas people applied to 3D vision: just use 2D networks. And why would you want to use 2D networks? Because back then they were pretrained on ImageNet, and they were very good. ImageNet is much larger than 3D datasets, so any model pretrained on ImageNet had very good performance. So the easiest way to solve your 3D recognition problem was to first render into 2D. [00:39:59] Later, people sort of moved away from this, because people said: oh, we have more 3D
data, [00:40:05] so we should try to come up with 3D-native methods. And people also came up with ideas for connecting 3D and 2D through things like neural rendering. But now I feel this trend is coming back, because all these image and video models are getting so good. Many of you may have seen Veo 3 or whatever that was released yesterday, right? If they're so good, maybe we should rely a bit more on the image and video foundation models again, because they are trained on a thousand times, or tens of thousands of times, or maybe even a million times more data than 3D data. So [00:40:34] how can we incorporate that?
[00:40:38] But anyway, coming back: this was in some sense the very first method. People tried to apply deep learning to 3D data just by converting it into 2D, and it does very well on shape classification: you have shapes, you want to classify them into different categories, and these methods have very good performance. [00:40:55] And you can leverage a lot of the literature on 2D image models. But the issue is that you need some projection, and sometimes the input can be very noisy. People said: what if my input is too noisy? The point clouds or whatever are just not very good, and if I render them, they look kind of bad. [00:41:11] So is it possible to come up with more 3D-native methods? Later, people tried a number of 3D-native methods that apply deep learning directly on 3D data. As I said, the
easiest way [00:41:25] to do this is just to apply a 3D convolutional network to the voxels. So this is actually a deep belief network, which is a generative network, but still, you have 3D convolutional features. This is from 2015, by Princeton, and you can see that they learned a generative model that can actually synthesize 3D shapes, in the form of 3D voxels, at relatively low resolution. [00:41:50] But this is 10 years ago now, so back then this was considered pretty impressive. [00:41:56] And you can do all these conditional generations, conditioned on semantic labels: beds and desks and tables. You can synthesize these different shapes, and because this is a generative network, you can also use it for classification. So you can do 3D shape classification as well.
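To make "3D convolution on voxels" concrete, here is a hand-rolled valid-mode 3D convolution applied to an occupancy grid. This is a toy sketch, not the network from the paper being discussed; real models learn banks of such filters and stack many layers, ending in a classifier or a generative head.

```python
import numpy as np

def conv3d(vox, kernel):
    """Valid-mode 3D convolution (cross-correlation, as deep learning
    frameworks define it) of an occupancy grid with a single filter."""
    kd, kh, kw = kernel.shape
    D, H, W = vox.shape
    out = np.zeros((D - kd + 1, H - kh + 1, W - kw + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[z, y, x] = np.sum(vox[z:z+kd, y:y+kh, x:x+kw] * kernel)
    return out

# A 30^3 grid containing a solid 10^3 cube.
vox = np.zeros((30, 30, 30))
vox[10:20, 10:20, 10:20] = 1.0

# An averaging filter: responds with ~1.0 where its 3x3x3 window sits
# fully inside the cube, and 0.0 in empty space.
fmap = conv3d(vox, np.ones((3, 3, 3)) / 27.0)
print(fmap.shape)   # (28, 28, 28)
```

The cubic growth of the loops (and of the grid itself) is exactly the cost complaint about voxels that comes up a few paragraphs later.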
[00:42:16] And later, something that we actually did is: what if we just applied GANs in this generative setting? You can use GANs to generate 2D pixels; there's no reason you cannot use GANs to generate 3D voxels. So we did this very simple thing, which is to apply a GAN to 3D voxels, and it actually gives you pretty good generation of 3D objects. This was eight or nine years ago. Yeah. Okay. [00:42:39] And later, with Jun-Yan from CMU, we also did an extension: you can use GANs not only to generate 3D shapes, but you can also render them, project them onto 2D surfaces, so that you get the depth map of the 3D objects you generated, and then you can use a CycleGAN to convert this depth map into a color image. [00:43:03] Now you can have adversarial losses not only on the 3D shapes but also on the 2D pictures, right?
You want the 3D shapes to look realistic, so that they are indistinguishable from the 3D object data you have. You also want the 2D images to look realistic, so that they are indistinguishable from images of real cars. So then you can do 3D generation as well as 2D generation at the same time. [00:43:23] And because you have different latent vectors for the shape, for the viewpoint, and for the texture, you also have some level of controllability: you can change the viewpoint, you can change the textures, you can do interpolation, and you can transfer the texture of one car onto the shape of another car. This was 2018. [00:43:45] So people tried applying deep networks, neural networks, generative networks, to 3D voxels instead of 2D pixels. [00:43:52] And can we do a little better with voxels?
Because one thing people have complained about with voxels is that they're just really slow, right? You have to pre-sample them, and there is a lot of wasted effort, because many of the sample points are just empty space, or they're inside the object and give you no information. So naturally people thought: okay, can we actually make this better? So there are improvements to voxels, like octrees. With octrees you still have explicit representations; well, in some sense you can argue it's an implicit representation, but a nonparametric implicit representation.
But instead of representing every point in space at a uniform scale, you let the voxels be of different sizes. I divide the space into different regions, and I spend a lot more effort when I'm really close to the surface of an object, representing it at a much finer scale; and in empty space, or inside objects, where I really don't care too much about what's going on, I can have huge voxels. So you can recursively partition the space, with different sizes of voxels in different regions, and this allows you to really scale up. So you can see this compared with directly using voxels; this is 2019-ish.
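To make the adaptive-subdivision idea concrete, here is a minimal sketch (not any particular paper's implementation): a cube is split only when the surface might pass through it, tested against a hand-written signed-distance function for a sphere, so fine cells cluster near the surface while empty and interior regions stay coarse. The `sphere_sdf` shape and the split test are illustrative assumptions.

```python
import numpy as np

def sphere_sdf(center, radius=0.75):
    # Signed distance from a point to a sphere of the given radius at the origin.
    return np.linalg.norm(center) - radius

def build_octree(center, size, max_depth, sdf=sphere_sdf, depth=0):
    """Recursively subdivide a cube; return leaf cells as (center, size) pairs.

    A cell is split only if the surface might pass through it, i.e. the
    distance from the cell center to the surface is smaller than the cell's
    half-diagonal. Cells fully inside or outside the shape stay coarse.
    """
    half_diag = (np.sqrt(3) / 2) * size
    if depth == max_depth or abs(sdf(center)) > half_diag:
        return [(center, size)]
    leaves = []
    for dx in (-0.25, 0.25):
        for dy in (-0.25, 0.25):
            for dz in (-0.25, 0.25):
                child = center + size * np.array([dx, dy, dz])
                leaves += build_octree(child, size / 2, max_depth, sdf, depth + 1)
    return leaves

leaves = build_octree(np.zeros(3), size=2.0, max_depth=4)
uniform = (2 ** 4) ** 3  # a uniform 16^3 grid at the same finest resolution
print(len(leaves), "adaptive cells vs", uniform, "uniform cells")
```

The leaf count stays far below the uniform grid's cell count even though the finest cells have the same size, which is exactly the memory win described in the lecture.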
People said: okay, octrees are great, because they let me go beyond the resolution I can fit into GPU memory, right? You can do 64 x 64 x 64 with voxels, but with octrees you can do 256. And you can even use that for generation as well: you can generate objects that still look like voxels, but at a higher resolution, because you're more efficient at representing the space. So these were the very early attempts at applying deep learning to 3D: people said, okay, why don't we just try voxels?
[00:45:41] Then this is the moment where people got a bit more interested, where the graphics people felt: you're just doing all this wrong, right? Why do you want to use these pretty inefficient, ugly-looking representations like voxels or octrees?
Now we have all these good representations, point clouds, meshes, splines; why are we not using them? But as we said, the challenge is that the points are just scattered here and there, right? How can you even apply convolution to points? It's just not obvious. But people started to look into it, and naturally moved toward developing new deep learning methods that directly work not only on 3D data in general, but on these different types of 3D representations, like point clouds. And here I think PointNet is an important work, also from Stanford, from Leo's team. What's going on is that they developed a new type of deep network that works directly with 3D point clouds; it's called PointNet.
The idea is that for points, you have to be permutation invariant, right? Because say I have point 1 here and point 2 there, and then a different input where point 1 is there and point 2 is here. Your network should be invariant to these two inputs: no matter whether I name this one point 1 and that one point 2, or the other way around, your output should be the same, because the points are unordered. There's no guaranteed ordering in the sense that the top-left is (1, 1) and the bottom-right is (100, 100). So if the points are unordered, you have to be permutation invariant. How can we do that?
[00:47:22] And second, you also have to be invariant to sampling.
Sometimes you sample, say, ten points on the head of the bunny, the rabbit, and five points on its tail; and sometimes you sample ten points on the tail and only five points on the head. How can you also be invariant to that? Because there's no guarantee on how you sample the points. So there are a few issues here and there. But the one idea they used, and I think it's probably the most important point, is also so simple: I just apply a symmetric function to the embeddings of the points.
So basically, for all the points, I first compute some embeddings, just like you would compute embeddings for different regions or different windows of an image; I compute the features for each point, and then I just have to fuse them. But because I want to be permutation invariant, I use a symmetric function: for example, it can just be a max function, where I take the maximum in each dimension, or it can be a sum function, where I just add them up. So that's what's going on; it's so simple. You have a number of points, you compute embeddings for them, and you just aggregate them, say by taking the max in each dimension, or by summing them up, or something like that.
And then you have this aggregated embedding for all the points; you go through maybe a few fully connected layers, and then you use it to classify: are these points really representing a chair, or a table?
[00:48:55] So that's basically what's going on, and it turned out to be quite powerful. And of course there have been a lot of improvements on top of that. People have come up with new methods that improve on PointNet, there is PointNet++, and people have also tried things like graph neural networks, because you can easily translate points into nodes of a graph, with the proximity, whether two points are close to each other, giving the edges connecting these points.
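The symmetric-function trick above can be sketched in a few lines of NumPy. This is a toy stand-in for PointNet, not the real architecture: the per-point "embedding" here is a single random linear layer plus ReLU rather than a learned MLP, and the point is only to show that max-pooling makes the global feature independent of point order.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 16))  # toy per-point "embedding": one linear layer

def global_feature(points):
    """points: (N, 3) array. Embed each point independently, then max-pool.

    The max over points is a symmetric function, so any permutation of the
    input rows yields exactly the same global feature vector.
    """
    embeddings = np.maximum(points @ W, 0.0)  # per-point features (N, 16), ReLU
    return embeddings.max(axis=0)             # symmetric aggregation -> (16,)

pts = rng.standard_normal((100, 3))
f1 = global_feature(pts)
f2 = global_feature(pts[rng.permutation(100)])  # same points, shuffled order
print(np.allclose(f1, f2))  # True: permutation invariant
```

Swapping `.max(axis=0)` for `.sum(axis=0)` gives the sum-pooling variant mentioned in the lecture; both are symmetric in the points.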
So there have been graph neural networks and all these other methods developed for point cloud processing, but the original idea in the PointNet paper is so simple, and it turned out to be very powerful.
[00:49:37] Something else you want to consider is how you measure error. For pixels it's easy: I have an output image and a ground-truth image, and I just compute the difference between the two, an L2 loss or whatever. For points, how would you compare the output point cloud and the ground-truth point cloud? Especially if you care about a generation task; if you're doing classification, that's fine, you have an input point cloud, the output is chair, table, whatever, and a cross-entropy loss is all you need. But if you're doing a generation task and your output is voxels, that's also easy, right?
You just do a cross-entropy loss over the, you know, 100 x 100 x 100 voxel grid. But if your output is 100 points, how would you compare the output point cloud with the ground-truth point cloud? You also have to design distance metrics. So there are two common distance metrics that people use. One is called the Chamfer distance, and it is easy to understand: you have two sets of points, and for each point in each set you basically find its nearest neighbor in the other set. So you have a collection of red points and a collection of blue points; for each red point, you find its nearest neighbor in the blue set, and for each blue point, you find its nearest neighbor in the red set. And you want to minimize the distance of each point to its nearest neighbor in the other set.
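A minimal NumPy version of the Chamfer distance as just described, nearest neighbor in each direction, averaged over both sets. Conventions vary across papers and libraries (sum vs. mean, squared vs. unsquared distances); this sketch uses squared distances and means.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    For every point in a, take the squared distance to its nearest neighbor
    in b, and vice versa; average both directions and add them.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(1)
cloud = rng.standard_normal((50, 3))
print(chamfer_distance(cloud, cloud))        # 0.0: identical clouds
print(chamfer_distance(cloud, cloud + 0.5))  # positive for a shifted copy
```

Note the two sets need not have the same number of points, which is one reason Chamfer is the more commonly used of the two losses.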
And the second loss function that people may use is called the Earth Mover's Distance. Here you do a bipartite matching between the two sets of points, so you have a one-to-one pairing between the points, and you want to minimize the distance over all these pairs. So these are the two common metrics people use when comparing point clouds, and they can be made differentiable, which means you can compute gradients and use them to optimize your neural network, so that it hopefully outputs better point clouds, if you care about a point cloud generation problem.
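For very small point sets, the Earth Mover's Distance can be written down literally as the best one-to-one matching, found by brute force over permutations. This is only an illustration of the definition: brute force is factorial in the number of points, and practical implementations solve the assignment problem with the Hungarian algorithm (e.g. SciPy's `linear_sum_assignment`) or use approximations.

```python
import numpy as np
from itertools import permutations

def emd_bruteforce(a, b):
    """Earth Mover's Distance between equal-sized point sets a, b of shape (N, d).

    Search all one-to-one matchings (bipartite matchings) and return the
    minimum total pair distance. Only feasible for very small N.
    """
    n = len(a)
    best = np.inf
    for perm in permutations(range(n)):
        cost = sum(np.linalg.norm(a[i] - b[j]) for i, j in enumerate(perm))
        best = min(best, cost)
    return best

a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]])  # same points, reordered
print(emd_bruteforce(a, b))  # 0.0: a perfect one-to-one matching exists
```

Unlike Chamfer, EMD requires the two sets to have the same size, since every point must be matched exactly once.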
[00:51:18] So we have moved from voxels to point clouds, and people said: okay, this is great, now I can process points and output points. But we also have other beautiful representations, like splines; they're very good at capturing the surfaces of objects. If you use any kind of neural network to generate voxels or point clouds, the results always look very ugly, right? They don't have smooth surfaces and things like that. So how can we have a neural network that can output, or understand, objects while also representing these beautiful surfaces?
[00:51:48] So people went a bit further and thought about how to integrate neural networks with things like splines, or functions like that. And a notable example here is a thing called AtlasNet.
So what's going on here is that they use deep learning, but instead of directly outputting a set of 3D points, they learn a transformation function. I have a latent shape representation, and, if you remember, with these parametric representations of object shapes you're basically transforming, say, a 2D space of u and v into 3D, like a sphere. For simple things like a sphere it's easy; you can write the function down, sines and cosines and whatever. But for complex objects it is very hard to write such a function, and often there is no closed form.
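For the sphere case mentioned here, the parametric map f(u, v) → (x, y, z) really can be written in closed form, and this is exactly the function signature an AtlasNet-style MLP would learn for shapes where no such formula exists. A quick NumPy sketch of the closed-form case:

```python
import numpy as np

def sphere_chart(u, v, radius=1.0):
    """Closed-form parametric map from (u, v) in [0, 1]^2 onto a sphere.

    u sweeps the azimuth, v the polar angle. An AtlasNet-style MLP learns a
    function with this same signature when no formula can be written down.
    """
    theta = 2.0 * np.pi * u  # azimuth
    phi = np.pi * v          # polar angle
    return np.stack([radius * np.sin(phi) * np.cos(theta),
                     radius * np.sin(phi) * np.sin(theta),
                     radius * np.cos(phi)], axis=-1)

# Sample the 2D parameter space and push it through the map: every output
# point lands exactly on the unit sphere's surface.
u, v = np.random.default_rng(2).random((2, 1000))
points = sphere_chart(u, v)
print(np.allclose(np.linalg.norm(points, axis=-1), 1.0))  # True
```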
So the idea here is: okay, if there's no closed form, then why don't we just use a neural network to represent it? So here you can see this neural network, implemented as an MLP, just learns that function f: you take the two values u and v as the input to f, the network performs the computation of f on u and v, and it outputs a point in 3D space. So it's basically learning how to transform this 2D space into the 3D one. And it might be too hard to represent the entire object with a single transformation, so people thought: okay, we can use a couple of small neural networks.
So think about it as having a piece of paper that you can fold in different ways, multiple times, and all these pieces get put together to form the final shape you care about. So here you can see the differences between these three representations. You have an input image, and if you reconstruct it using voxels, you can see it's doing something, but you're really bounded by the limited resolution of the voxels. For point clouds, you're no longer bounded by the resolution, and they give maybe a bit more detail, but points are really unordered.
You cannot really get smooth surfaces out of point clouds. And for this thing called AtlasNet, which is basically learning transformed pieces, you can see they actually have smoother surfaces. So you use a neural network to represent how to map a parametric representation from a lower-dimensional space to a higher-dimensional space, you learn multiple of these mappings, and when they're combined, that gives you the final output geometry, conditioned on the 2D images.
[00:54:26] Okay. So finally, in some sense we can put it this way: what is a deep network doing when it does image classification? It's basically learning a very complex function that maps input images, in the form of pixels, into a final category label: is it a cat, or a dog, or a person, or whatever. That function is really complex, and the output space is
really small. The output space is like 1,000 dimensions, right? Is it a cat or a dog; you have 1,000-way classification. The output space is so small, while the input space is much larger, because you have, you know, 500 by 500 pixels, so that's 250,000 or something. The input space is much larger, the output space is really small, and the function is really hard to write. Is it possible for me to write down some formula so that I can classify the input image, by computing some specific values, and output whether this is a cat or a dog?
I cannot do that; the function is so hard to write, there's no closed form, and that's why you need a deep network: the input space is large, the output space is small. So if you really think about deep networks that way, and think about what they are doing, then you realize that a lot of the things we have been doing with deep networks on 3D shapes don't seem to map that well onto that equation. So what are the representations that map onto it best, the optimal representations that really fit into this paradigm?
[00:55:47] And if we think more carefully, around 2019 people realized: in some sense a deep network is an implicit function, so why don't we just use it to represent an implicit function for an object's 3D geometry? So instead of representing it with
voxels, where you just take pixels, scale up to 3D, and apply 3D convolution, note that fundamentally a voxel is really about whether a location is inside or outside the object. So instead of pre-querying the space to get the voxels and applying convolution on top of them, what if I just directly use the deep network to perform that query for me, so I don't have to run 3D convolution or anything? I just query a point in 3D space, and the deep network tells me; the output can just be one-dimensional, inside or outside, whether that point is inside the 3D shape or outside it.
[00:56:39] So finally, I think people took that leap, going from explicit representations, point clouds or splines, to implicit representations, but not
[00:56:49] directly working on voxels; instead, think of it as a level set, or some implicit function, that a deep network is used to represent. That's the final step: going from the atlas idea, where you learn a transformation from a 2D space into the 3D space, to directly doing implicit queries with the deep network. So that brings us to deep implicit functions, which is kind of interesting, because around that time, in 2019, there were something like four papers doing almost exactly the same thing. They all argued: before, we had been using voxels and point clouds and meshes, each with their own strengths and weaknesses, but really the right thing to do is just send the query into the deep network. What the deep network should do is take the input, let's say an xyz
[00:57:35] coordinate, and output whether that point is inside or outside the object. That is, in some sense, one of the final steps, and that's the idea that was proposed in 2019. Even right now, in 2025, a lot of people are still using this same idea: I'll just use a deep network to tell me whether a point is inside or outside the object. And you can go a little beyond a binary inside/outside classification, because maybe you care about a bit more: what is the signed distance function, how far is the point from the surface of the object; or what is the density value at that point; or, later, what is the color, what are the radiance values at that point. But starting from 2019, people really started to apply deep networks in a way similar to classification: take points in 3D space, and use the network as an implicit function to query some properties of those points. [00:58:29] And people have tried to use this to represent a collection of implicit functions: not just learning how patches deform into 3D space to give you different pieces of paper in 3D, but really representing implicit parts of objects with small neural networks, which can then compose into complex shapes. And if you can represent objects in 3D using implicit functions, you can do that not only for geometry: you can query not only whether a point is inside or outside the object, or how far the point is from the surface; you can also query the radiance, the color of the object.
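The query interface described here, send an xyz coordinate in and get inside/outside (or a signed distance) out, can be sketched in a few lines. In the sketch below an analytic unit-sphere signed distance function stands in for the trained network so the numbers are checkable; in the 2019-era methods the lecture refers to, this function would be a learned MLP. The names (`sdf_sphere`, `occupancy`) are illustrative, not from any particular paper.

```python
import numpy as np

def sdf_sphere(points, radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points, axis=-1) - radius

def occupancy(points, sdf=sdf_sphere):
    """Binary inside/outside query: the one-dimensional output the lecture mentions."""
    return (sdf(points) < 0.0).astype(np.float32)

# Query a few 3D points, exactly as you would query the deep network.
pts = np.array([[0.0, 0.0, 0.0],    # center: inside
                [2.0, 0.0, 0.0],    # outside
                [0.5, 0.5, 0.5]])   # inside (|p| ~ 0.87 < 1)
print(occupancy(pts))        # -> [1. 0. 1.]
print(sdf_sphere(pts))       # distances to the surface
```

Swapping `sdf_sphere` for a trained network changes nothing about the calling code, which is exactly the appeal of the implicit formulation.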
[00:59:07] And then, actually about one or two years later, people came up with this thing called NeRF. The difference here is that now you use the deep network to query not only the signed distance function or the density of the object, but also the radiance. So here you can see what's going on: you query NeRF with an xyz coordinate in 3D space, and in addition, because you're trying to model appearance as well, you also query the viewing direction, right? The camera viewing direction. And the output of the neural network is not just one or zero, inside or outside; it is the density value together with the color values, the radiance. Now, if you directly train implicit functions on 3D shapes, you require 3D supervision, right?
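A minimal sketch of the NeRF-style query just described: a position plus a viewing direction goes in, a density and an RGB radiance come out. The tiny random-weight MLP below only illustrates the input/output interface; the real NeRF model additionally uses positional encodings and a much deeper network, and all names and sizes here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Untrained toy weights: 6-D input (position + view direction) -> 4-D output.
W1 = rng.normal(0, 0.1, (6, 32))   # hidden layer
W2 = rng.normal(0, 0.1, (32, 4))   # -> [sigma, r, g, b]

def query_field(xyz, view_dir):
    """Query the radiance field at a 3D point from a given viewing direction."""
    x = np.concatenate([xyz, view_dir], axis=-1)
    h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
    out = h @ W2
    sigma = np.log1p(np.exp(out[..., 0]))       # softplus keeps density >= 0
    rgb = 1.0 / (1.0 + np.exp(-out[..., 1:]))   # sigmoid keeps colors in [0, 1]
    return sigma, rgb

sigma, rgb = query_field(np.array([0.1, 0.2, 0.3]),
                         np.array([0.0, 0.0, 1.0]))
print(sigma, rgb)   # a nonnegative density and a 3-vector color
```

The view direction in the input is what lets the model produce view-dependent effects like specular highlights.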
[00:59:55] So if you have a collection of 3D objects, you can use them as supervision: you have ground truth about whether a point is inside or outside a 3D object. But here you want to train on 2D images, and that's what's going on in NeRF. They put this together with a neural volume rendering function, and they made this volume rendering function differentiable, in the sense that you have a rendering model where you can query all these different points in 3D space, get their colors and also their densities and appearances, and then compute how much light is blocked along the way, right? So this is basically volume rendering, as in computer graphics.
[01:00:34] There were only very minimal changes made, because you can see, directly from the volume rendering equations, that everything here, although it is an approximation, is differentiable. So if the neural network gives you the density, which you can basically think of as the opacity of a point in 3D space, and also gives you the color, then you can compute how much light has been blocked by the points sampled ahead of that point along the ray, and also how much light any particular point contributes to what I'm going to see along this ray. So now you have a few things: a neural network representing implicit functions for colors, or radiance, and for densities, and [01:01:18] these volume rendering equations, which are made differentiable so that you can learn directly from 2D images. So these are the two things that changed. One: I no longer have to train on 3D shapes; I can train on 2D images through these volume rendering equations. And two: instead of looking only at the geometry or density of objects in 3D, I also look at their radiance, their appearance in 3D. These two changes lead to the big jump from implicit functions, DeepSDF, and all those earlier methods, to NeRF. So a lot of people feel like NeRF has been great and seemed to come out of nowhere. That's really not the case: if you look at the articles the authors later wrote themselves, they were very much inspired by all these advances in deep implicit functions.
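The two ingredients above, a field you can query and a differentiable rendering equation, can be written out concretely. Below is the standard discretized volume compositing sum used by NeRF-style methods: per-sample opacity alpha_i = 1 - exp(-sigma_i * delta_i), transmittance T_i = prod_{j<i}(1 - alpha_j), and pixel color sum_i T_i * alpha_i * c_i. The sample values are made up for illustration, and `composite` is a hypothetical helper name.

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """Composite colors along one ray; every step is differentiable."""
    alphas = 1.0 - np.exp(-sigmas * deltas)          # per-segment opacity
    trans = np.cumprod(1.0 - alphas)                 # light surviving past sample i
    trans = np.concatenate([[1.0], trans[:-1]])      # nothing blocks the first sample
    weights = trans * alphas                         # contribution of each sample
    return weights @ colors                          # expected ray color

sigmas = np.array([0.0, 5.0, 50.0])     # empty space, semi-dense, near-solid
colors = np.array([[1.0, 0.0, 0.0],     # red (never seen: zero density)
                   [0.0, 1.0, 0.0],     # green
                   [0.0, 0.0, 1.0]])    # blue
deltas = np.array([0.1, 0.1, 0.1])      # spacing between samples along the ray
print(composite(sigmas, colors, deltas))
```

Because the output is a smooth function of the densities and colors, the photometric loss against a 2D image backpropagates straight into the network that produced them, which is the whole trick that lets NeRF dispense with 3D supervision.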
[01:02:09] Those earlier methods focused only on geometry; NeRF does both geometry and appearance, and it learns from 2D images instead of from 3D shapes. So, yeah, here are some results of NeRF; you may have seen these many times. [01:02:24] Okay, so if you remember, we said that in the past we had been working on things like generating 3D shapes and then also generating their 2D appearances. At the very beginning we used voxels as the representation. But now, as we said, NeRF is great, and if we have implicit representations there's no need to represent things as voxels: what if we just replace that with a radiance field, right? So we did that as well.
[01:02:53] So we have a neural network that captures the implicit radiance field and densities, but it is a generative neural network, and you can still apply the same GAN framework, so that you can render objects in 3D as well as their 2D pictures. And you get the same controllability: you can change the camera viewpoint, or change the object identity while keeping the viewpoint; you can do all these things as before, but now, with NeRF, you can learn directly from images. [01:03:19] So you don't have to restrict yourself to categories like cars or chairs, where you have a lot of 3D data, because you can learn directly from images. And you can see that the output becomes much more realistic. This is the work we did called pi-GAN, with Eric Chan as first author, mostly with people from Gordon's group. [01:03:42] Okay. And finally: NeRF is great, but NeRF has this issue that you have to sample a lot of points in 3D. You're no longer pre-sampling them and applying a volumetric convolution, but still, just as with a level set, you have to sample all the points and query the network all the time. You can learn from 2D, you can do all these great things, but because you still have to do all this sampling, it's very slow. So people thought about it a bit
[01:04:12] more, again drawing on the graphics community: we have these good ideas about points and meshes, and the nice thing about them is that they are free in space and very efficient. So is it possible to integrate them? Can I have implicit representations, but without a fixed sampling grid, without sampling everywhere all the time, since that takes so much time? Maybe I really should put the two together. You can argue that NeRF parameterizes the scene very densely: you have to sample points densely in 3D, and a lot of samples are wasted, just like in voxels, where many cells represent empty space. You don't want that. In NeRF, a lot of the queries go into empty space, where the network may just give you a density of zero or something like that, but it is still taking a lot of time. So how can we address that? [01:04:57] What if I just sample more sparsely? I still have the implicit representation, but instead of sampling empty space all the time, I only sample at places where I know there is stuff. But how can I know that? What if I had a point representation? This is the idea behind this thing called Gaussian splats, which you may have heard of. It still has the same implicit functions, querying for densities, appearance, and so on, but instead of querying a neural network all the time, I have a point representation: these 3D Gaussian blobs in 3D space, which you can sometimes think of as a point cloud, except the points are not single points; they're like blobs,
[01:05:41] like little regions. And because you know where these blobs are, when you send out a ray from your camera into 3D space and sample points, you don't have to sample everywhere. You just look at where the blobs are, and based on the radii of these different Gaussians, you sample only at regions where you know there is some stuff. So this makes rendering much more efficient. [01:06:04] And here are some reconstruction results using 3D Gaussian splats.
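The culling idea described here, sample only where a blob says there is stuff, can be sketched as follows. This toy version only does the ray/blob proximity test; real 3D Gaussian splatting stores anisotropic covariances, opacities, and colors and rasterizes the blobs, and every name and number below is illustrative.

```python
import numpy as np

def blobs_near_ray(origin, direction, centers, scales, k=3.0):
    """Return indices of Gaussians whose center lies within k*scale of the ray."""
    d = direction / np.linalg.norm(direction)
    rel = centers - origin                    # vectors from ray origin to blob centers
    t = rel @ d                               # projection of each center along the ray
    closest = origin + np.outer(t, d)         # closest point on the ray to each center
    dist = np.linalg.norm(centers - closest, axis=1)
    return np.where((t > 0) & (dist < k * scales))[0]

centers = np.array([[0.0, 0.0, 5.0],    # on the ray
                    [0.0, 3.0, 5.0],    # off to the side
                    [0.0, 0.1, 9.0]])   # near the ray, farther away
scales = np.array([0.2, 0.2, 0.2])
hit = blobs_near_ray(np.zeros(3), np.array([0.0, 0.0, 1.0]), centers, scales)
print(hit)   # -> [0 2]: only the blobs the ray actually passes through
```

Only the returned blobs get composited, which is why the explicit point representation sidesteps the dense per-ray network queries that make NeRF slow.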
[01:06:14] And you can see that, in terms of quality, they are comparable to NeRFs. These are different metrics, PSNR and SSIM, for rendering quality; the y-axis doesn't start from zero, so the plot is a little misleading, but basically you can see these numbers are really close. So in terms of rendering quality, Gaussian splats and NeRFs are similar, at least as first proposed, but Gaussian splats are just much more efficient, right? This is FPS, frames per second: you can render about 150 pictures per second, while for NeRF it takes maybe 20 seconds to render a single picture. So this thing is roughly 1,000 times faster, at least that's what they argued, because you no longer waste all your computing power querying your network about points that are simply in empty space. [01:07:11] Okay. So that is basically how deep learning has been integrated with 3D data across all these different representations: how they got started, how they have evolved, and their connections with all these different shape representations. And one thing we didn't talk about, which I'll take just two minutes to quickly cover: there have also been interesting ideas about object geometry concerning not only the element-level geometry, the specific details of the parts, but also the structure, because often, for example, chairs are symmetric, right? We talked a little bit about this with parametric surfaces, where you can
[01:07:47] parameterize part of a surface using something like a sphere, with closed-form equations, and that gives you a little bit of symmetry. But there have also been more systematic studies of the regularities and structures within object geometry, including repetitions and symmetries, and people have come up with different representations for these as well. In some sense, you can argue that point clouds, meshes, and implicit functions really represent geometric detail, maybe for the individual parts; none of them directly captures regularities like symmetry or repetition. So how can we capture that? A few other attempts people have explored, mostly in the graphics community: represent an object basically as a collection of simple geometric parts, [01:08:34] like a part set. And there have been methods that apply deep learning to this, using deep networks to represent different parts of an object with simple geometric primitives and then composing them, or using implicit functions and composing those, as we talked about before. But there have also been attempts to do a bit more: not just representing an object as a collection of parts without considering their relationships, but also modeling the relationships between the parts. This is even more the case for scenes: say, a bed is usually next to the wall, chairs are usually next to tables, and so on. So you not only want to represent them as an unrelated collection of parts or objects; you also want to capture their relationships, in hierarchies.
you know when you are constructing when you're building uh [01:09:19] constructing when you're building uh you're doing some constructions you're [01:09:20] you're doing some constructions you're architecture uh you're architecting [01:09:22] architecture uh you're architecting design your building um then you of [01:09:25] design your building um then you of course you know you're not like just [01:09:27] course you know you're not like just like representing objects or their [01:09:28] like representing objects or their relationships you have to consider [01:09:30] relationships you have to consider hierarchies what you build first uh [01:09:32] hierarchies what you build first uh there's a classroom and the classroom [01:09:33] there's a classroom and the classroom has you know there's some tables and [01:09:35] has you know there's some tables and chairs in it and chairs has parts [01:09:36] chairs in it and chairs has parts there's basically like a kind of level [01:09:38] there's basically like a kind of level hierarchy and how this can be used and [01:09:40] hierarchy and how this can be used and integrated with neuronet networks as [01:09:42] integrated with neuronet networks as well as you know you have not only [01:09:43] well as you know you have not only hierarchy but also you can you can [01:09:44] hierarchy but also you can you can compose hierarchies and relationships [01:09:46] compose hierarchies and relationships Right? So you have a hierarchal graph [01:09:48] Right? 
[01:09:48] So you have a hierarchical graph where, say for chairs, you have different levels of hierarchy for bases, for seats, for backs, and the bases may have different legs. But the legs themselves are also related, right? The left leg of the chair and the right leg of the chair are supposed to be symmetric; they should have identical shapes. There are constraints on where these legs are: they have to be really aligned, otherwise the chair is going to fall. [01:10:10] So there are all these constraints that are pretty useful, and the question is how we could represent them. People have come up with all these different representations, and for each of them there are also a lot of neural network, deep learning, methods designed to learn, to capture, and to generate objects that satisfy all these constraints.
[01:10:28] For example, you can see this is a kind of hierarchical graph encoder and decoder that tries to represent and generate 3D chairs that satisfy all these constraints while maintaining their hierarchies. Right? I think this is also from the ODUS group, from 2019. [01:10:43] And sometimes we can even represent shapes using some form of program, right, because there are repetitions and for loops, and we can ask how this can be incorporated into neural networks that generate programs which synthesize object shapes and synthesize the relations between these object parts.
[01:11:00] And that's also an important topic. Most recently, let me end by saying that I think there has been a new trend just in the past year or two: deep networks, large language models, are doing so well, and they understand things so well, that people are exploring whether it's possible to just use large language models like GPT to output these programs, because they understand the semantics, what the chair should be like, what constraints the chair should satisfy. So is it possible to use a large language model to output the programs, but then maybe use implicit functions or whatever to capture the specific geometric details of the parts of the objects, like the chairs? [01:11:38] So there's a kind of new emerging trend of research happening right now. Okay, I think that's all I have. Thank you.

================================================================================ LECTURE 016
================================================================================ Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 16: Vision and Language Source: https://www.youtube.com/watch?v=mQOK0Mfyrkk --- Transcript [00:00:05] Thank you everyone for coming. We have another guest lecture, and today we have Ranjay Krishna. Ranjay Krishna is an assistant professor at the school of computer science and engineering at the University of Washington, and he co-directs the RAIVN lab. He has taught previous iterations of CS231N, in 2020 and 2021, and his research lies at the intersection of computer vision, natural language processing, robotics, and human-computer interaction. In today's lecture he will discuss multimodal foundation models. Ranjay, the floor is yours. [00:00:39] Thank you. It's great to be back. The first time I ever taught this course here at Stanford, it was 2020, and we had about three weeks where we had to take all the material and move it online.
[00:00:51] Every year after that has been much easier to teach. It's great to be back. So today we're going to talk about multimodal foundation models. [00:01:00] A lot of the lectures in this class so far have really been focused on building individual models for individual tasks. These usually follow a few steps that you've seen over and over again in lectures. You collect a dataset, usually a training set as well as a test set. Then you train a very specialized model for that purpose; that could be an image classification model or an image captioning model, like the ones you've seen in your assignments. And then you finally evaluate those models on your test set.
[00:01:27] Now, what's been different in the field in the last couple of years is this shift away from these individual models toward building foundation models. The way to think about foundation models is that you pre-train a model on a wide variety of skills, a wide variety of different tasks, and then later adapt it to individual tasks depending on your needs. [00:01:53] For example, one very common foundation model that you all probably use in some form or another is GPT. GPT was trained on a lot of Common Crawl data from the internet, and then you take that model and fine-tune it for different purposes: for math problems, or symbolic reasoning, or trivia questions. All of these are individual tasks that this model can quickly adapt to.
[00:02:15] Now, what's nice about foundation models is that they let you do that update step, the adaptation to new tasks, with very minimal data. Meaning, you don't need to collect a large amount of training data; you can usually get away with very little. Oftentimes you can even get away with collecting no training data at all. [00:02:33] And when you think about foundation models, there are many different classes of foundation models you might care about. In language, you've got ELMo and BERT, which really started this entire revolution, and we now have GPT and T5 and variants of these models. These are things we're not going to talk about in this class, since we're mostly going to be talking about multimodal models.
[00:02:54] What we will talk about is how you build these same kinds of foundation models for image classification, and we'll go into examples like CLIP and CoCa today. We'll also talk about how you combine language models, which you might have seen already in class, with these vision foundation models to enable all kinds of new multimodal foundation models that can solve a wide variety of tasks. [00:03:16] And of course we can do a lot more than just solve tasks in language. We'll talk about how you can build models that output not just text but also masks, or images that you might want to generate. And then finally, we'll talk about this idea of chaining, where you take a bunch of foundation models and combine them to do all kinds of new things together. [00:03:37] Now, when we talk about foundation models, there are many different ways to classify them.
[00:03:41] It's hard, because the definition is often disagreed upon, but what you typically see in a foundation model is that it's robust and general across many different tasks. You can apply the same model to all different use cases, and I'll show you a ton of use cases today. [00:03:58] Something else that's common in a lot of these foundation models is that they have large numbers of parameters and large amounts of training data, and usually they're trained with some sort of self-supervised objective. [00:04:11] So of course we're not going to talk about the language stuff; what we will talk about are the ones in green today. And so let's get started with image classification.
[00:04:17] So how do we actually go about building a foundation model that can solve image classification for any dataset you might care about? Now, if you remember from a few lectures ago, we were talking about self-supervised learning, and one of the methods you saw was SimCLR, where you have this contrastive objective that pushes apart dissimilar images and pulls closer representations of the same image that has been transformed in some way or the other. [00:04:47] You can think of this idea as pulling together similar concepts: different augmentations of a cat should result in representations that are similar to one another, but it should push away representations of other categories, like dogs, for example.
[00:05:04] Now, the hope with training with these self-supervised learning objectives is that the representations become general enough, right? So that when you see something new, maybe a sketch of a cat or a sketch of a dog, it still embeds those in the space such that it's easy to classify exactly what those concepts are. [00:05:21] Moving on to multimodal, we can take these same ideas, the same objective, and start thinking about what would happen if we added text to that representation space. For example, if we could also embed a representation of the text "a cute fluffy cat" and have it be close to the cat representations, that would be great, because now we can query things in both images as well as text.
[00:05:43] And similarly, if we can also embed the phrase "my favorite dog is a golden retriever," ideally that representation would lie closer to golden retrievers than to other kinds of dogs. So that's the general idea behind adapting the self-supervised learning objectives we've been talking about in class so far to incorporate text and other multimodal inputs. [00:06:05] In SimCLR, if you remember, the main objective was to pull together transformations of the same image. So the cat should be closest to its other cat augmentation; the green arrow right there indicates two things that should be pulled together, and it should be further away from all the other augmentations. Any other image of a dog or a monkey, you want those representations to be far away. [00:06:29] Now we can use that same idea and think about training a CLIP model.
[00:06:34] In CLIP, they still have that same image encoder that you have on the left-hand side, but on the right-hand side you now have a text encoder, and this text encoder embeds descriptions of those individual images. So your dog image will now hopefully learn that it should be closer to the representation of the text "my favorite dog is a golden retriever" and far away from all the other representations. [00:07:00] And because this is the same formulation that you've seen with SimCLR, you train a model like this just by collecting a lot of image-text pairs. Once you have those pairs, feed them into the model in a mini-batch, and apply the contrastive objective that we used for SimCLR, but now across images and text.
[00:07:20] So we're pulling together, here in the numerator, the representations of similar things, and pulling apart, in the denominator, the representations of everything else. Now, of course, we want each image to be closest to its corresponding text and far away from all the other text. But we also want the inverse to be true, so we have a second objective that says every text should be closest to its image and further away from all the other images. Right? So it's a complementary, symmetric loss between the two different modalities that you're feeding into this learning objective. [00:07:55] Okay? Now, what's really nice about a CLIP-like model is that it can be trained with just associations of images and text, and there's a ton of this data on the internet.
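That symmetric objective is compact enough to sketch. Below is a minimal NumPy illustration of the idea (my own sketch, not OpenAI's implementation): the diagonal of the batch similarity matrix holds the matching pairs, and we average a cross-entropy over rows (image-to-text) with one over columns (text-to-image).

```python
import numpy as np

def softmax_rows(x):
    """Numerically stable softmax over each row."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss for a mini-batch of N matching
    image/text embedding pairs (row i of img_emb pairs with row i of txt_emb)."""
    # Normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N); diagonal = matching pairs
    diag = np.arange(len(logits))
    # Each image should "pick" its own text (rows) ...
    loss_img = -np.log(softmax_rows(logits)[diag, diag])
    # ... and each text should pick its own image (columns).
    loss_txt = -np.log(softmax_rows(logits.T)[diag, diag])
    return (loss_img.mean() + loss_txt.mean()) / 2
```

An aligned batch scores a much lower loss than a shuffled one, which is exactly the pressure that pulls matching pairs together and pushes everything else apart.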
[00:08:06] So you have a lot of data of corresponding images and text that you can pull from the internet; you can download it and train this model at a very, very large scale. And this is exactly what OpenAI did a couple of years ago, in 2021, when they released their CLIP model. [00:08:21] They collected a lot of that data and trained using this contrastive objective, with all of the image-text pairs they found on the internet. Once they were done training, you follow the same two-step pipeline that you saw in the self-supervised learning class: in step one you do the pre-training, and in step two you take that image encoder and adapt it to a new task.
[00:08:44] So once you have this pre-trained image encoder, you take its weights and tag on an additional linear layer on top to adapt it to an image classification task or a detection task, or you can put in something like a decoder and even decode out semantic segmentation maps. Right? So a ton of different tasks become possible just by initializing your model from this pre-trained objective. [00:09:07] What was really exciting when this paper came out is that adding this one linear classifier on top of the CLIP encoder led to really large improvements in performance.
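That adaptation step, freezing the encoder and fitting a single linear layer on its output features, is small enough to sketch in plain NumPy. Here `feats` stands in for the frozen CLIP image-encoder outputs (hypothetical data; the probe itself is just softmax regression):

```python
import numpy as np

def train_linear_probe(feats, labels, n_classes, lr=0.5, steps=300):
    """Fit one linear layer (softmax regression) on frozen features.
    feats: (N, D) encoder outputs; labels: (N,) integer class ids."""
    n, d = feats.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n          # softmax cross-entropy gradient
        W -= lr * (feats.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_predict(feats, W, b):
    """Class prediction from frozen features plus the learned layer."""
    return (feats @ W + b).argmax(axis=1)
```

Only `W` and `b` are updated; the encoder that produced `feats` never changes, which is what makes this adaptation so cheap.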
[00:09:19] So here in this graph I'm showing you average performance across many different image classification datasets, and the CLIP models, the ones in red, are all the way at the top. You can see that as you train on more and more images, you end up getting better and better performance. [00:09:35] It was very exciting, because it seemed to indicate that there's this really nice pre-training objective that we've been able to unlock, and there's an abundance of image-text data on the internet, which means we can train these models to be very, very large and very, very performant. [00:09:50] Of course, that's not the end of the story. Ideally, we don't want to have to adapt these features for something new; we would want to use a CLIP model out of the box. [00:10:02] In language models, for example, you usually train a model to autocomplete, and this autocompletion
And this autocomp completion kind of works like this. You have a [00:10:07] kind of works like this. You have a phrase that says I love and then your [00:10:10] phrase that says I love and then your model sort of fills in the next word. [00:10:12] model sort of fills in the next word. For example, cake. And then you train [00:10:14] For example, cake. And then you train with this pre-training objective. And [00:10:16] with this pre-training objective. And what you want to do during the second [00:10:17] what you want to do during the second stage is to basically take that same [00:10:20] stage is to basically take that same model and adapt it to a new task. For [00:10:22] model and adapt it to a new task. For language models, you never have to [00:10:24] language models, you never have to retrain that model. You never have to [00:10:26] retrain that model. You never have to retrain it on a new downstream task. [00:10:28] retrain it on a new downstream task. Every task is a language task. And so [00:10:30] Every task is a language task. And so every task can be treated as this sort [00:10:32] every task can be treated as this sort of autocomplete uh process. But with [00:10:35] of autocomplete uh process. But with clip the problem is there is no [00:10:37] clip the problem is there is no autocomplete process. Right? So we've [00:10:39] autocomplete process. Right? So we've trained this model on this contrastive [00:10:41] trained this model on this contrastive objective. But to adapt it to a new [00:10:43] objective. But to adapt it to a new task, we still need training data and we [00:10:46] task, we still need training data and we still need uh a linear layer on top that [00:10:49] still need uh a linear layer on top that we need to train to adapt it to new [00:10:50] we need to train to adapt it to new tasks. So a lot of people started [00:10:52] tasks. 
So a lot of people started [00:10:52] thinking about what we can do to adapt this model so it can be used directly out of the box. And there's this clever trick that people came up with, and this clever trick is basically using the text encoder as a way of guiding the model to generalize to any downstream classification task. [00:11:14] And it works like this. Let's say you want to classify what an image is using a CLIP model, but you don't want to retrain this model or adapt it for any downstream task. What you can do is [00:11:24] take the text encoder, pass a word through that text encoder to create a text vector, and use nearest neighbors to figure out the right classification. [00:11:37] So the way this works is: you take all the categories in your new data set. So for example, let's say your new data set contains the categories plane, dog, and bird.
You're going to [00:11:45] embed all of them in the text space to get a vector for plane, a vector for dog, and a vector for bird. And now, when a new image comes in, all you have to do is embed that image using the image encoder and then find the closest neighbor. [00:12:01] So in this case, you should find that this image has the highest similarity with the correct class, in this case the dog vector. And you can see the dog vector does have the highest similarity score, and so because of that, you can now classify that image as a dog. Okay. [00:12:18] Now, you can think of this entire process as essentially building a one-nearest-neighbor algorithm, right?
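[Editor's note] The one-nearest-neighbor procedure just described can be sketched in a few lines of numpy. The 2-D vectors below are made-up stand-ins for the embeddings a real CLIP image/text encoder would produce; only the similarity-and-argmax logic is the point here.

```python
import numpy as np

def zero_shot_classify(image_vec, text_vecs, labels):
    """Classify an image embedding by nearest neighbor (cosine similarity)
    against one text embedding per class name."""
    # Normalize so that a dot product equals cosine similarity,
    # as CLIP does before comparing embeddings.
    img = image_vec / np.linalg.norm(image_vec)
    txt = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    sims = txt @ img                       # (k,) similarity scores
    return labels[int(np.argmax(sims))], sims

# Toy stand-in embeddings (a real system would use CLIP's encoders):
labels = ["plane", "dog", "bird"]
text_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
image_vec = np.array([0.1, 0.9])           # closest to the "dog" vector
pred, sims = zero_shot_classify(image_vec, text_vecs, labels)
print(pred)  # dog
```

Note that nothing here is trained: classification is pure retrieval against the embedded class names.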
So you [00:12:25] have a bunch of centers, or embeddings, that you've generated in the text space, and now you can use them as your class category labels, and you're doing one-nearest-neighbor to find the optimal classification for any new image that comes in. [00:12:40] Now, of course, a single word might not be sufficient to get a really good word vector. Instead, what you might want to do is use a phrase. And the reason you might want to do this is that a lot of the internet data [00:12:54] usually doesn't have words that occur by themselves; CLIP was trained on phrases that were downloaded from the internet. And so, ideally, you want to pick the right phrase that gives you the best representation. [00:13:07] So instead of just having the categories plane, dog, and bird, you might instead want to embed a vector that represents "a photo of a plane", "a photo of a dog".
And it turns [00:13:16] out that if you make this one small change, you suddenly get a large boost on ImageNet, where you see an improvement of about 1.3%. [00:13:24] Of course, picking that right phrase is also something that's very difficult to do. And so what people typically do is they don't just pick a single phrase; they pick many different phrases. So "a photo of a dog", "a drawing of a dog", or a bunch of different ideas for different phrases. [00:13:38] And you want to create many different vectors for all of those different phrases you might think of. And at the end, what you do is just take the mean vector representation across all of your phrases for each category, and use that as your mean dog vector, your mean plane vector, and your mean bird vector, right?
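[Editor's note] The prompt-ensembling trick above (mean embedding over many templated phrases per class) can be sketched as follows. The templates are illustrative examples, not the actual set OpenAI used, and `embed_text` is a deterministic toy stand-in for CLIP's text encoder.

```python
import numpy as np

# Hypothetical prompt templates; real systems use many more.
TEMPLATES = ["a photo of a {}", "a drawing of a {}", "a close-up of a {}"]

def embed_text(phrase):
    """Toy stand-in for CLIP's text encoder: a deterministic
    unit vector derived from the phrase's bytes."""
    rng = np.random.default_rng(sum(phrase.encode()))
    v = rng.normal(size=8)
    return v / np.linalg.norm(v)

def class_vector(name):
    """Mean of the embeddings of every templated phrase for a class,
    re-normalized -- the prompt-ensembling trick from the lecture."""
    vecs = np.stack([embed_text(t.format(name)) for t in TEMPLATES])
    mean = vecs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# One ensembled "center" per category, ready for nearest-neighbor lookup.
centers = {c: class_vector(c) for c in ["plane", "dog", "bird"]}
```

The resulting per-class centers slot directly into the one-nearest-neighbor classification described earlier.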
And then [00:13:58] you're back to where you started, and you can do your same sort of one-nearest-neighbor algorithm on this. [In response to a question:] It probably has been trained on ImageNet; this is, I think, a point to show that you can adapt it to a new task. But I will show you other examples of data sets that it has definitely not been trained on, and it does adapt to those as well. [00:14:17] [On what the encoder outputs:] You get a single vector out; it depends on the architecture you're using. If you're using a ResNet, you take the final vector representation. If your encoder is, let's say, a ViT or a transformer, then you usually take the CLS token of your transformer. [00:14:36] Okay, so that's sort of it for CLIP.
You could basically adapt this for a wide variety of new image classification tasks. And to your question right now: of course, it's not that big a deal that it performs just as well on ImageNet, but it is still exciting that it does well on ImageNet at all. [00:14:54] What's more interesting, I think, is when you look at other data sets, data sets that were collected after CLIP came out. So a data set like ObjectNet, which contains objects that people took photos of in very weird places. So they put a banana on the ground and took a photo of it, or they took a banana that was really rotten and took a photo of it. [00:15:12] So, things that are just not common. And on this data set, if you train on ImageNet, you don't do very well, because ImageNet, again, contains most of these categories in their most typical form.
But if you take the CLIP model, [00:15:26] it performs just as well. And that was really, really exciting for many people, because this ability to generalize to a completely new data set that it hasn't seen before, one that's even out of domain to some degree, was really great. [00:15:38] So why do you think this is? Why do you think CLIP generalizes so much better than training on ImageNet? To paraphrase your response, because I think it's the right response: the text that you download from the internet contains a lot more than the category labels. It contains a lot more structural information; it contains information about shape, about the colors of things, and all of that adds to the representations. [00:16:03] And so these models are able to adapt a lot better to something that maybe is slightly out of distribution, or an object that looks slightly different, because it does have all of these other things it's looking
for as well. [00:16:12] And so that additional supervision really helps quite a lot. The other reason it helps quite a lot is the scale of data. ImageNet is only about 1.3 million images or so, whereas the internet contains, at this point, billions of image-text pairs that we can download very easily. [00:16:29] And so these models have just seen so much more data that this adaptation becomes a lot easier. And so people started doing these experiments on a wide variety of generalization tasks. So they showed that you can generalize these models not just to natural images but also to sketches, and that you can also do this on adversarial data sets as well. [00:16:50] And performance across the board seemed to indicate that these models are just really, really good and robust across many different applications.
And then here [00:16:59] I'm showing you the difference between zero-shot and linear probe. And you can see that, of course, linear probe, when you add that additional linear classifier, train it, and adapt the model a little bit, does improve performance on the majority of the data sets, the ones in green. But it's not always the case: in some cases, CLIP zero-shot just performs really well out of the box. [00:17:20] And so it seemed to indicate that we had finally unlocked this capability of being able to adapt image encoders for a wide variety of different downstream tasks. And this is why, I think, a lot of people talk about CLIP as the first sort of foundation model for images. [00:17:35] So let's talk about what makes CLIP work so well. Of course, there are no real labels as such with CLIP; we're just downloading any sort of text associated with images.
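[Editor's note] The linear-probe adaptation compared above freezes the encoder and trains only a single linear classifier on its features. A minimal sketch on toy, made-up "frozen" features, using ridge regression for the linear layer (a real probe would use actual CLIP image embeddings and typically a logistic classifier):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for frozen encoder features: two well-separated classes.
feats = np.vstack([rng.normal(0, 1, (50, 4)) + [3, 0, 0, 0],
                   rng.normal(0, 1, (50, 4)) + [0, 3, 0, 0]])
labels = np.array([0] * 50 + [1] * 50)

# Linear probe = one linear layer on top of the frozen features,
# fit here in closed form by ridge regression on +/-1 targets.
X = np.hstack([feats, np.ones((100, 1))])     # append a bias column
y = np.where(labels == 0, -1.0, 1.0)
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(5), X.T @ y)

preds = (X @ w > 0).astype(int)
accuracy = (preds == labels).mean()
print(accuracy)
```

The encoder's weights never change; only `w` is learned, which is why linear probing is so cheap compared to fine-tuning.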
What makes CLIP work so well [00:17:45] is, as I was saying: when it was first trained, the parameters were just gigantic. They scaled up the model, and they changed the architecture from a ResNet to a ViT, and so you had this transformer architecture with 307 million parameters that was used to train this model. [00:18:04] And the second thing that helped was the amount of data. So instead of just 1.2 million images from ImageNet, you suddenly had about 400 million image-text pairs from the internet that they downloaded and used for training. [00:18:17] So that scale, both in terms of model size and the amount of data, helped improve performance quite a lot.
So immediately after CLIP came out, [00:18:26] people started experimenting with this objective, and there are many different variants of CLIP that have come out over the years. But one in particular that's really stood out came out in 2022. It's called CoCa. [00:18:39] And CoCa took the CLIP model; here you can see it's the same sort of objective. You've got the image being encoded on one side, you've got the text being encoded on the other side, and then you have that contrastive loss between the two. But they added one additional thing: they added a decoder as well, which took the image features from the image encoder, fed them in through cross-attention, and captioned the image. [00:19:02] And it turns out this captioning process also helps the model learn quite a lot of rich information.
So the general motivation here is that [00:19:09] it's not sufficient to just be able to say this is an image of a cat versus a dog; describing that image in text requires a lot more information to be learned by the model. And so the hypothesis is that it's a stronger learning objective, and because of that, it learns better features. [00:19:27] And we found that to be true overall. When you compare CoCa to CLIP, its performance improves quite a lot across all the different ImageNet variants, and overall there's something like a 10% boost in performance across all of the data sets. [00:19:41] And I think this was the first time where these sorts of foundation models actually beat all of the models that we had trained with supervised learning.
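[Editor's note] The CoCa objective described above is a weighted sum of the CLIP-style contrastive loss and a captioning (next-token cross-entropy) loss. Below is a numpy sketch of that combined objective; the embeddings, decoder logits, and the two loss weights are all made up for illustration and are not CoCa's actual values.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings:
    matching image/text pairs sit on the diagonal."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return (loss_i + loss_t) / 2

def caption_loss(token_logits, token_ids):
    """Next-token cross-entropy from the captioning decoder."""
    logp = log_softmax(token_logits, axis=-1)
    return -logp[np.arange(len(token_ids)), token_ids].mean()

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
token_logits = rng.normal(size=(5, 100))       # decoder logits per position
token_ids = rng.integers(0, 100, 5)            # ground-truth caption tokens

# CoCa-style total loss: both signals, with illustrative weights.
total = 1.0 * contrastive_loss(img, txt) + 2.0 * caption_loss(token_logits, token_ids)
```

The hypothesis in the lecture is exactly this: the captioning term forces the image features to carry enough detail to generate the text, not just to match it.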
So at this point, [00:19:50] we had many different models that people were putting out onto online leaderboards, and on those leaderboards, across the years, you can see the trend going upward as models perform better and better. And this is, I think, the turning point where people abandoned supervised learning objectives for image encoders [00:20:06] and instead focused solely on pre-training objectives using these sorts of self-supervised learning methods on internet data. [00:20:16] Okay, so let's talk about some advantages of CLIP. CLIP's got a lot of really fun things that you can do with it. It's super easy to train, right? Because it's just a simple contrastive learning objective. It's also really fast in terms of inference.
You can embed your entire data set [00:20:29] into some representation, and then all you have to do to classify is just do retrieval on that embedded data set. So you can retrieve things very easily with CLIP's representations, which makes it really useful not just for classification tasks but also for search and retrieval tasks as well. [00:20:48] Another thing that people really liked about CLIP is that it's open-vocabulary: you can feed in any text description, and it should be able to retrieve the right images for you. And so that also allows for its applicability across many different domains. [00:21:02] And of course, we're going to talk about this later: CLIP is really amenable to being chained with other models, and this idea of chaining started becoming really popular. But hold off on that; we'll talk about that in a few minutes.
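[Editor's note] The search-and-retrieval use above amounts to: embed the image collection once, then rank images by similarity to an embedded text query. A toy sketch with random stand-in embeddings (a real system would use CLIP's encoders, and a vector index for large collections):

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Pretend these are CLIP image embeddings for a small collection,
# computed once and stored.
image_index = normalize(rng.normal(size=(1000, 16)))

def search(query_vec, k=5):
    """Return indices of the k images most similar to a text query vector."""
    sims = image_index @ normalize(query_vec)   # cosine similarities
    return np.argsort(-sims)[:k]                # top-k, best first

query = rng.normal(size=16)                     # stand-in text embedding
top = search(query, k=5)
```

Because classification, search, and retrieval all reduce to this same similarity lookup, the expensive encoding work is done once per image.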
Of course, I'm telling you [00:21:15] all the good things; turns out there's a lot of bad as well. CLIP, unfortunately, cannot distinguish between these two images: you have an image of a mug in grass, and you have some grass in a mug, and CLIP just does not know the difference between these two things. [00:21:31] Okay. The reason it doesn't know is that CLIP's learning objective really depends on its batch size. If your batch size is not large enough, then all of the other batch elements are unlikely to provide any useful supervision for the model. [00:21:49] If you're always comparing a cat versus a truck, you're not really going to learn a representation for a cat. Instead, what you get is some sort of representation that's kind of okay at some high level. But if you increase the batch size, you're more likely to encounter other animals that are similar to the cat.
And then you learn a much [00:22:06] better representation. And then, of course, if you increase your batch size to, let's say, 32,000, and you train across many, many GPUs, then suddenly you start learning really good representations. You can actually start identifying a Welsh corgi versus another corgi. [00:22:20] And this is only possible when you have gigantic batch sizes, because it requires you to have other negative examples in your batch that are close enough, that are sort of hard negatives, that force the model to learn. Right? So that's very important for getting these models to work well. [00:22:37] But unfortunately, regardless of how much people have tried, increasing this batch size doesn't guarantee that the model will learn a good representation for things. And so you're sort of at the mercy of the randomness of your training data.
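[Editor's note] The batch-size dependence comes from the shape of the contrastive loss: in a batch of B pairs, each image's own caption is the positive and the other B-1 captions are its negatives, so small or easy batches give almost no signal. The toy sketch below (made-up 3-D embeddings) shows that a batch containing a hard negative (a tiger, close to the cat) produces a much larger loss, i.e. more learning pressure, than one with only an easy negative (a truck).

```python
import numpy as np

def info_nce(imgs, txts, t=0.07):
    """Image-to-text InfoNCE: each image's caption is the positive,
    the other captions in the batch are the negatives."""
    imgs = imgs / np.linalg.norm(imgs, axis=1, keepdims=True)
    txts = txts / np.linalg.norm(txts, axis=1, keepdims=True)
    logits = imgs @ txts.T / t                  # (B, B); positives on diagonal
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

# Toy embeddings (made up for illustration):
cat_img,   cat_txt   = np.array([1.0, 0.0, 0.0]), np.array([0.95, 0.05, 0.0])
truck_img, truck_txt = np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.95, 0.05])
tiger_img, tiger_txt = np.array([0.9, 0.1, 0.0]), np.array([0.85, 0.15, 0.0])

easy = info_nce(np.stack([cat_img, truck_img]), np.stack([cat_txt, truck_txt]))
hard = info_nce(np.stack([cat_img, tiger_img]), np.stack([cat_txt, tiger_txt]))
print(easy < hard)  # True: the hard negative yields a larger loss
```

Larger batches raise the odds that such hard negatives appear by chance, which is the lecture's point about why 32,000-sized batches help.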
So increasing the batch size [00:22:52] does help with some amount of fine-grained concepts, but of course it's still limited, and training with a batch size of 32,000 is just too large for most labs to even consider doing. [00:23:06] People have identified this error across many different benchmarks and sort of identified that CLIP just doesn't have this notion of compositionality. So this idea of the mug in the grass versus the grass in the mug: it's really about composing different concepts, like the mug and the grass and the relationship between them, and all of those individual components are not composed well in your CLIP representations. [00:23:31] And there have been a ton of different benchmarks, like Winoground or CREPE or ARO, and a lot of these benchmarks have actually come from my lab.
They just [00:23:40] keep finding, over and over again, that CLIP has a ton of limitations and that there's a ton of things it's just unable to do. Now, of course, in reaction, the community immediately started thinking about how to handcraft batches so that they contain those hard negatives. You know, if I have one type of corgi, I should hopefully have another type of corgi in there, so your model really is forced to learn good representations. [00:24:04] And so this idea of training with hard negatives became really popular in the community, for a whole year, until we released a follow-up paper showing that if you train with hard negatives, you actually end up unlearning a lot of things about semantics.
[00:24:20] For whatever reason, and this is something that we still don't theoretically understand, we actually end up with much worse generalization performance across different environments and different kinds of datasets. So there's still a lot of work to be done in terms of figuring out the right way of constructing your dataset, and the right way of constructing your batches and training signal. We're still really far away from that, but regardless, people are still very excited about CLIP in general, because it does give you some amount of supervision regardless. Of course, image-level captions are again not enough. Ideally, what we want is more than just that, right? We want to be able to identify not just that there's a person crossing the street, but that the person is in this location, the car is here, the street is here.
[00:25:06] All of that information, that grounding information, is completely missing in CLIP. And so ideally, you'd want your dataset to also contain this kind of information, and your model to be able to reason about it as well. The final thing that's a big disadvantage for CLIP is that regardless of how big your dataset is, even if you collect upwards of, let's say, 5 billion images, it's still not going to be enough to capture all the important things that you might care about. And so there's been a lot of effort in data filtering: how do you filter the internet to find the best training data for training these CLIP models? I won't go into that today, but there are all of these mechanisms that people are exploring, and that's now become the frontier of what today's research looks like in this field. Okay.
[00:25:52] So that's the first branch of foundation models we talked about: generalizing classification to a whole host of tasks. Now let's talk about vision-and-language models. There's a new class of foundation models which has become popular in the last two and a half years, and we often refer to them as multimodal language models. I'll start off this discussion by focusing on LLaVA, which is arguably one of the first multimodal language models that became very, very popular. The motivation here is that language models do this next-token prediction, this autocomplete process, and that process is really useful for adapting to a lot of new tasks. So can we start thinking about image models doing the same thing?
[00:26:41] Can we, given an image, also start doing different kinds of reasoning similar to this autoregressive process? That gave rise to this class of models called visual language models, or multimodal models. But of course, just to be historically correct, this idea wasn't completely new in 2022. In 2019, ViLBERT actually introduced this idea: there's a paper called ViLBERT from 2019 that took these image models and language models and put them all together to accomplish generalization across different tasks. But those earlier systems were trained pre-transformers and also mostly used LSTMs instead.
[00:27:23] And so the rebirth of all of this is what's happening right now with LLaVA, where a lot of these models switched over to a better architecture, switched over to a better set of objectives, and now aren't just training on individual tasks, but are training on a foundation of a variety of different tasks using some sort of pre-training objective from the internet. So how does this work? How do you think about LLaVA? To talk about LLaVA, let's take a step back and think about the transformer model, or self-attention in particular. When we think about language models, what they're doing is attending over the past. You have a sequence of words coming in, for example "cats are so", and then your model, to generate the next word, will attend over that historical context and generate what it thinks the next word should be.
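"Attending over the past" can be sketched in a few lines. This is a toy single-head dot-product attention with no scaling, projections, or batching, purely to show the causal restriction that position t only sees positions 0..t:

```python
import math

def causal_attention(queries, keys, values):
    """Single-head attention where position t may only attend to
    positions <= t -- i.e. the model attends over the past."""
    out = []
    for t, q in enumerate(queries):
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s]))
                  for s in range(t + 1)]          # only past positions
        m = max(scores)
        w = [math.exp(s - m) for s in scores]     # stable softmax
        z = sum(w)
        w = [x / z for x in w]
        dim = len(values[0])
        out.append([sum(w[s] * values[s][d] for s in range(len(w)))
                    for d in range(dim)])
    return out

# With identical queries/keys/values, the first position can only see
# itself, so its output equals values[0] exactly.
v = [[1.0, 0.0], [0.0, 1.0]]
out = causal_attention(v, v, v)
assert out[0] == [1.0, 0.0]
```

A real language model adds learned projections, multiple heads, and a vocabulary head on top, but the causal structure is exactly this.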
[00:28:13] So it might think that the phrase should be "cats are so cute." And here's another way of representing that same objective: you've got the input text coming in at the bottom, and your model will generate the next word, which is "cute." So when we think about vision-language models, what people usually are referring to is adding in additional context by grounding the conversation we're having in some image that we care about. We might tokenize our image somehow and feed those tokens into our language model along with the historical context of "cats are so", and then use that to autocomplete the rest of the description. So that's the basic idea behind LLaVA: feed in these image tokens along with the words being generated to continuously generate more words about that image.
[00:29:04] So of course a question comes in, which is: how do you define these tokens? What should these tokens be in the first place? LLaVA's solution was to use the CLIP image encoder. They took the CLIP model, took the image encoder, and basically extracted tokens from that encoder. The first thing you might think about doing is just using the CLS token. So here you've got, let me see if my mouse works here, oh, it does, okay. You've got the image coming in over here, and it gets split into patches. Each patch turns into a representation that's fed into your transformer architecture in CLIP. It goes through a bunch of layers of processing, and then at the end, you get a token for each of the patches along with a representation for the CLS token. And so far, we've only been considering the CLS token.
[00:29:54] We've only been doing things with the CLS token for any sort of classification task, but there are all of these other tokens in there as well. Now, the problem with these other tokens is that they're never supervised, right? The CLS token is supervised with the contrastive objective against text, but the other tokens are never used for any purpose, so they might not actually contain any useful information. And empirically, people have shown that these features are not very useful. But what they have shown is that if you go one more layer back, to the penultimate layer in your CLIP encoder, those features are actually very useful. These features are used to generate the final CLIP embedding in the final layer, and they contain a lot of spatial information about where objects are in your entire image.
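In shape terms, the recipe is: take the per-patch hidden states from the second-to-last transformer block and drop the CLS token. The sketch below uses a stand-in for the ViT forward pass with made-up dimensions, not real CLIP (with Hugging Face's CLIPVisionModel, the analogous move is calling the model with output_hidden_states=True and selecting hidden_states[-2]):

```python
def fake_vit_hidden_states(num_layers=12, num_patches=49, dim=768):
    """Stand-in for a ViT forward pass: one hidden state per layer,
    each of shape (1 + num_patches, dim), CLS token first. Each
    position is filled with its layer index so we can check which
    layer was selected."""
    return [[[float(layer)] * dim for _ in range(1 + num_patches)]
            for layer in range(num_layers)]

def patch_tokens_for_llm(hidden_states):
    """Select the penultimate layer and drop the CLS token, keeping
    only the per-patch features that carry spatial information."""
    penultimate = hidden_states[-2]
    return penultimate[1:]  # drop CLS at index 0

hs = fake_vit_hidden_states()
tokens = patch_tokens_for_llm(hs)
assert len(tokens) == 49       # one token per patch, CLS removed
assert tokens[0][0] == 10.0    # layer 10 = penultimate of layers 0..11
```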
[00:30:43] And so this is what people typically use when combining a CLIP encoder with a transformer-based LLM. Okay, so this is what the entire LLaVA architecture looks like. You feed an image through your pre-trained CLIP encoder and extract a bunch of features from it. You take those features and pass them through a linear layer that you need to train. What this linear layer will learn to do is convert your CLIP representations into something that the LLM can understand and make sense of. Okay, and once you have these tokens, you basically pass all of your tokens to your language model, and it can now generate some conversations about that image itself. So LLaVA was one of the very first popular models out there. And following up, Google quickly released Flamingo.
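The trainable piece described here is small: a linear projection from the vision-feature width into the LLM's embedding width, after which the projected image tokens and the text embeddings form one input sequence. A minimal sketch with made-up toy dimensions (none of the sizes below are the real LLaVA ones):

```python
def linear(x, weight, bias):
    """y = W x + b for a single vector x."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def build_llm_input(patch_feats, text_embeds, weight, bias):
    """Project each vision token into the LLM embedding space, then
    prepend the projected image tokens to the text embeddings."""
    image_tokens = [linear(f, weight, bias) for f in patch_feats]
    return image_tokens + text_embeds

# Toy dims: vision dim 3 -> LLM dim 2, two patches, two text tokens.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
b = [0.0, 0.0]
patches = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
text = [[0.1, 0.1], [0.2, 0.2]]
seq = build_llm_input(patches, text, W, b)
assert len(seq) == 4          # 2 image tokens + 2 text tokens
assert seq[0] == [1.0, 5.0]   # first patch projected through W
```

During training, only W and b (the projector) need gradients at first; the CLIP encoder and the LLM can stay frozen.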
[00:31:36] And Flamingo followed very much the entire LLaVA setup of combining vision encoder features with a large language model. But the place where they innovated is in how you do the fusing of those different features. In LLaVA, you had the features coming in through a linear layer and fed in as part of the input. In Flamingo, what they did instead was basically take all the features coming out of your vision encoder and feed them into every layer of your LLM. Okay? So they had to make some changes to the LLM architecture itself, and this is how they made those changes. Here's an example of what Flamingo's training data looks like. You've got images that are encoded: you've got this dog and you've got this cat. They both get embedded, and they're both going to be fed into every single layer of your LLM.
[00:32:26] And down here, you've got data where every example starts with an image and describes that image, then the next image and a description of the next image, and so on and so forth. They're fed in as input to your LLM, and your output is going to be to autocomplete the text for that last image. Okay, so you've got one image followed by a description of the dog, then a second image, and you start the description, and your model will be trained on autocompleting that description for the second image. So what did they do? What did they change in the model itself? They added this gated cross-attention ("xattn") module to every single layer of your LLM. And they made one other change: they also added this Perceiver Resampler right here, which basically samples and downsamples your image representations.
[00:33:15] So they're smaller in dimension, and there's a fixed number of tokens for every single layer. Let me go into some details of what these look like. This is the full architecture overall. Most of the components are frozen: all the language model weights are frozen, and all the vision model parts are frozen. The only parts that are trained are these Perceiver Resampler components and the cross-attention layers that are added into every single layer of your LLM. So let's talk about what this cross-attention module looks like; this is me zooming into that cross-attention module.
[00:33:51] So in every single LLM layer, right before the LLM layer, you have this cross-attention component, and its purpose is to look at the image features and decide which parts of them it wants to keep around and which it thinks will be useful for the language model to know about. They designed it as a set of components you've already seen so far. You attend over the image features using a cross-attention layer, and following that cross-attention, they added a tanh nonlinear activation; this is basically deciding which parts of these components to keep around and which parts of the image to forget. Then it goes through a fully connected layer, where it adapts those representations a little bit, and then again a tanh nonlinearity to decide once more which parts it should keep and which parts it shouldn't.
[00:34:38] Once it goes through those two components, each with a residual connection across it, it then goes to your normal language model processing and continues to generate the word it needs to. Okay, so these additional layers are added just as a way for the language model to incorporate and attend over the vision features at every single layer. The actual modification itself, if you're interested in what this looks like in code, is just about two or three lines, where they added this cross-attention layer and then this tanh nonlinearity in between, and that's really about it. So in terms of code, it's a very minimal change, although for the model, it's a gigantic change, because now it can choose what parts of the image to attend to at every single layer of its processing.
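Those "two or three lines" can be sketched at the level of the residual stream. One real detail from the Flamingo paper is that the gates are scalars initialized to 0, so tanh(gate) = 0 and the block starts out as the identity, i.e. the model initially behaves exactly like the frozen LM. Everything else below is a simplified stand-in: the attention and feed-forward sub-layers are reduced to placeholder functions, not the real modules.

```python
import math

def gated_xattn_dense(x, vision_feats, attn, ffw,
                      gate_attn=0.0, gate_ffw=0.0):
    """Flamingo-style gated block: each sub-layer's output is scaled
    by tanh(gate) and added residually. With gates at 0 the block is
    the identity, so training can start from the frozen LM."""
    x = [xi + math.tanh(gate_attn) * ai
         for xi, ai in zip(x, attn(x, vision_feats))]
    x = [xi + math.tanh(gate_ffw) * fi
         for xi, fi in zip(x, ffw(x))]
    return x

# Placeholder sub-layers (the real ones are cross-attention and an MLP).
attn = lambda x, v: [sum(v) for _ in x]
ffw = lambda x: [2.0 * xi for xi in x]

h = [1.0, -1.0]
assert gated_xattn_dense(h, [0.5, 0.5], attn, ffw) == h   # gates closed
out = gated_xattn_dense(h, [0.5, 0.5], attn, ffw, 1.0, 1.0)
assert out != h   # open gates let vision information flow in
```

The gate-at-zero initialization is the design choice that makes the "gigantic change" safe: the pretrained LM's behavior is untouched at step 0, and the gates open gradually during training.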
[00:35:25] So you give the model a lot of ability to decide when and how to attend over the vision features. Okay, so Flamingo was very, very exciting, but training it was very difficult, and they had this really ingenious way of training it that allowed their models to adapt to many different tasks. The way they trained it was through this concatenation of a bunch of different images together. You didn't just have one image and one description. You had a description at the beginning that says "here are some cute pictures of my pets", end of sentence, beginning of image, then a description of that first image, end of that first component, then the second image, and a description of the second image. Okay, so you had the training set up so that it looks like a long sequence of image-text, image-text interleaved data.
[00:36:19] And of course, when describing any single image, you don't want the model to look at the entire context; you want it to look only at that one particular image. And so they created a masking scheme where, when generating, the model only looks at that particular image's features and not the other ones. Meaning that when you're generating the description "my puppy is sitting in the grass," you're only looking at the features that correspond to the puppy while generating those words. Similarly, when generating the description for the cat, you're only looking at the cat image and not the other image. So there is this distinction where they created this handcrafted masking scheme to make sure your descriptions are always following and looking at only that particular image.
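The masking rule can be stated precisely: each text token may cross-attend only to the most recent image that precedes it in the interleaved sequence. A sketch, assuming a toy list representation where images are marked with an "<img>" token (this is not Flamingo's actual data format):

```python
def xattn_image_mask(sequence):
    """For an interleaved sequence like ["<img>", "my", "puppy",
    "<img>", "my", "cat"], return for each text position the index
    (in order of appearance) of the only image it may cross-attend
    to, or None if no image precedes it."""
    mask = {}
    current = None      # most recent image seen so far
    img_count = 0
    for pos, tok in enumerate(sequence):
        if tok == "<img>":
            current = img_count
            img_count += 1
        else:
            mask[pos] = current
    return mask

seq = ["<img>", "my", "puppy", "<img>", "my", "cat"]
m = xattn_image_mask(seq)
assert m[1] == 0 and m[2] == 0   # "my puppy" sees only image 0
assert m[4] == 1 and m[5] == 1   # "my cat" sees only image 1
```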
[00:37:04] But when trained, the model does get to see the entire context of everything that it's generating. So why is that helpful? Why is this entire process of being able to see all of this stuff together helpful? Well, it's helpful because it allows you to do these kinds of applications. Here are three different applications that Flamingo was able to showcase, and they all center around having multi-turn conversations or dealing with multiple images. In the first case, you've got an image that's fed in, and the Flamingo model describes the image by saying that this is a picture of two teddy bears on the moon. And then what it allows people to do is ask another question. So people can ask, "What are they doing?" And because it's trained using an existing large language model, that large language model's reasoning capabilities are inherited.
[00:37:53] And now it can reason and answer this particular question; it can answer and say the teddy bears are having a conversation. Then a user might ask, "What objects are they using?" And again, Flamingo can say that it looks like a computer, and so on and so forth. So you can enable this multi-turn dialogue about an image simply by doing two things: first, pre-training the language model and then incorporating that language model into Flamingo; and second, allowing your model to see many different images and many different turns throughout its training data, so it can adapt to longer sequences of text. You can also give it multiple images and ask what is common across these images, and now the Flamingo model will look at each of those different components, reason, and say that they're all flamingos.
So you can start doing a lot of these [00:38:38] kinds of really cool applications. People also showed that you can start doing in-context learning. I don't know if this is something you've seen already with language models, but I'm sure you've used in-context learning with GPT, where you tell GPT: here's an example of what I want, give me more things like this. You can do the same thing with Flamingo. You pass in an image and a description, another image and a description, and now, when you pass in a new image, it'll give you a description. Or you can pass in an image with a question and answer, another image with a question and answer, and then, when you pass in a new image and just ask the question, it'll give you the answer. Right? So you're not training it to do these different kinds of tasks; you're providing it with examples of the behavior it should have, and it just generalizes to new kinds of behaviors that you might care about. Similarly, you might care about classification, and you can use Flamingo to do classification as well: you can give it an image and say "this is underground," another and say "this is congress," and then ask, "what is this?" Right? You can even teach it to do OCR and math, where you give it an image and say, "this should correspond to 2 + 1 = 3," and eventually, when you give it a new image, it can autocomplete, extract out 3 * 6, and also give you the output by reasoning through this entire process. Yeah.
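The in-context pattern just described, alternating image/text example pairs followed by a query image, amounts to assembling an interleaved prompt. The `<image>` placeholder and the helper below are illustrative assumptions, not Flamingo's actual tokenization:

```python
# Sketch of assembling a Flamingo-style few-shot prompt.
# "<image>" is a stand-in placeholder token; the real model uses its own
# special tokens and an interleaved image-text tokenizer.

def build_fewshot_prompt(examples, query_image_id):
    """Interleave (image, caption) example pairs, then append the query image.

    `examples` is a list of (image_id, caption) pairs; image_id stands in
    for the actual image tensor fed to the vision encoder.
    """
    parts = []
    images = []
    for image_id, caption in examples:
        images.append(image_id)
        parts.append(f"<image> {caption}")
    images.append(query_image_id)
    parts.append("<image>")  # the model completes the description here
    return images, "\n".join(parts)

images, prompt = build_fewshot_prompt(
    [("img_0", "A photo of a flamingo."),
     ("img_1", "A photo of a pelican.")],
    "img_2",
)
```

Swapping the captions for question/answer pairs gives the VQA variant of the same trick.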
So this would be an example of few-shot learning, where you [00:39:59] give it a few examples of things and then ask what the new thing should be. If I were to throw away all the in-context examples, that would be zero-shot learning. So, no, we're not concatenating the images; technically, we pass the image tokens through the Perceiver Resampler into every single layer of the LLM instead. Only the text is ever concatenated and fed as input to the Flamingo model, and it chooses when to attend to which parts of the image. You give the image to it once. But behind the scenes, of course, this is just the web interface; what they actually do behind the scenes is cache the model, assuming the user will want to continue talking, so the model is cached and ready to accept more tokens. But if they did not cache it, then yes, it would pass in the entire conversation as input. Yeah. Okay. So Flamingo was super cool; they have these really big tables in their paper that you can go check out. What was really cool about it is that there were all of these tasks that were very difficult, where you had to adapt CLIP to do them, but Flamingo was just able to do them zero-shot or few-shot. You started seeing these gigantic improvements across many different benchmarks, and this is when I think the field shifted from reporting on a few classification benchmarks to reporting on any sort of understanding task at all. As long as you can frame it as a question-answering process, you can build benchmarks for a wide variety of skills, and we started seeing that become the norm over the last two years in the computer vision field. Okay.
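The mechanism described above, image tokens entering every LLM layer through cross-attention rather than concatenation, can be sketched in a few lines. The tanh gate initialized at zero follows the Flamingo paper's gated cross-attention idea, so the layer starts out as a no-op that preserves the pretrained LM; all weights here are random stand-ins, not real model parameters:

```python
import numpy as np

# Minimal single-head cross-attention: text tokens (queries) attend over
# resampled visual tokens (keys/values), as inserted into each LM layer.
rng = np.random.default_rng(0)
d = 16                    # hidden size
T, V = 5, 4               # text tokens, resampled visual tokens

text = rng.standard_normal((T, d))    # queries come from the text stream
visual = rng.standard_normal((V, d))  # keys/values come from image tokens

Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

q, k, v = text @ Wq, visual @ Wk, visual @ Wv
attn = softmax(q @ k.T / np.sqrt(d))  # (T, V): each text token over visuals
gate = np.tanh(0.0)                   # gate starts at 0 => no-op at init
out = text + gate * (attn @ v)        # residual; LM behavior preserved at first
```

Because the gate starts at zero, training can gradually blend visual information in without destroying the pretrained language model.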
So this is where we were, I [00:41:34] think, sometime last year. Seeing the success of LLaVA, a lot of companies started investing quite heavily in these models, and so you started seeing a lot of API models: GPT-4o, GPT-4V, Gemini 1.5 Pro, Gemini 1.5 Flash. A lot of these models started being released, and even Anthropic came into the picture with Claude 3 Opus, and now, of course, Claude 4 Opus is out. So you had a lot of these models come out, and they were performing a lot better on a bunch of these benchmarks. Here I'm showing you the average performance across 11 of the more popular visual understanding benchmarks in the field, and there's this gigantic difference, right? LLaVA, the open-source model we talked about, is down here at about 43% accuracy on average, while GPT and all of these other models are performing much, much better, somewhere in the high 70s or 80s. So, a big difference in performance between these two kinds of models. Of course, immediately on seeing this sort of difference, people started distilling GPT and Gemini into distilled variants and trying to release those models. So Alibaba, which is a company in China, released this model called Qwen.
And then there's [00:42:53] InternVL, there's Phi, there's all of these different models that started coming out, and all of them were distilled from GPT, or if not GPT, then Gemini. Now, that led to a big problem in the field, a problem that's become a big part of what my own research agenda has been trying to focus on, which is that we don't actually know, as a research community, how to build really performant vision-language models. The tricks behind how to build them: only the people at OpenAI and on the Gemini teams at Google know how to build these kinds of models. But the open-source community, they're down here. This is where the research community was as of last year. Of course, you can argue these are really nice open models, but they're not really open, because they're distilled. We don't actually know how to reproduce these models, right? We can only produce them if GPT exists; if GPT doesn't exist, we don't know how to create these other models. And so what my own research agenda has been focused on over the last couple of years is figuring out how to close this gap: how do you build really good multimodal language models and disseminate that understanding to the entire community? And so what we've done over the last six months, or about a year now, is create our own class of models that we call Molmo, and I'm showing you Molmo's performance up at the top. What sets Molmo apart from all the other models out there is that it's completely open source, meaning it's open weights, so you can download the model, and it's open data, meaning you can download the training set as well as the evaluation set.
It's also open code, meaning you [00:44:33] can basically train your own Molmo in your own home, assuming you have enough GPUs. You can also add on new evaluations, adapt this model for all kinds of new things, and of course start using it in a wide variety of different contexts. Now, of course, academic benchmarks are not enough, right? Because what we care about at the end of the day is: are people going to use these models? Will people want to use these models over GPT? And so, to make sure we had that evaluation properly done, we released a playground with Molmo and did a gigantic user study where we compared, head to head, outputs from our models versus outputs from all the other models. And our model has essentially the same Elo rating as GPT: it comes in second, with a difference of one Elo point versus GPT-4o.
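For reference, here is a minimal sketch of how Elo ratings are updated from pairwise preferences like these. The K-factor and the 400 scale are the conventional chess values; the study's exact scoring setup may differ:

```python
# Elo-style rating update from a single pairwise comparison.

def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    """Return updated ratings after one head-to-head comparison."""
    e_a = expected_score(r_a, r_b)
    score = 1.0 if a_won else 0.0
    delta = k * (score - e_a)
    return r_a + delta, r_b - delta

# Two models start equal; A wins one comparison.
ra, rb = elo_update(1000.0, 1000.0, a_won=True)
```

Running hundreds of thousands of such updates over shuffled comparisons is what produces the leaderboard-style rankings mentioned above.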
This is that [00:45:23] same graph rotated so that I can show you some examples. This was a gigantic evaluation, by the way: about 870 users that we showed these model outputs to, and about 325,000 pairwise comparisons where we asked people which model's output they prefer. Our Molmo model ranked, again like I said, second, more or less a coin flip between GPT and our model in terms of what people prefer. But it already beat out Gemini 1.5 Pro and Claude 3.5. Now, the big difference is that we are a small research lab, and we're beating out Google's billions of dollars of investment into Gemini, as well as Anthropic's billions of dollars of investment, while already matching GPT. So we were quite excited by this entire process. But we also developed a 7-billion-parameter model that comes in right after those big models.
And that 7-billion [00:46:13] model is really exciting, because you can put it on a single GPU. So you can now have a model capable of doing a wide variety of vision tasks that works on a single GPU, meaning a lot of people can use it and fine-tune it for all kinds of things. We released this model on September 25th, and the community was very excited by it. This was the first time a very performant open multimodal vision-language model was released, and a ton of people started talking and writing articles about all the ways they want to use it. One of the use cases that kept popping up over and over again was the idea of finally using Molmo for robotics applications. I won't talk about robotics today, because you're going to learn about it in the next class, but I do want to give you some examples of things people were excited about with robotics.
A ton of people, [00:47:00] even folks at NVIDIA, started chatting about how you should never bet against open source: regardless of how much model development you do in private, eventually the open-source community will catch up, and we were catching up at that point. And so, seeing our model out, Meta quickly released their Llama 3.2 model in response, and a lot of people did evaluations comparing Molmo versus Meta's Llama model, and again, I'm very happy that we came out on top of Llama as well. So let me show you why Molmo does so well. The trick to getting these models to work very well was to ground the model's decision-making in the pixels themselves. Usually, when you give a model a question like "count how many boats there are," it'll give you some number, and oftentimes it hallucinates. But what sets our model apart is that it actually points to all the things that it's counting. So it generates points on all the boats and then outputs a final number; its decision-making is grounded in the pixels themselves. And this allowed us to train a model that, unlike Meta's Llama, which was trained on about 6 billion image-text pairs, was trained on only 700,000 image-text pairs. The big difference was that we hand-curated those 700,000 image-text pairs, and that was the biggest difference between what we were able to do and what the models these companies were building were doing. So a lot of folks are currently trying to download these image-text pairs from the internet, right? That's been the foundation of how a lot of people train these vision-language models.
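The point-then-count behavior described above could be post-processed like this; the `(x, y)` output format shown is a made-up illustration for the sketch, not Molmo's actual serialization:

```python
import re

# Parse 2-D points out of a model response and count them, so the final
# number is grounded in one point per detected object.

def count_from_points(model_output):
    """Extract '(x, y)' points from a response string and count them."""
    points = [(float(x), float(y))
              for x, y in re.findall(r"\(([\d.]+),\s*([\d.]+)\)", model_output)]
    return points, len(points)

# Hypothetical response for "count how many boats there are":
response = "boats: (12.5, 40.0) (88.0, 41.2) (150.3, 39.7) -> 3"
points, n = count_from_points(response)
```

Checking that the emitted count matches the number of emitted points is one cheap way to catch the hallucination failure mode the lecture mentions.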
You collect a lot of internet [00:48:40] data of images with their associated text. But the problem with internet data is that it's incidental. The text that's often associated with an image describes something subjective, something the uploader felt about the image; it rarely actually talks about the contents of the image itself. Meanwhile, this is what our data looks like: for a single image, we have a dense description of the actual contents of that image, and we have things that people never talk about on the internet. There's a ton of task knowledge about the visual world that we just never speak about. I will never tell you that something is to the left of something else, just because it's unnatural for us to do that. It's so obvious that something is to the left of something; why would you ever communicate that information? So that's the kind of information we started eliciting from people. We started getting people to talk about how things have a particular size, like large, or a shape, like rectangular. We talked about material, like polished and rich, and about positioning across the image, like something spanning the horizontal plane of the image. All of this information is really what makes these models more performant. Here's another example from the dataset: a very simple image of a phone or tablet screen, and we have information here that, again, is completely missing from the internet, things that people would find helpful, like: this is a tablet device, the time is this, the amount of power left in your device is this. This is the kind of information that would help people use these models.
But this is, again, the kind of [00:50:07] information we never talk about on the internet. And so, to get this kind of information, we designed a lot of different questions. We spent two years doing different kinds of elicitation studies to figure out what the right pieces of information missing from the internet are, and how to elicit them as effectively as possible. One thing that was very important is that we had all of our annotators not type descriptions but speak them. Talking automatically breaks a lot of the conventions around Gricean maxims, and so, by getting people to talk, we got them to say things they would never usually type. The model itself didn't look any different from LLaVA. We had the same setup: a CLIP encoding coming in, a connector that was just a linear layer, and then a large language model that would take in all of these tokens and output whatever you care about. So the model itself looked very similar to existing models; the biggest difference was in the data, and in the quality and density of that data. And because of this grounding capability, where the model grounds its decision-making in the image itself, you can get Molmo to do things that you can't use any of the other models to do. Things like: point to the menu, and it actually tells you where that menu item is. Or you can tell it to point to where you can set your search options, and it'll show you, okay, this is where you might want to set those options. Or point to where the mid-size datasets are, and it'll tell you which option you need to move.
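The architecture just described, a vision encoder, a linear connector, and an LLM that consumes the concatenated tokens, reduces to a few matrix operations. Dimensions and weights below are illustrative stand-ins, not the real model's:

```python
import numpy as np

# LLaVA/Molmo-style layout: frozen vision-encoder patch features, a linear
# connector projecting them into the LM's embedding space, and the LM
# consuming [image tokens ; text tokens].
rng = np.random.default_rng(0)
n_patches, d_vision, d_model, n_text = 9, 32, 64, 6

patch_feats = rng.standard_normal((n_patches, d_vision))  # CLIP-style features
W_connector = rng.standard_normal((d_vision, d_model)) / np.sqrt(d_vision)
text_embeds = rng.standard_normal((n_text, d_model))      # tokenized question

image_tokens = patch_feats @ W_connector                  # the linear connector
lm_input = np.concatenate([image_tokens, text_embeds], axis=0)
```

The point of the sketch is how little architecture there is: the lecture's claim is that the data, not the connector, is what changed.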
Um, and I already showed you that you [00:51:37] Um, and I already showed you that you can point to count, but you can also [00:51:39] can point to count, but you can also point to do really fine grain things [00:51:41] point to do really fine grain things like being able to sort of ask what is [00:51:43] like being able to sort of ask what is the route number on this bus. MoMA [00:51:46] the route number on this bus. MoMA doesn't just simply give you an answer. [00:51:47] doesn't just simply give you an answer. It actually points to where in the [00:51:49] It actually points to where in the image, in this case, there is this area [00:51:51] image, in this case, there is this area right here that contains the bus number [00:51:53] right here that contains the bus number and then returns the bus number to you. [00:51:55] and then returns the bus number to you. Uh, you can ask it to reason about how [00:51:57] Uh, you can ask it to reason about how many cars on the left versus how many [00:51:59] many cars on the left versus how many cars are on the right. You can ask it to [00:52:01] cars are on the right. You can ask it to reason over depth images or overhead [00:52:03] reason over depth images or overhead images or even really crowded scenes and [00:52:06] images or even really crowded scenes and sports uh areas. What's also really [00:52:09] sports uh areas. What's also really exciting and again we'll talk about this [00:52:10] exciting and again we'll talk about this in a few minutes is this idea of [00:52:12] in a few minutes is this idea of chaining that keeps coming up all across [00:52:15] chaining that keeps coming up all across multimodal models today. The idea of [00:52:17] multimodal models today. The idea of chaining Momo to other models. Uh what [00:52:20] chaining Momo to other models. 
[00:52:20] What you can do is chain the output of Molmo to become the input of another model like SAM 2. And so you can tell Molmo to point to the cricket bat. And now you take that point, you feed it to a model like SAM 2, which does segmentation, and now you can do segmentation of that cricket bat across time. And so you can start enabling all kinds of new applications. Here's one that we played around with in the office, which again you're hopefully going to learn about in the next lecture, when you hear about robotics. We asked Molmo to point to where the water bottle is, and then we moved the robot using simple motion planners to that water bottle. Next we ask it to go move that water bottle to where the dirty dishes are. It points to the sink, and then moves the robot there.
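The point-then-segment chaining described here can be sketched as a tiny pipeline. Both model calls below are stand-ins (a lookup table instead of real Molmo and SAM 2 inference) so the sketch is self-contained; the point is only the wiring, where one model's output becomes the next model's prompt:

```python
def point_with_vlm(image, query):
    """Stand-in for a pointing VLM like Molmo: returns an (x, y) image
    coordinate for the queried object. Here it's just a dict lookup."""
    return image["points"][query]

def segment_at_point(image, point):
    """Stand-in for a promptable segmenter like SAM 2, which accepts a
    point prompt and returns a mask (here, whatever lives at that point)."""
    return image["objects"][point]

def chain(image, query):
    """Chaining: the VLM's point output becomes the segmenter's prompt."""
    return segment_at_point(image, point_with_vlm(image, query))
```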
[00:53:07] And then we tell it to go point to where the free space is in the sink and put that bottle in that location. So again, you can combine all these capabilities together now and chain them to even automate a lot of robotics applications. This has been a lot of the focus in my group now: adapting a lot of these vision-language models and enabling a lot of generalization in the actual physical domain. So the question is around whether these models would be able to point if you were always changing the resolution of the image to a fixed resolution, right? Well, it turns out that you can actually adapt these models to any resolution nowadays. There are mechanisms like FlexiViT that have introduced a way of allowing variable-size image input, and you can adapt the models to point in that new space instead.
[00:53:58] So your model's position embeddings basically change depending on how big your image size is, and the models typically tend to generalize well. So that was the conversation around adding vision and multimodal models together. In the last 20 minutes that we have left, I want to talk about generalizing these foundation models to not just deal with image classification and text, but to be able to generalize to any sort of output space you might care about. And one of those models that's become really popular in this space is this segment anything model. The Segment Anything Model, or SAM for short: what it tries to do is build a segmentation model that's a foundation model for all kinds of segmentation tasks.
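The position-embedding adaptation mentioned just above (resizing a learned grid of embeddings to match a new image resolution, in the spirit of FlexiViT) can be sketched as a plain bilinear resize. This is a minimal stand-in, not FlexiViT's actual resizing procedure:

```python
import numpy as np

def resize_pos_embed(pos, new_h, new_w):
    """Bilinearly resize a (H, W, D) positional-embedding grid to
    (new_h, new_w, D), so a ViT trained at one resolution can be
    run at another."""
    h, w, d = pos.shape
    ys = np.linspace(0, h - 1, new_h)
    xs = np.linspace(0, w - 1, new_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]    # fractional weights along height
    wx = (xs - x0)[None, :, None]    # fractional weights along width
    top = pos[y0][:, x0] * (1 - wx) + pos[y0][:, x1] * wx
    bot = pos[y1][:, x0] * (1 - wx) + pos[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```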
[00:54:41] So really what they're trying to do is allow anybody to point to things that they care about in the image, and then hopefully have that thing be something that the model can output a mask for. So for example, you want a model that generalizes beyond just a fixed number of categories to any sort of category you might care about, and you would ideally want these outputs to be masks for any category that is of interest to the user. Right? So those are the two goals: we want to generalize to a huge number of categories, and we ideally want to be able to very specifically output something that the user really cares about. So there are challenges on both fronts.
[00:55:20] There are challenges both in figuring out how do you collect a large amount of data that spans a wide variety of categories, and in how do you design an architecture that really pinpoints what the user cares about. Now, let's start with the second question first. It's really ambiguous when we want a mask for something. So imagine a scenario where you have two cats in an image, and a user comes in and says, hey, I want a segmentation for the cat. But it's really not clear which cat they want a segmentation for. Ideally, again, if you had Molmo's pointing capability, you could actually point to which cat you care about, and then, depending on the point, you could create the masks that matter. Now, of course, these are not very good masks.
[00:56:02] And ideally, you want these masks to be of very high quality, so that they can support a wide variety of downstream applications like image editing or anything else you might think of. So to build this architecture that allows any user to specify exactly what they care about, we needed to go beyond just simply typing in text what you care about. Right? So the SAM architecture has two components, or three specifically. It's got the image encoder, which again could be a CLIP encoder, and it's got a prompt encoder, which is something special. This prompt encoder really tries to encode text or points or bounding boxes or any way that a user might want to specify what they care about. And then given these two things, it passes them through this really lightweight decoder that outputs a mask.
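The three components just listed can be wired together like this. The encoders and decoder are passed in as plain functions so the sketch runs; real SAM uses a ViT image encoder, a prompt encoder for points/boxes/text, and a lightweight transformer mask decoder:

```python
import numpy as np

def sam_forward(image, prompt, image_encoder, prompt_encoder, mask_decoder):
    """SAM-style forward pass: encode the image once, encode the user's
    prompt, and let a lightweight decoder combine them into a mask."""
    img_emb = image_encoder(image)
    prompt_emb = prompt_encoder(prompt)
    return mask_decoder(img_emb, prompt_emb)

# Toy usage: the stand-in decoder just thresholds the image embedding.
image = np.ones((8, 8))
mask = sam_forward(image, (3, 4),
                   image_encoder=lambda im: im,
                   prompt_encoder=lambda p: np.array(p),
                   mask_decoder=lambda e, p: (e > 0).astype(float))
```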
[00:56:51] And the decoder looks very similar to the segmentation decoders that you've already seen in this course. So overall, this is what the model looks like. Given an image, we encode that image using an image encoder, and then you have a bunch of different prompts going through and interacting with this image encoding through a decoder, and you output a mask. Right, so this is the overall architecture design. Now there's one big thing that is a problem with segmentation. Let's say a user does point at this particular location and says, hey, I want a segmentation mask for this location. The problem is that it's still ambiguous: even with a point, it's still not sufficient, because that point might be referring to the entire pair of scissors.
[00:57:37] It might only be referring to the parts that you can hold, or it can be referring to one of the parts that you can hold. So this ambiguity is really difficult to resolve, and you don't want to penalize the model for picking the wrong one. So what the SAM architecture does is, instead of outputting one segmentation mask, it actually outputs three segmentation masks at different levels of granularity, and then it picks the one that is the closest match to ground truth and uses that to calculate the loss, thereby not penalizing the other ones. So the hope is that over time this model will learn to output all different kinds of masks, and then the user gets to choose which one is the most appropriate for their use case. Okay. And if you put all of this together, the only thing you need now is data.
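The pick-the-closest-mask trick described above can be sketched as follows. IoU is used as the matching score here for concreteness; only the best-matching mask's loss is returned, so the other two predictions go unpenalized:

```python
import numpy as np

def multimask_loss(pred_masks, gt):
    """SAM-style ambiguity handling, as a sketch: score each of the
    predicted masks against ground truth with IoU, and return only the
    loss (1 - IoU) of the best match, plus which mask it was.
    pred_masks: (3, H, W) probability maps; gt: (H, W) boolean."""
    losses = []
    for m in pred_masks:
        b = m > 0.5                         # threshold probabilities
        inter = np.logical_and(b, gt).sum()
        union = np.logical_or(b, gt).sum()
        iou = inter / union if union else 1.0
        losses.append(1.0 - iou)
    best = int(np.argmin(losses))
    return losses[best], best
```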
[00:58:24] You need a lot of data across many different categories to really make this model possible. Now, the problem with data is that until this model came out in 2023, which was about a year and a half, maybe two years ago, most of the segmentation data sets were extremely small. And what the authors of this paper did is grow the amount of segmentation data that was out there: the number of images by about 6x and the number of segmentation masks by about 400x. So they significantly grew and collected a lot of masks to make this model as performant as possible. So again, the message is very similar to the message we had with Flamingo and the message we had with Molmo. And the message is: you need really good, high-quality data to get these models to be as performant as possible.
[00:59:09] And for a lot of vision tasks, the data is completely missing from the internet, and you need to go out and find and collect that data to get these models to work very well. Okay. And so to make this data happen, they created this in-the-loop process where they initially had some amount of data annotated. From that annotation, they created a training data set. They trained a model, and they used that model to annotate more data, and then they iteratively refined the model-generated segments using users and continued this process. So they had this human-in-the-loop, model-in-the-loop process of proposing segments and then fixing the segments using human annotators. This is what an example image looks like from their data set. You have quite a lot of categories; each individual vegetable here is annotated with its own mask. So they're quite expensive to collect.
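The annotate, train, re-annotate loop described above might be sketched like this. Both `annotate_fn` (the human-corrected model proposal step) and `train_fn` are hypothetical stand-ins; the sketch only captures the shape of the data engine, not SAM's actual pipeline:

```python
def data_engine(images, annotate_fn, train_fn, rounds=3):
    """Model-in-the-loop data collection, as a sketch: each round, the
    current model proposes masks that humans correct (annotate_fn), the
    corrected masks grow the training set, and the model is retrained so
    its proposals improve next round. model is None in the first round."""
    dataset, model = [], None
    for _ in range(rounds):
        for img in images:
            dataset.append((img, annotate_fn(model, img)))
        model = train_fn(dataset)
    return model, dataset
```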
[01:00:00] And they did this across a lot of images, millions of images. Here's another example where again all the individual umbrellas are annotated. Here's another one with an underwater scene. And of course paintings as well; they have segmentations of paintings. And so all of this together is really what was foundational to making this foundation model for segmentation. Okay. So that's for segment anything. And I want to use the last couple of minutes that I have left today to really focus on chaining, which is the last part of multimodal language models. The idea behind chaining is something you've already seen; I've given you hints throughout this lecture. And the idea is to combine different models together to enable things that a single model can't do alone. Here's a fun little exercise we can do as a class.
[01:00:53] So, I'm giving you four images and I'm also giving you four categories, right? And these are potentially categories that some of you have never seen before. And they're also categories that CLIP hasn't seen. And so CLIP actually fails at these categories, because it doesn't have any idea which one is associated with what. Does anyone here know which one is what? Marimo. Yeah, the second one is the marimo. That's right. Yeah. There's one that's a little easy. The viaduct, right? Yeah, I think a lot of you know which one that is. But yeah, which one's the dog and which one's the bird? What if I gave you these instead? Now I'm giving you descriptions of these things, and it suddenly becomes very easy for you to associate each one with the right category, right?
[01:01:36] And that's the basic idea behind chaining: even if CLIP has never seen these images, chances are these concepts have been talked about on the internet to some degree, and it's likely that GPT might be able to describe them. And if GPT can create those descriptions, those descriptions become really good ways of classifying exactly which category is which. And that's the idea behind chaining: you take the strengths of one model and you combine them with the capabilities of another, and suddenly you get all kinds of new capabilities that you didn't have before. And so you can take a ton of categories that have no training data in CLIP, but if you can describe all of them, then because CLIP has seen a lot of descriptions for things, it can now start classifying all of them very well.
[01:02:19] And you can start getting CLIP to generate classifications for individual flowers or individual cars or individual spaces or even different kinds of pets. And you start seeing these improvements on a bunch of different data sets that are about more fine-grained, specialized categories. And the only way it's able to do that is because GPT has ingested some ability to describe those things. And this idea of being able to generalize to new capabilities is something that was extremely popular last year, and it still remains very popular this year, through this idea of chaining for any sort of question at all.
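The describe-then-classify chaining just discussed can be sketched as follows. The `embed_text` function below stands in for a CLIP-style text encoder, and the per-class descriptions stand in for GPT's output; the image is assigned to the class whose descriptions match it best on average:

```python
import numpy as np

def classify_by_description(image_emb, class_descriptions, embed_text):
    """Chaining sketch: GPT-written descriptions per class are embedded
    with a CLIP-style text encoder (embed_text, assumed), and the image
    goes to the class with the highest mean cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {
        name: np.mean([cos(image_emb, embed_text(d)) for d in descs])
        for name, descs in class_descriptions.items()
    }
    return max(scores, key=scores.get)
```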
[01:02:59] So for example, if I asked you, are there three people in the boat, the way you might want to do this is by again asking a multimodal language model to answer this question. Or what you could do is use all of the hundreds of specialized vision models that we've been developing over the last few decades. So there are object detection models that you learned about in class. If you use an object detector, you'd be able to get a detection for each of the three people, and then you could just say, oh, there are three people, because there are three detections, right? So that's the general idea: you can chain other models' outputs together so that you can do new kinds of capabilities. Here's another example. If I ask you how many total people there are across these two boats, is it six? And again, you can do the same thing.
[01:03:44] You can write a program that does object detection on image one and then object detection on image two, and then adds up all of those components together. Right? So this is the basic idea behind what we now call chaining, and this was popularized by a paper that won the best paper award last year, called VisProg. And in this VisProg paper, the visual programming paper, the idea was that you take any image or any sort of question and you generate a program. You generate a program that says: answer something about image one, answer something about image two, and then combine those answers together to give you the final answer. Right? So you write a function in Python, and then in that Python function you have individual calls to other models that we've already seen in training. Right?
[01:04:32] So for example, when asking this particular question, deciding if the statement "the left and right image contains a total of six people and two boats" is true or not, you can ask GPT to actually create a program that tries to answer this question, and then you can take the answer from its program. Okay. And you can also give GPT in-context examples, where you show it examples of programs that it can generate using other functions. And we can see that it generalizes to new sorts of questions and starts using all of the functionality it has available. Of course, the one thing you need to do is give it the functions themselves. So you need to tell it: hey, you have these capabilities from other models that you can use. You can localize things using an object detector. You can localize faces using a face detector.
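A program of the sort just described, for the six-people-two-boats question, might look like this. The detector is a stub (each "image" is just a dict of category to boxes) so the sketch is self-contained; a VisProg-style system would generate the `program` function and call a real detector:

```python
def detect(image, category):
    """Stand-in for an object detector: returns the detections (boxes)
    for one category. Here each 'image' is a dict of category -> boxes."""
    return image.get(category, [])

def program(left, right):
    """The kind of Python program a visual-programming system might
    generate for: 'the left and right image contain a total of six
    people and two boats' — count detections in each image and compare."""
    people = len(detect(left, "person")) + len(detect(right, "person"))
    boats = len(detect(left, "boat")) + len(detect(right, "boat"))
    return people == 6 and boats == 2
```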
[01:05:21] And you can have all of these different capabilities across many different other models that people have created, and you can chain them together to do different kinds of tasks. Yeah. So there's two different ways of doing it. One is a static way, where you want to give it examples that are as diverse as possible and then hope that it generalizes. The other is to dynamically choose: given this question, what are the best in-context examples I should use? And so you can treat that as another retrieval process, where you retrieve the best examples and then ask it to generate a program, and that tends to perform a lot better, but only if you have a good retrieval system. Yes, it would require a lot of compute.
[01:06:02] There is compute in terms of calling GPT, which you have to do through an API, and then you have to load each of these individual models into memory and run them sequentially. So it can actually be a lot more costly. What people are trying to do is figure out whether we can distill these capabilities into a single model, and that's a big part of what research looks like today in 2025. But of course, people are also still trying to figure out how to chain these things together effectively. Very well. Yeah, you can think of it as an agent. You have an agent that is basically deciding: given this question, which other models do I need help from, and how do I stitch them together to enable new kinds of capabilities? So that's what it looks like. Yeah.
[01:06:42] Here's another example, where you might want to do image editing. You might care about replacing the sand, the desert, with lush green grass. Of course, image editing models are still in their infancy. So what you might want to do instead is call a segmentation model, identify the desert, replace only the desert pixels with grass, and then composite them together to make a new image. Okay, that's all of the things I wanted to talk about in terms of different capabilities. So you've seen some ways to think about foundation models. At the end of the day, it really is the ability to train a model for a single task and then generalize from that single task to many different downstream applications.
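The segment-and-composite editing described above reduces to a per-pixel blend: keep the original everywhere except where the mask fires, and paste replacement pixels there. A toy sketch with 1-D "images" standing in for real H×W×3 arrays:

```python
# composite = mask * replacement + (1 - mask) * original, per pixel
original = [10, 10, 80, 80, 10]   # pretend the 80s are desert-colored pixels
grass    = [30, 30, 30, 30, 30]   # pretend output of a generative model
mask     = [0, 0, 1, 1, 0]        # 1 where a segmentation model found desert

edited = [g if m else o for o, g, m in zip(original, grass, mask)]
print(edited)   # → [10, 10, 30, 30, 10]
```

Only the masked region changes, which is exactly why chaining a segmenter with a generator is more controllable than asking one editing model to redraw the whole image.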
[01:07:28] And we talked about, in classification, how you can create these models by taking a lot of image-text pairs from the internet and training on them to do different kinds of tasks. That allows you to generalize to new kinds of datasets that might not even exist in the real world, or that you might not have any labels for. You can also combine them with language models and train them to handle in-context examples, like captioning or counting or OCR. These are, again, capabilities that enable many different applications. And of course, the outputs don't always have to be language or categories.
[01:08:00] There can also be segmentation masks, where you produce different kinds of masks depending on different user inputs. And you can generalize this even further by combining many of these foundation models, or even smaller models, together through programs, and do all kinds of new things. So, hallucinations still happen all across the board. What we're showing is that pointing does seem to reduce hallucinations quite a bit, because the model needs to find some evidence for its generations. But that being said, there's no guarantee that it's going to point to the right thing at all. There are many different ways of fixing this. One, of course, is collecting more data related to the kinds of reasoning you want it to do.
[01:08:41] But a better one is to have verification methods that verify, based on the points, whether the output is something you should trust or not. What a lot of the bigger models and bigger companies typically do, when you use any of their models, is that they don't have a single model generating the output. The output is usually passed through other verifiers before it even gets to the user. That mitigates some of these problems. But it is an active line of inquiry right now: how do you reduce hallucinations and also improve these models' actual accuracy? Yeah. So, repeating your question: is it possible for these models to build new tools when a capability requires a tool it doesn't have? Yes, it can.
[01:09:26] We have a few preliminary experiments in those directions as well, where you can tell a model: here's a capability that I want, and you can build a system that automatically tries to collect training data and builds a tool for specific use cases. But that line of work is still, again, in its infancy. It's one that we're actively working on, and a lot of folks are excited about that direction.

================================================================================
LECTURE 017
================================================================================
Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 17: Robot Learning
Source: https://www.youtube.com/watch?v=XSfmOH_xVSU
---
Transcript

[00:00:05] We're here with our final guest lecture for the course. And today we have Dr. Yunju Lee. He is an assistant professor of computer science at Columbia University, where he leads the Robotic Perception, Interaction and Learning lab. He is also a former instructor of CS231N, like all of our guest lecturers.
[00:00:23] And he taught the course in 2023, while he completed his postdoc here at Stanford with Professors Fei-Fei Li and Jiajun Wu. His research lies at the intersection of robotics, computer vision, and machine learning; specifically, his work focuses on robot learning and aims to significantly expand robots' perception and physical interaction capabilities. In today's lecture, he'll be discussing exactly that topic, robot learning. And I'll now hand it off to Yunju for today's lecture. Yeah, thank you, Z, for the very kind introduction.
[00:00:56] I'm super excited to be here. The last time I was here giving lectures was two years ago, in 2023. Lately I was going through many of the lectures, and today I'm going to talk about some of the things I have been working on. It's also a very coherent piece of the overall picture of deep learning for computer vision, specifically robot learning. So I'll be discussing some of the interesting considerations, especially in enabling robots to better perceive and interact with the physical world, and how some of those considerations might differ from typical computer vision tasks and methods. So first of all, you have already learned a lot about supervised learning. The setup for supervised learning is that you have data X and Y.
[00:01:48] X is the input and Y is the label, and you are trying to learn a mapping from the input X to the output Y. There are examples you have already learned: classification, regression, object detection, etc. And you have also learned about self-supervised learning, where instead of having labels, you just have the data without any labels. What you try to do is come up with learning algorithms that can extract or identify the underlying hidden structure of the data, by designing some auxiliary loss. A typical example is the autoencoder. There are many other examples of doing unsupervised or self-supervised learning on top of this massive amount of unlabeled data.
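The auxiliary-loss idea can be made concrete with the smallest possible autoencoder: there are no labels Y, so the training target is the input itself, and the loss is reconstruction error. This is an illustrative sketch with scalar data and linear weights, not a realistic image autoencoder.

```python
import random

random.seed(0)
data = [random.uniform(-1.0, 1.0) for _ in range(100)]  # unlabeled X, no Y anywhere

w_enc, w_dec = 0.1, 0.1   # 1-parameter "encoder" and "decoder"
lr = 0.05

for epoch in range(200):
    for x in data:
        h = w_enc * x          # latent code
        x_hat = w_dec * h      # reconstruction
        err = x_hat - x        # gradient of 0.5 * (x_hat - x)^2 w.r.t. x_hat
        g_dec = err * h        # chain rule through the decoder
        g_enc = err * w_dec * x  # chain rule through the encoder
        w_dec -= lr * g_dec
        w_enc -= lr * g_enc

# Minimizing reconstruction error drives encoder * decoder toward the identity.
print(round(w_enc * w_dec, 2))   # → 1.0
```

The point of the sketch: the "supervision" was manufactured from the data itself, which is exactly what an auxiliary loss does at scale.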
[00:02:37] And the special, unique thing about robot learning is that robots have to make physical interactions with the world. So it's not just that you have inputs and outputs and a mapping from the input X to Y, or to some kind of latent representation. It's really that you are influencing the evolution of the environment. No matter what action you decide to take in the real world, the world will change as a result of that action, and the world will give you some kind of new observation, or a reward telling you how the environment has been changing and how good you are at executing a certain task.
[00:03:13] So the goal is to come up with a sequence of actions, using feedback from the environment, that maximizes some reward or minimizes some cost. And robot learning, especially in recent years, has attracted significant attention both within academia and within industry. We have seen many startup companies, including for example Physical Intelligence, the Tesla bots, or Figure, producing seemingly very nice and fancy videos of robots doing a wide range of very complicated tasks: folding shirts, manipulating coffee beans, or having humanoids do interesting tasks in the real physical world. So this field, as I mentioned, has attracted a lot of attention and also a lot of investment.
[00:04:08] Here are just some examples of recent startups in the field of robot learning that have been able to attract huge amounts of investment, trying to build general-purpose robots that can make physical interactions with the environment. And obviously, not only those startups; many big, established companies also have their own robotics investigations and initiatives, trying to develop their own general-purpose robots capable of general-purpose, high-performance physical interaction with the environment. So for today's lecture, I'm going to give you an overview of some of the key techniques and enabling factors behind the current success and boom of robot learning. We will start with the problem formulation.
[00:04:54] So how can we more concretely define the problems we have been describing, and how can we formally think about the robot's interactions with the environment? I will then discuss the perception side: the different considerations between how robots perceive the environment and what people typically consider in the computer vision community, and what's special about robot perception. Then I'll talk about reinforcement learning, model learning, model-based planning, and imitation learning, along with some of the recent trends in robotic foundation models, and use the remaining time to discuss some of the challenges that still lie ahead of us. So let's start with problem formulation. This is, in general, what the problem looks like, at least in a graphical illustration. In the middle we have this agent.
[00:05:48] The agent is given some task objective. This task objective could be, for example, language instructions from a human, or some kind of objective function measuring how good this agent is at doing some specific task. The agent takes in states from the physical world, or some kind of environment, and decides what action to take, this a_t here, which is executed in the physical world. The physical world is updated and gives the agent the state s_{t+1} as well as a reward telling the agent how well it is doing its task. So this is what the framework looks like in general. You have to be very clear that this type of formulation, consisting of goal, states, actions, and rewards, which specifically defines robot learning scenarios, is very different from computer vision.
[00:06:45] I would like to say that computer vision is mostly about trying to learn some kind of representation of the environment from high-dimensional input data. But robotics is basically about solving an optimization problem: you have constraints, which are the physics of the environment; you have your objective function, defined over your goal; and you are essentially trying to solve this optimization problem by coming up with a sequence of actions that maximizes or minimizes your objective function. That's a key difference between robot learning and what people typically consider in computer vision.
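The agent-environment loop just formulated can be sketched in code. The environment here is a crudely simplified pole-balancing toy (symplectic-Euler dynamics with made-up constants) and the policy is a hand-written proportional controller, so everything inside it is an illustrative assumption, not a real simulator or a learned policy. What matters is the shape of the loop: the agent picks a_t from s_t, the world returns s_{t+1} and r_t.

```python
import math

class ToyCartPole:
    """Toy pole balancer: +1 reward per step while the pole stays near upright."""
    def __init__(self):
        self.theta, self.theta_dot = 0.05, 0.0   # pole angle (rad), angular velocity

    def step(self, force):
        # Simplified dynamics: gravity tips the pole, the applied force rights it.
        self.theta_dot += 0.02 * (9.8 * math.sin(self.theta) - force)
        self.theta += 0.02 * self.theta_dot
        reward = 1.0 if abs(self.theta) < 0.21 else 0.0   # upright => reward 1
        done = reward == 0.0                               # pole fell => episode over
        return self.theta, reward, done

def policy(theta):
    # Hand-written controller: push in the direction the pole is leaning.
    return 20.0 * theta

env = ToyCartPole()
theta, total_reward, done = env.theta, 0.0, False
for t in range(200):                        # the sequential decision-making loop
    action = policy(theta)                  # agent chooses a_t from s_t
    theta, reward, done = env.step(action)  # world returns s_{t+1} and r_t
    total_reward += reward
    if done:
        break

print(total_reward)   # → 200.0
```

With `force = 0` the pole would tip over and the episode would end early with far less reward, which is the sense in which the agent's actions shape the future states it sees.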
[00:07:24] Some specific instantiations of this problem: for example, in CartPole the goal can be to balance the pole on top of a movable cart. The state of this environment essentially describes the physical status of the system, which can include the pole angle, angular speed, cart position, horizontal velocity, etc. The action could be the horizontal force applied to the cart, and you could have a reward of one indicating, at each time step, that the pole is being kept in an upright position. Some other examples include robot locomotion, where the goal is to make the robot move forward. The state could include the angle, position, and velocity of all joints of the robot; the action could be the torque applied to each joint; and the reward can be one at each time step
that the robot makes a step forward while staying in an upright position. Another interesting example is Atari games. The goal can be to complete the game with as high a score as you can get. The state will be the raw pixel input of the game screen; the action could be the game controls, up, down, left, and right; and the reward could be the score increase or decrease at each time step. And some of the more famous examples you have probably noticed earlier, especially with the development of AlphaGo: the problem of Go can also be defined in a similar way, where the goal is to win the game.
[00:09:04] The state will be all the pieces that are currently on the Go board, and the action could be where to put the next piece down on this board. The reward comes on the last turn: if you win, you get a reward of one, and if you lose, you get a reward of zero. And this does not only apply to gaming domains. Even with the recent development of large language models, you can think about such problems, especially sequential generation problems, in a similar manner. The goal could be to predict the next word; the state could be the current words in the sentence; and the action will be the specific next word you want to put there. If it is correct, you get a reward; if it is incorrect, you get a reward of zero. And similarly, you have probably already played with many of the chatbots quite a lot. And
And we can also define a problem like [00:09:58] we can also define a problem like similarly where the goal is to be a good [00:10:01] similarly where the goal is to be a good companions to the human user. The states [00:10:04] companions to the human user. The states could be the current confi conversation [00:10:07] could be the current confi conversation and action that should be generated by [00:10:09] and action that should be generated by the chatbots will be the next sentence [00:10:12] the chatbots will be the next sentence you are given to the human user and [00:10:15] you are given to the human user and according to the human evaluations we [00:10:17] according to the human evaluations we could define the reward. If the person [00:10:19] could define the reward. If the person is happy like if it if they are [00:10:20] is happy like if it if they are satisfied you get a rewards of one and [00:10:23] satisfied you get a rewards of one and if uh you are not happy or neutral you [00:10:25] if uh you are not happy or neutral you get some other rewards and more [00:10:28] get some other rewards and more specifically for example in the robotics [00:10:30] specifically for example in the robotics domain and the task could be to fold the [00:10:33] domain and the task could be to fold the clothes and one clothes be folded nicely [00:10:36] clothes and one clothes be folded nicely in the states is the current [00:10:38] in the states is the current observations the robot is getting from [00:10:40] observations the robot is getting from this environment which could including [00:10:43] this environment which could including the multiv- view RGB or RGBD [00:10:46] the multiv- view RGB or RGBD observations of the environment And the [00:10:48] observations of the environment And the robots needs to decide its actions like [00:10:50] robots needs to decide its actions like how to move it in factors. Should it [00:10:52] how to move it in factors. 
[00:10:55] Should it close or open its grippers in order to manipulate the clothes? And according to human evaluations, if the clothes are properly folded, you give the robot a reward of one, and if the clothes are not folded, you give a reward of zero. So here is actually how you want to think more concretely about the robot learning problem. It really is a way of allowing the agent to interact with the world that considers the effect of an action, and also a sequential decision-making problem that is different from what people typically consider in computer vision, where we just need to predict the outputs. The goal, states, actions, rewards, and objective functions are the things you need to keep in mind whenever you are thinking about problems along this direction. So this is about problem formulation.
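The goal/state/action/reward formulation described above can be sketched as a small data structure. This is a minimal illustration only; the `Transition` type and the toy episode are invented for this sketch, not taken from the lecture:

```python
from typing import Any, NamedTuple

class Transition(NamedTuple):
    """One step of a sequential decision problem: (state, action, reward, next state)."""
    state: Any       # e.g. board position, current conversation, RGB-D observation
    action: Any      # e.g. next move, next sentence, gripper command
    reward: float    # e.g. 1.0 for a win or a satisfied user, 0.0 otherwise
    next_state: Any

# Toy win/lose game: the reward arrives only on the final turn.
episode = [
    Transition("board_0", "move_a", 0.0, "board_1"),
    Transition("board_1", "move_b", 0.0, "board_2"),
    Transition("board_2", "move_c", 1.0, "terminal"),  # win on the last turn
]
total_reward = sum(t.reward for t in episode)
```

The same four fields apply whether the "state" is a Go board, a conversation so far, or a robot's camera observation; only the contents of each field change.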
[00:11:44] So the question is how specific the reward needs to be. In many tasks the reward can have many different types of specifications. For example, in self-driving the reward could be to go as fast as possible, or the reward could be that we want the passengers to feel comfortable as you are driving along the road. Even for clothes folding, depending on the user's preference, clothes can be folded in many different ways. Some want the total area to be as small as possible; some want them to be as smooth as possible. There could be different types of rewards. Here I'm just talking in generic terms, like whether a person looking at the clothes thinks they are folded or not, but more specifically, in terms of reward design there is actually a lot of nuance in satisfying the specific needs of a specific application. Okay.
[00:12:32] So I'll continue. So this is how we are thinking about these robot learning problems that allow the agents to interact with the physical world. Now I'm moving on to robot perception, especially discussing how the perception problem within this robot learning domain is different from what people typically consider in computer vision. So this image again: you're actually going to see this image again and again through today's lecture. This is essentially the question of how we are handling whatever information you are getting from the physical world. The physical world can give you high-dimensional RGB or RGB-D observations. It could also include some other sensory data, like tactile sensing.
[00:13:18] And this robot perception problem is essentially trying to distill or harness some structured knowledge from that high-dimensional data that is useful for the robot to do the downstream decision-making. Essentially, the question we are trying to tackle is making sense of this unstructured real world, and the real world can be very messy. The observations the robots are getting from the environment may contain only incomplete knowledge of the objects and the environment. There could be occlusions. There could also be errors in the sensory data, and imperfect actions may also lead to failure. For example, the robot can try to grasp some object, but that grasping behavior may not always be successful. Sometimes it will accidentally drop the object, which will also cause unexpected changes to the environment.
[00:14:11] They will also need to have a perception system that is able to handle those scenarios, and also the environment can change. It is dynamic, and consists of not just rigid objects but deformable objects like clothes and other media. There could be other agents, like dogs or kids or other humans, that are also in the same environment messing with the world, and your perception system needs to be able to cope with all those kinds of changes. So that is why, in the robotics domain, people typically do not work with just camera data; they try to add as many sensors as possible to the robots, as long as they can provide some useful information.
[00:14:54] This includes, for example, tactile sensing, audio information, and depth information, and typically we will have to design a system that is able to put all the sensors together so that they can complement each other: audio information might tell you things about physical contacts, tactile information might tell you whether a grasp is stable or not, and camera information tells you something at a higher level, in the grand scheme of things, about the overall state of this environment. So how these sensors can be composed together and work together is very, very important to designing a capable robotic system that works in the real physical world.
[00:15:38] And besides the number of sensory modalities, a very important difference between robot vision and computer vision is really trying to understand the effect of an action and also the affordances of the environment. On the left is a very typical example you have already seen in computer vision, which is instance segmentation: what you are given is a 2D image, and you segment different instances from this 2D image by drawing contours over the 2D pixels. But what's different in the robotics domain? For example, on the right, the robot can be given one object, and this object seems to be maybe just one object, or maybe a lot of pieces that are, for example, stacked into each other. The robot has to know what types of actions will allow it to have a better understanding, better perceptions, of this environment.
[00:16:28] Is this one piece of object, or multiple pieces composed together? The robot should come up with actions, like perturbing and actively interacting with the environment, for the robot to get a better perception of the state of this environment. So that is why robot vision is embodied, active, and also environmentally situated. By embodied, what we mean is that robots have a physical body that is directly experiencing the physical world. Their actions are part of a dynamic with the world that has immediate feedback on their own sensations. And active means the robots are active perceivers: a robot knows why it wishes to sense, chooses what to perceive, and determines how, when, and where to achieve that perception. You can move your head around to know what's behind this table; you can just move around to see what's behind the table.
[00:17:21] So this is the active part, which is different from what people typically consider in computer vision, which mostly works with passively collected datasets. The third point is about being situated. The robots are situated in the world. They do not deal with abstract descriptions, but with the here and now of the world, directly influencing the behavior of the system. Robots really have to understand this, especially in closing the perception and action loop: a robot sees the world, understands its goals, and is able to act in the environment upon its perceptions. Sometimes the robot doesn't have to know the full state of the environment. For example, if I'm buttoning my shirt, I only have to know the local region near that button for me to button the shirt.
[00:18:09] So some of the perception has to be tightly coupled and co-designed with the task and the downstream decision-making systems, so the robot can focus on the relevant regions, or task-relevant regions, of the environment to properly close this perception and action loop. So this is about some very specific considerations and how robot perception might be different from what people typically consider in computer vision. I will start to discuss some of the algorithms that not only allow the robot to see but allow the robot to act in the world, and we will start with reinforcement learning. Remember, earlier we saw this image: the robot has to act upon this environment and get rewards from this environment.
[00:18:55] So one very typical way of trying to solve this optimization problem is to allow the robots to interact with the world as extensively and as massively as possible. You just collect all the experience data and do this type of trial and error, allowing the robots to understand that this action leads to higher reward and that action leads to lower reward, and we can pivot the agent's behaviors towards the actions that give the agent higher rewards. This is the general idea of reinforcement learning: it really is a way to allow the agents to constantly interact with the environment and do this trial and error to maximize the reward or minimize the cost. And here I also want to be a bit more specific in discussing the difference between reinforcement learning and supervised learning. So this is a typical framework of what reinforcement learning looks like.
[00:19:48] You have the environment. The environment gives the agent some state. The agent generates an action, and the environment gives the agent feedback, which is the reward, and the environment will change, giving the agent the new state s_{t+1}. It is essentially a sequence on the temporal domain where the agent has to make sequential decisions. And here is what a typical picture looks like for supervised learning: you have the dataset, the dataset provides the input X to the model, the model generates the prediction Y, and you are able to calculate the loss according to the model's predictions versus the ground truth from the dataset.
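The interaction loop just described (environment emits state s_t, agent emits action a_t, environment returns reward r_t and next state s_{t+1}) can be sketched roughly as follows. The tiny environment and the trivial policy here are invented purely for illustration:

```python
class ToyEnv:
    """Invented stand-in environment: action 1 always pays a reward of 1."""
    def reset(self):
        self.t = 0
        return self.t  # initial state s_0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0  # r_t depends on the action taken
        done = self.t >= 5                    # episode ends after 5 steps
        return self.t, reward, done           # (s_{t+1}, r_t, done)

env = ToyEnv()
state, done, total = env.reset(), False, 0.0
while not done:
    action = 1                                # a_t from a fixed (trivial) policy
    state, reward, done = env.step(action)    # environment answers with r_t, s_{t+1}
    total += reward                           # the agent accumulates reward over time
```

The `reset`/`step` shape mirrors the state-action-reward cycle in the lecture's diagram; real RL libraries use a similar interface.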
[00:20:33] So this is a typical setup of supervised learning, and some of the key differences between reinforcement learning and supervised learning are that the environment might be stochastic: for the same actions, the environment might change in a different manner. Let's say you are pushing a box forward; depending on the distribution of the supporting force, the same exact actions can potentially lead to the box rotating to different angles. Meaning there can be uncertainties and stochasticities in the environment that lead to stochastic behaviors of the environment, which will also give the agent stochastic rewards, where the same action may not always lead to the same reward. So this is very different from supervised learning: we are dealing with an uncertain dynamical system.
[00:21:26] The second is about the question of credit assignment. For supervised learning, you give the inputs, you predict the outputs, and you directly calculate the loss: you directly know what mistakes and what errors you are making with a specific prediction. But in the reinforcement learning, or sequential decision-making, domain, the rewards can be delayed. If you play the game of Go, only at the very end of the episode do you realize whether you are winning or losing, and that reward of one or zero might be because of some very, very early steps, maybe even the first step, or some steps during the middle of the game.
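One common way to handle this delayed-reward situation is to propagate the final reward back to earlier steps as a discounted return, G_t = r_t + γ·G_{t+1}. A minimal sketch, with the function name and the example reward sequence invented for illustration:

```python
def discounted_returns(rewards, gamma=0.99):
    """Propagate a (possibly delayed) reward back to earlier steps:
    G_t = r_t + gamma * G_{t+1}, computed by scanning the episode backwards."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# A Go-like episode: zero reward everywhere, 1.0 only on the final move.
returns = discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5)
# -> [0.125, 0.25, 0.5, 1.0]: earlier moves get an exponentially discounted
#    share of the final win, which is one simple form of credit assignment.
```

The discount factor γ controls how far back the final reward reaches; γ close to 1 spreads credit over long horizons.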
[00:22:09] So how to properly assign the credit you are getting along this sequential decision making to all of the actions is another very tricky and important question people hope to answer within reinforcement learning. The third thing is the non-differentiability of these dynamical systems. For supervised learning, you have the inputs, you feed the inputs through the model, you get outputs, and you calculate the loss, so everything along this process is differentiable: you can directly get gradients of the loss function with respect to the parameters within the model. But that's typically not the case for reinforcement learning, where the environment can often be non-differentiable. So how to properly gather gradients of the rewards with respect to the actions can be tricky.
[00:23:01] Sometimes people have to rely on massive sampling to do those types of zeroth-order estimations of the gradients in order to do proper learning. That is also another difference. The last difference is about the non-stationarity of these scenarios, where the evolution and the states of the environment are really a result of your actions. For supervised learning, no matter what you predict, it doesn't influence the other data points you are getting from the dataset. But your actions will influence the next states you are getting in this sequential decision-making problem. This is also what makes these kinds of reinforcement learning problems a little bit more nuanced than supervised learning. So here are some more specific examples.
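The idea of estimating a gradient by sampling alone can be made concrete with a small sketch. This uses an antithetic two-point estimator on a made-up scalar "reward"; in a real RL setting the perturbations would be applied to policy parameters, and the reward would come from environment rollouts:

```python
import random

def reward(theta):
    # Made-up black-box reward; the learner never differentiates through it.
    return -(theta - 3.0) ** 2

def zeroth_order_grad(theta, sigma=0.1, n_samples=100):
    """Estimate dR/dtheta purely from reward evaluations (no backprop):
    average of (R(theta + sigma*eps) - R(theta - sigma*eps)) * eps / (2*sigma),
    with eps drawn from a standard normal."""
    total = 0.0
    for _ in range(n_samples):
        eps = random.gauss(0.0, 1.0)
        total += (reward(theta + sigma * eps)
                  - reward(theta - sigma * eps)) * eps / (2 * sigma)
    return total / n_samples

random.seed(0)
theta = 0.0
for _ in range(300):                     # gradient ascent on the noisy estimate
    theta += 0.02 * zeroth_order_grad(theta)
# theta climbs toward the maximizer of the reward, near 3.0
```

The key point matches the lecture: only reward *values* are queried, never gradients of the environment, at the cost of many samples per update.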
[00:23:48] For example, playing these Atari games: as I mentioned earlier, the goal could be to complete the game with the highest score, the states will be the raw pixel inputs from the gaming screen, the actions could be up, down, left, and right from the keyboard, and the rewards are the score increases and decreases at each time step. Some typical algorithms within this domain lie in the field of, for example, Q-learning, or, for example, policy iteration. Here is one example of trying to learn this Q function.
[00:24:21] The Q function essentially measures the discounted expected future accumulated reward when you apply a specific action A at a specific state S. You will be able to get this Q function through interactions with the gaming environment, and after you have learned it, you can evaluate, for example, what Q values you get by applying different actions; in this case there are left and right, up and down. So there could potentially be four actions, and given these four actions, you can look at their Q values and just execute the action that gives you the highest Q value. So that is what allows you to do this type of decision making in this domain.
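A minimal tabular Q-learning sketch in the spirit of what the lecture describes (greedy action selection over learned Q values); the 1-D corridor environment and all constants here are invented for illustration, and Atari-scale versions replace the table with a deep network:

```python
import random
from collections import defaultdict

# Invented toy task: a corridor with states 0..4; reaching state 4 pays 1.
ACTIONS = [-1, +1]  # "left", "right"

def step(state, action):
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

Q = defaultdict(float)            # Q[(s, a)]: discounted expected future reward
alpha, gamma, eps = 0.5, 0.9, 0.3

random.seed(0)
for _ in range(300):              # episodes of trial and error
    s, done = 0, False
    for _ in range(10_000):       # safety cap on episode length
        # epsilon-greedy: mostly exploit current Q values, sometimes explore
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # Q-learning update toward the one-step bootstrapped target
        target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
        if done:
            break

# Greedy policy from the learned Q values: argmax over actions in each state.
greedy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(4)]
```

After training, the greedy policy reads the table exactly as the lecture describes: in each state, execute the action with the highest Q value.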
[00:25:07] So, because today we're going to cover a lot of material, we won't go into the details of reinforcement learning, but some of the current state-of-the-art reinforcement learning algorithms include SAC (soft actor-critic) and also PPO (proximal policy optimization). So if you are interested, you are very welcome to look at those algorithms in detail; there are a lot of open-source implementations and tutorials online. But here I want to highlight some of the results you could potentially get by going through this reinforcement learning, specifically Q-learning, process. This was developed by Google DeepMind, which trained an agent to play the game Breakout in this kind of Atari world.
[00:25:54] After just 10 minutes of training, the agent can already touch the ball, but it still misses the ball quite often. After some more training, for example two hours, the agent can control the paddle in a much more reliable and consistent fashion: it can catch the ball nearly every time and keeps collecting more and more reward by bouncing it back.
[00:26:32] And after about four hours of training, something interesting happens: the agent comes up with a novel strategy, which is possibly not known to many of you, of bouncing the ball back so as to carve a tunnel on the left side of the wall, and then pushing the ball along the upper side of the wall to clear those bricks very efficiently. This is the type of strategy that can be discovered by reinforcement learning. That's what's nice about reinforcement learning: you allow the agent to do very extensive and comprehensive exploration of and interaction with the world, and it is entirely possible for a reinforcement learning agent to discover strategies that are better than those of even the best human players. A very typical example is the game of Go.
[00:27:28] When AlphaGo came out in January 2016, it was about the time I was trying to decide what research direction I was going to take. Before then, I had just been working on deep learning for computer vision, but when AlphaGo came out I thought, I have to work on this kind of decision-making problem. That's why I started to touch upon reinforcement learning and imitation learning, all the way to what I do now, robot learning that lets robots physically interact with their environments. I wasn't satisfied with just working on passively collected datasets; we really wanted an agent that can actively interact with the environment. So the question was how, specifically, this Q function works.
[00:28:12] You can see that Q takes as input the state s and the action a, and θ here is essentially the parameters of this Q function, where Q is instantiated as a neural network. In this specific case, like I mentioned earlier, the state is the raw pixel input taken directly from the game screen, so the input could be four consecutive frames fed directly into the Q function. And if you're dealing with images, a very straightforward way of instantiating this Q function is to use a convolutional neural network.
[00:28:43] So you have convolutional layers, shown as the orange blocks, and then fully connected layers that directly produce the Q values. In this case there are four discrete actions; in Breakout it's probably just left and right, but let's say there are four discrete actions, up, down, left, and right. You then get a separate Q-value estimate for each specific action a, and that's how you can use these Q values to decide which action to take, namely the one that maximizes the Q value. Does that answer your question? Yes. This was when AlphaGo came out, and obviously since then there have been a lot of developments and evolution in making these game-playing agents better and better.
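The shape flow just described (four stacked frames in, convolutional layers, then fully connected layers out to one Q value per action) can be sketched in plain NumPy with untrained random weights. The frame size, filter count, and layer sizes here are made-up stand-ins; a real DQN-style network has more layers and learned weights:

```python
import numpy as np

def conv2d(x, w):
    """Naive 'valid' convolution: x is (C, H, W), w is (F, C, k, k),
    output is (F, H-k+1, W-k+1). Slow, but shows the shape flow."""
    F, C, k, _ = w.shape
    _, H, W = x.shape
    out = np.zeros((F, H - k + 1, W - k + 1))
    for f in range(F):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[f, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[f])
    return out

def q_network(frames, params):
    """Map 4 stacked grayscale frames (4, H, W) to one Q value per action."""
    h = np.maximum(conv2d(frames, params["conv_w"]), 0.0)   # conv + ReLU
    return h.reshape(-1) @ params["fc_w"] + params["fc_b"]  # FC head

rng = np.random.default_rng(0)
params = {
    "conv_w": rng.normal(scale=0.1, size=(8, 4, 3, 3)),    # 8 filters over the 4 frames
    "fc_w":   rng.normal(scale=0.1, size=(8 * 8 * 8, 4)),  # flattened conv output -> 4 actions
    "fc_b":   np.zeros(4),
}
frames = rng.normal(size=(4, 10, 10))  # stand-in for 4 consecutive game frames
q = q_network(frames, params)
action = int(np.argmax(q))             # greedy action from the Q values
```

Stacking four frames is what lets a feedforward Q network see motion (ball velocity) that a single frame cannot convey.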
[00:29:31] So then later there was AlphaGo Zero, which is essentially a simplified version of AlphaGo: it no longer uses any imitation learning for initialization, and it was able to beat the number-one player of that time, Ke Jie. This is actually one lesson people learned in the AI community, what you could call the bitter lesson, from Rich Sutton: sometimes you want to find the simplest recipe that is best compatible with scaling. You want to leverage the power of scale, and sometimes making the method simpler will actually give you better performance, by making it more compatible with whatever infrastructure you can use for scaling.
[00:30:18] Then later they developed AlphaZero, which generalizes the same set of algorithms beyond Go to other games like chess and shogi, and then they designed MuZero, which does not just do model-free reinforcement learning but learns a latent-space dynamics model to plan over, which gives even better performance. For this specific domain, especially game playing, these developments really empowered a lot of design work on how to build better, more sample-efficient, and more scalable reinforcement learning agents. And in November 2019, Lee Sedol, who had been beaten by AlphaGo, announced his retirement; he realized it was simply not possible at that point for any human player to beat the best Go AI
agents out there. [00:31:14] And obviously since then there have been other, more complex games like StarCraft and Dota, which show that as long as you put in enough compute, and as long as you have a very well-designed algorithm and the infrastructure to do the reinforcement learning, you can get very, very good performance even in games that are noticeably, by orders of magnitude, more complicated than the game of Go. So I would say that if you have a reasonably designed game, there is a very legitimate chance that, if you put in sufficient resources, you can build very, very powerful game-playing agents. And not just in games: people have also been developing reinforcement learning algorithms and agents that work directly in the real physical world.
[00:32:05] On the left is work from ETH, published in Science Robotics in 2020, that essentially changed my mind about how useful reinforcement learning can be for real physical robots. Before, it was mostly games, and you could argue that in games you can just spawn as many game instances as you like, and you train on the same game you test on. For robots there is always a sim-to-real gap: if you train in simulation, how much does that gap matter when the agent has to generalize to the real environment? This paper really convinced me that sometimes the sim-to-real gap just may not matter that much. We are not simulating the bushes, we are not simulating the snow, but an agent trained with reinforcement learning in simulation can give you very, very robust
performance in the real physical world, on snow and on very, very slippery surfaces. [00:33:00] On the right is a very recent video released by Unitree that shows another level of dexterity in locomotion: the same kind of sim-to-real transfer allows these robots to perform very dynamic behaviors and navigate very rough and challenging terrain. I would say the domain of robot locomotion is close to being a solved problem, and the solution to this problem is exactly reinforcement learning. So that's locomotion. The other domain is manipulation, where the robot has to manipulate objects in the real physical world.
[00:33:44] In 2019, when OpenAI was still working on robotics, they designed a system for dexterous manipulation of a Rubik's cube: they did reinforcement learning in simulation and sim-to-real transfer to let the robot solve the Rubik's cube. One caveat is that the success rate was very, very low. Although the video looks beautifully done, if you really look at the paper they only tested a very limited number of trials, and given that number the reliability is arguably not very satisfying. Still, since then people have been able to extend this dexterous manipulation work, allowing robots to do enhanced dexterous manipulation and reorientation of different types of objects into different target configurations,
all thanks to the development of reinforcement learning. [00:34:42] But you can see that the examples so far, in locomotion and in-hand manipulation, don't really solve the problem of, for example, a robot that can just fold the clothes or do the laundry for you in your home. Manipulation is still stuck in these very isolated domains and environments. So here are some of the key challenges and bottlenecks of existing model-free reinforcement learning. It mostly learns from trial and error with the environment, and it requires extensive interaction with the world.
[00:35:21] For example, AlphaGo Zero learned the equivalent of 3,000 years of human knowledge in 40 days, which is amazing, but that still amounts to many, many years' worth of computation for the agent to learn. In domains where there is a huge sim-to-real gap and you would have to do the reinforcement learning in the real physical world, that is a huge bottleneck for training reinforcement learning agents effectively. And of course, if there is a sim-to-real gap and you can only learn in the real environment, there are a lot of safety concerns. For example, here is the learning progression of an agent controlling a humanoid robot to move forward. Although at the very end the robot is able to move forward, during the learning process there are a lot of
very weird behaviors, [00:36:09] and you can totally imagine that if you deployed this agent on a real physical robot it would fail catastrophically. It also has very limited interpretability, and it is sometimes very hard to correct things when they go wrong. One interesting thing, if you really think about how humans learn to interact with the environment versus pure reinforcement learning, is that we humans have a very intuitive understanding of the environment: we can imagine how the environment is going to change if we apply a specific action. It is exactly this predictive capability that allows us humans to plan our behavior to achieve specific targets. And this predictive
capability is itself learned from humans' physical interaction and everyday experience with the real physical world. [00:36:55] So, going beyond reinforcement learning, the next topic I want to discuss is how we can endow robots with a similar capability to imagine the effects of their actions and to do model-based planning. For these locomotion examples we have a simulation; the simulation people typically use is essentially a bunch of rigid-body simulations where the robot just touches a polygon-type representation of the floor. It is not simulating the bushes; it is not simulating the snow. But what people do is randomize the simulated environment a lot.
[00:37:46] They randomize the friction, the geometry, and many other physical parameters inside the environment, under the assumption that whatever you encounter in the real physical world is just one data point within the distribution you randomized over in simulation. If your policy is robust at controlling the robot across that distribution, and the real world really is just one data point within it, then the policy can generalize. And so far, at least from the empirical evidence, this assumption actually holds, and the policies work very reliably and robustly in the real physical world. Now, the question was about what the actual command is. In many of the existing demos there is in fact a person providing high-level commands to the robot: for example, which direction should the robot walk?
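The randomize-everything recipe described here amounts to sampling a fresh set of physical parameters for every training episode. A minimal sketch follows; the parameter names and ranges are invented for illustration and are not taken from any specific simulator:

```python
import random

def sample_env_params(rng):
    """Domain randomization: draw one simulated world per training episode.
    Parameter names and ranges are illustrative, not from any simulator."""
    return {
        "friction":       rng.uniform(0.4, 1.2),   # ground friction coefficient
        "mass_scale":     rng.uniform(0.8, 1.2),   # per-link mass multiplier
        "terrain_height": rng.uniform(0.0, 0.08),  # bump height (m)
        "motor_delay":    rng.randint(0, 3),       # control latency in sim steps
    }

rng = random.Random(0)
# One randomized environment per episode; the hope is that the real world
# looks like just one more draw from this distribution.
episodes = [sample_env_params(rng) for _ in range(1000)]
frictions = [e["friction"] for e in episodes]
```

A policy trained to succeed across all of these draws has no way to overfit any single friction or mass value, which is exactly why the real world's particular values stop mattering.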
[00:38:31] Should the robot rotate in place, or just keep walking forward? Conditioned on that high-level action provided by the human, the robot has to decide the low-level actions, which are typically, for example, the joint torques applied to each and every joint on the robot. So that is how this typically looks: a human gives high-level commands, and conditioned on those the robot uses its policy to decide the low-level actions, instantiated as joint torques. Like I mentioned, the biggest lesson I learned from this line of work on locomotion is that the simulation doesn't have to be perfect: as long as you randomize enough, you can generalize very robustly to the real environment.
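For concreteness on what a "low-level action" looks like: while some policies output joint torques directly as the lecture says, many locomotion systems instead have the policy output target joint positions, which a simple PD loop converts into torques. This is a generic sketch of that conversion with illustrative gains, not the controller of any system mentioned here:

```python
def low_level_torques(q, qd, q_target, kp=40.0, kd=2.0):
    """PD position control: torque per joint from the gap between the
    commanded pose and the current pose. Gains here are illustrative."""
    return [kp * (t - p) - kd * v for p, v, t in zip(q, qd, q_target)]

q        = [0.1, -0.3, 0.0]   # current joint angles (rad)
qd       = [0.0,  0.5, 0.0]   # current joint velocities (rad/s)
q_target = [0.0,  0.0, 0.2]   # pose commanded by the high-level policy
tau = low_level_torques(q, qd, q_target)
```

Either way, the division of labor is the same: the human supplies a coarse command, the policy supplies a per-joint signal at a high rate.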
[00:39:19] But that lesson hasn't really carried over very well to the manipulation domain. In manipulation, how accurate the simulation needs to be, and how much the sim-to-real gap matters, is still a research question people hope to answer. I can give you one specific example. Suppose in simulation you are pushing a box forward. If in simulation the box rotates by 10° but in reality it rotates by 12°, that may not matter much. But if in the real world your grasp was successful, while in simulation the object just flies away because of some numerical issue, or the object slips out from between your fingers, that is problematic. So there are regimes where the sim-to-real gap matters, and there are other
regimes where the sim-to-real gap may not matter that much in the manipulation domain, [00:39:59] and people are still trying to understand how the sim-to-real gap arises and what the most important recipes and characteristics are for a simulation to support the most reliable sim-to-real transfer. So, if I understand your question correctly, you are asking: there is still a person providing high-level commands to the robot, so can the robot come up with better plans than a human? I can actually give you a more nuanced perspective. Although many of these videos look very nice, there is a human operator steering the robot and choosing which route to take. For example, what people typically do is, say there is some kind of rough terrain or a pile of rocks.
[00:40:45] The human can command the robot to go forward and try to climb those rocks. If that fails, the human can provide some other high-level command to get around the pile of rocks. So there can also be some learning on the human side, in understanding the capabilities of those robots. This is also why some of these videos can look very nice: the human selects the routes that the human knows will show the limits and the capabilities of these low-level controllers. How to do that autonomously is actually a very interesting question people are also doing research on. [00:41:22] Mhm. So then I'm going to continue.
[00:41:27] So I have discussed some of the successful examples and the power of reinforcement learning, and I also discussed its limitations: we still haven't seen very successful, wide-scale deployments of reinforcement learning in manipulation yet. And we humans don't just learn from trial and error; we actually build internal models. So we are asking the question: can we learn models from the robot's interactions with the environment, and use that model for the robot to do better physical interaction? Specifically, what we are touching upon, again back to this figure, is how we can learn approximations of the real physical world, and how this approximated physical world, running in the virtual domain, can help guide the robot's actions and decide what action to take in the real physical world.
[00:42:16] So let's say you already have the model. Say you have already learned a model, like the one we humans have in our mental environment: given the current state s_t and the action a_t, we can predict how the state of the environment will change into the next state s_{t+1}. This is essentially a forward model: given the current state and action, predict the next state. Then the problem of planning is essentially the inverse of this forward model: given the current state and the target state, come up with the actions that allow the robot to reach the target state from the current state, shown as the blue dots. We have a target here in red. We can have some initial guess of what the actions might look like.
[00:43:05] And our approximated learned model will be able to predict the sequence of state evolutions, shown as this green trajectory. Then we can measure the distance between the green dots and the red dots, and backpropagate, optimizing using the gradients of that distance with respect to all the actions along the trajectory, in order to know which actions can get us closer to the target shown in red. And obviously the model may not be accurate enough, so we typically only execute the first action, obtain the new state from the environment, and re-optimize the action sequence using gradient descent or any other optimization technique for this trajectory optimization.
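The receding-horizon loop described here (optimize the whole action sequence against the forward model, execute only the first action, then re-plan from the new state) can be sketched as follows. This is a toy illustration, not the lecture's actual system: a simple integrator dynamics s_{t+1} = s_t + a_t stands in for the learned neural model, and for that choice the gradient of the final-state cost with respect to every action happens to be available in closed form.

```python
import numpy as np

def forward_model(s, a):
    # Toy stand-in for a learned neural dynamics model:
    # a simple integrator, s_{t+1} = s_t + a_t.
    return s + a

def rollout(s0, actions):
    # Predict the final state by applying the action sequence.
    s = s0
    for a in actions:
        s = forward_model(s, a)
    return s

def plan(s0, target, horizon=5, iters=50, lr=0.1):
    # Gradient-based trajectory optimization: minimize
    # ||s_H - target||^2 with respect to the action sequence.
    actions = np.zeros((horizon, s0.shape[0]))
    for _ in range(iters):
        s_H = rollout(s0, actions)
        # For the integrator dynamics, d s_H / d a_t = I, so the
        # cost gradient w.r.t. every action is 2 (s_H - target).
        grad = 2.0 * (s_H - target)
        actions -= lr * grad  # broadcasts over all timesteps
    return actions

def mpc(s0, target, steps=10):
    # Receding horizon: optimize, execute only the first action,
    # observe the new state, and re-plan.
    s = s0
    for _ in range(steps):
        actions = plan(s, target)
        s = forward_model(s, actions[0])  # the "real" environment step
    return s

s = mpc(np.array([0.0, 0.0]), np.array([1.0, -2.0]))
```

Each outer iteration closes a fraction of the remaining distance to the target, which is why re-planning after every executed action tolerates model error.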
[00:43:56] And one of the key benefits, especially recently with the development of GPUs and neural dynamics models, is that you can use a GPU for parallel, simultaneous sampling, which allows you to do large-scale sampling and optimization of those action sequences quite efficiently. [00:44:12] So given this general framework, you have the model, which is this forward process, and you can always use the forward model to do this inverse optimization to come up with the actions that get you closer to your target configuration. And one of the key questions has always been what the right and most effective state representation is, and how we can learn this model based on that state representation.
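The parallel sampling of action sequences mentioned above can be sketched as a batched random-shooting planner. Here plain NumPy vectorization stands in for GPU tensor ops (the same batched code maps directly onto a GPU framework), and the same toy integrator dynamics replaces the learned model.

```python
import numpy as np

def batched_rollout(s0, action_seqs):
    # action_seqs: (N, H, D) -- N candidate action sequences scored
    # "in parallel" as one batched tensor op. For the toy integrator
    # dynamics, the final state is just s0 plus the summed actions.
    return s0 + action_seqs.sum(axis=1)  # final states, shape (N, D)

def sample_plan(s0, target, n=1024, horizon=5, noise=0.5, seed=0):
    # Random shooting: sample N action sequences, evaluate all of
    # them with the forward model at once, keep the best.
    rng = np.random.default_rng(seed)
    cand = rng.normal(0.0, noise, size=(n, horizon, s0.shape[0]))
    final = batched_rollout(s0, cand)
    costs = np.linalg.norm(final - target, axis=1)
    return cand[np.argmin(costs)]

best = sample_plan(np.array([0.0, 0.0]), np.array([1.0, -2.0]))
```

Because the candidates are scored in one batched call rather than one rollout at a time, sample count scales with available parallel compute, which is the efficiency point made in the lecture.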
[00:44:40] And over the years there have been many investigations into choosing different types of state representations. Some earlier work includes using just 2D images as the representation of the state and trying to learn pixel dynamics, meaning how the image will change if you apply a specific action. This is a line of work called deep visual foresight, which set up some of the initial work in the whole domain of world models. And by learning these pixel-based dynamics models, people can come up with strategies that, for example, minimize the distance between the current observation and the target; they can rotate objects and push objects around in order to achieve the target shown in green, with the current state shown in red.
[00:45:27] So that is pixel dynamics. What people can also do is use keypoints as a representation of the environment and learn keypoint dynamics models. Here what we can do is track the movement of the keypoints on top of this box through 3D space, and learn a neural dynamics model of those keypoints as a result of pushing actions. Then the robot can use this forward predictive model to plan its behavior to track specific trajectories, in order to push this box into a target configuration. [00:46:00] So besides keypoints, what if you encounter objects with even higher degrees of freedom?
[00:46:11] If you go one level finer, you can also represent those objects using a set of particles, essentially a set of points. This is actually work that was done while I was here as a postdoc, where we represented a pile of granular pieces using a bunch of particles and tried to predict how those particles would move around if you applied a specific action, and this forward model allows the robot to do the inverse decision-making.
[00:46:41] It handles a wide range of granular objects of different granular sizes, and we come up with strategies that gather those pieces into the target region, shown in the bottom-right corner of each segment. The same model, with good feedback from the environment, allows the robot to correct for the model's errors and come up with strategies that very reliably aggregate all the object pieces into the target region. And this model not only generalizes to different granular pieces of different sizes; you can also change to different target configurations. Here you will very quickly realize what the target configurations are: the robot has to come up with a strategy to do non-trivial redistribution of the granular pieces.
[00:47:30] And after the redistribution, it has to align the fine-grained details with the target shape in order to accomplish this pile rearrangement task. The task here is actually to rearrange the granular pieces into different letter shapes, all the way from letter A to letter Z. With this kind of forward model, we are very successful in coming up with a sequence of strategies, of course with feedback from the environment, to allow the robot to rearrange the object pieces into the target regions. And this is actually a highly non-trivial task. Going beyond that, we also have a subsequent work, which I was also involved in and which was done while I was here at Stanford, where we designed a dumpling-making robot equipped with 15 different 3D-printed tools.
[00:48:20] We have four RGB-D cameras looking at the environment to reconstruct the geometry of the dough, and the robot has to decide what tool to use and what action to take in order to turn this dough into a dumpling. The key enabling factor is again this forward predictive model, represented using particles. Here the red dots represent the shape of the tool and the blue dots represent the shape of the object. The first row is our model's open-loop prediction and the second row is what actually happens in the real environment. So this learned model, which learns directly from real-world interactions, can accurately predict the change in the shape of the dough when using different tools and applying different actions, and this allows us to have an integrated system that can make a dumpling out of a dough.
[00:49:11] What's interesting about this video is that there's a person constantly perturbing the robot from doing its job. The robot takes the real-time visual feedback from the environment to understand the shape of the dough in real time, and then, using the current observation and the learned dynamics model that predicts how the environment will change, how the dough's shape will change if you use a tool to apply a specific action, it makes this inverse decision based on the forward model. This decision happens at two levels. At a high level it decides what tool to use, which is a task-level decision; and given the tool, the robot also has to make lower-level, motion-level decisions about what specific action to take in order to progress into the next task stage.
[00:50:02] Humans are just so annoying, adding pieces, folding the dough. The robot is very robust to these external disturbances and continues its progress on the task. Here's what's interesting: after the robot cuts a circle, the human shows no mercy and destroys everything. The robot knows it actually has to start from the very beginning, redo the task from the beginning, in order to progress toward the task objective. So this really shows the patience and also the robustness of our system under this type of external disturbance. And all of these capabilities are enabled by this neural dynamics model that predicts how the shape of the dough will change if you apply a specific action. In the end, the robot places the skin on top of the dumpling clip, moves the filling onto the dumpling skin, and uses a hook to close the dumpling clip.
[00:50:50] So you use this general-purpose robot, equipped with 15 general-purpose tools, to make a dumpling out of a dough. So this is about how we can learn the model and how that model can be useful for downstream model-based planning. For this specific case, if we want to describe it more rigorously, we are not using reinforcement learning: we just learn the model and use that model to do planning, although the plan can be distilled into a policy that can be executed in the real environment in a more efficient manner. But some people also call it model-based reinforcement learning. Depending on which background you are coming from, you can either call it model learning and model-based planning, or you can call it model-based reinforcement learning.
[00:51:35] But the key idea is that you want to learn the model from the robot's physical interactions with the real physical world, and use that learned model, which is very effective in helping the robot decide its behavior, to progress toward the task objective. [00:51:47] So in this specific case, the high-level planning and the low-level decision-making are done by two different models. At the high level, given the current state, the current observation of the environment, and the target the robot hopes to achieve, there is essentially a classifier that classifies which tool to use; and conditioned on this classified tool label, there is a low-level policy that decides what specific action to take in order to progress into the next task stage. Very good question. So back then, this work was done in 2023. At that time, vision-language models weren't very powerful.
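The two-level decision-making just described can be sketched minimally as a high-level tool classifier feeding a tool-conditioned low-level policy. Everything concrete below is a hypothetical stand-in for the learned components: the tool names, centroid features, and gains are invented for illustration only.

```python
import numpy as np

# Hypothetical feature centroids per tool, e.g. averaged from a
# handful of human demonstrations (stand-in for a trained classifier).
TOOL_CENTROIDS = {
    "roller": np.array([1.0, 0.0]),
    "cutter": np.array([0.0, 1.0]),
}

def choose_tool(obs):
    # High-level, task-level decision: nearest-centroid classification
    # of the current observation.
    return min(TOOL_CENTROIDS,
               key=lambda t: np.linalg.norm(obs - TOOL_CENTROIDS[t]))

def low_level_policy(tool, obs, target):
    # Low-level, motion-level decision conditioned on the tool label:
    # a proportional step toward the target, standing in for a
    # learned per-tool policy.
    gain = {"roller": 0.5, "cutter": 0.2}[tool]
    return gain * (target - obs)

obs = np.array([0.9, 0.1])
tool = choose_tool(obs)
action = low_level_policy(tool, obs, np.array([0.0, 0.0]))
```

The design point is the dispatch structure: the classifier's discrete output selects which continuous controller runs, so recovering from a disturbance just means the classifier re-labeling the current observation as an earlier task stage.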
[00:52:25] So at that time, what we did was to have a human operator collect data by demonstrating the task 10 times. We used that data to train this classifier to classify what tool to use. That allows us to jump back and forth over this chain: like I mentioned earlier, after the robot cuts a circle and the human destroys everything, the robot should jump back to whatever earlier stage fits its current observation, in order to do the proper recovery from the external disturbance. So in this specific case, what we have been doing is a combination of sampling-based trajectory optimization and policy learning. What we have been doing is: given the current state of the dough, we have our forward predictive model.
[00:53:08] We sample a bunch of actions and a bunch of tools to predict the evolution of the shape of the dough, and then we compare the model's prediction with the target we hope to achieve, which is similar to what I showed earlier. For example, our model predicts the shape of the dough will evolve into these green dots, but the target is these red dots. We compare their distance, and that allows us to select the most effective actions that get us as close to the target as possible. We can do a lot of samples like this, but sampling at test time is very time-consuming. So we do this type of sampling in an offline fashion, which gives us a dataset, and we can use that dataset to train a policy whose inference takes a very short period of time at test time.
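The offline distillation recipe described here (run the expensive sampling planner over many states to build a dataset, then fit a fast policy on it by supervised learning) might look like the sketch below. A linear least-squares policy and the toy integrator dynamics stand in for the neural-network policy and learned dough model.

```python
import numpy as np

def expensive_plan(s, target, rng, n=512, horizon=5):
    # Offline, slow: random-shooting planner over toy integrator
    # dynamics; returns the first action of the best sampled sequence.
    cand = rng.normal(0.0, 0.5, size=(n, horizon, s.shape[0]))
    costs = np.linalg.norm(s + cand.sum(axis=1) - target, axis=1)
    return cand[np.argmin(costs)][0]

def distill(n_states=200, seed=0):
    # Build a (state-residual, planner-action) dataset offline, then
    # fit a fast linear policy a = (target - s) @ W by least squares.
    rng = np.random.default_rng(seed)
    X, Y = [], []
    for _ in range(n_states):
        s = rng.normal(size=2)
        target = rng.normal(size=2)
        X.append(target - s)
        Y.append(expensive_plan(s, target, rng))
    W, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
    return W

W = distill()

def fast_policy(s, target):
    # Test-time inference: one matrix multiply instead of 512 rollouts.
    return (target - s) @ W
```

The expensive sampling happens once, offline; at deployment only the distilled policy runs, which is the speed trade-off the answer describes.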
[00:53:54] There is still a neural network as the policy, yeah, although that policy is learned by distilling from our model's predictions over a huge number of samples. For this specific work, there's no physics-based simulation at all. We actually have a baseline that uses a state-of-the-art deformable-object simulator based on MPM, the material point method. What we realized is that even if we do very extensive system identification, estimating the parameters of those physics-based deformable-object simulators, the identified model is noticeably less accurate than the model directly learned from the real-world interactions. Like I showed earlier, for example, the first row is our model's open-loop prediction and the second row is the ground truth.
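For contrast, the system identification mentioned here amounts to fitting an analytic simulator's physical parameters against observed transitions. A toy sketch with a single hypothetical "damping" parameter (the real MPM simulator has many more, but the fitting loop has the same shape):

```python
import numpy as np

def simulator(s, a, damping):
    # Hypothetical analytic model: a damped integrator.
    return damping * s + a

def fit_damping(transitions, grid=np.linspace(0.5, 1.0, 51)):
    # System identification by grid search: pick the parameter value
    # minimizing one-step prediction error on real (s, a, s_next) data.
    errs = []
    for d in grid:
        err = sum(np.sum((simulator(s, a, d) - sn) ** 2)
                  for s, a, sn in transitions)
        errs.append(err)
    return grid[int(np.argmin(errs))]

# Generate "real-world" transitions from damping = 0.9 plus small
# observation noise, then recover the parameter.
rng = np.random.default_rng(0)
data = []
for _ in range(100):
    s, a = rng.normal(size=2), rng.normal(size=2)
    data.append((s, a, 0.9 * s + a + rng.normal(0, 0.001, size=2)))
best = fit_damping(data)
```

The lecture's point is that even the best-fit parameters of a hand-built model can still underfit real dough behavior that a neural dynamics model trained directly on interaction data captures.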
[00:54:36] Our model's prediction aligns very well with the ground truth, which is just much more accurate than the physics-based simulators out there. Okay, so if there are no more questions, I will continue. What we have discussed is this kind of model learning and how the learned model can be effective for downstream model-based planning. The next category of algorithms is imitation learning. To recap a little: we discussed reinforcement learning, which learns the policy directly through trial and error with the environment and has a lot of troubles, for example sample efficiency and safety concerns. Model learning, by contrast, falls back into the category of supervised learning, where we have recorded the evolution of the environment.
[00:55:25] We use that data to do supervised learning to train the model, and then use the model for model-based planning. And instead of using supervised learning only to train the model, people also ask: can we do supervised learning for the policy as well? This is the general idea of imitation learning: can we collect a big dataset that shows how a task should be done and use it to train the policy? I'm showing this figure again. We are trying to learn a policy that takes the state as input and predicts the action, and all of the learning signal comes from large-scale data collected from humans demonstrating to the robot how the task should be done. Learning from demonstration is of course not new.
[00:56:12] It has been investigated for decades, and it is also essentially how we humans learn to perform a lot of physical interactions and social activities in the real world from a very young age. One of the earliest classic imitation learning algorithms is called behavior cloning, which essentially tries to learn the mapping from observation o to action a, with the policy represented by a function pi parameterized by theta. One of the key issues with behavior cloning is called cascading error, because, as I mentioned, the key difference when a robot or agent interacts with an environment is that it is a sequential decision-making problem.
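Behavior cloning as defined here is just supervised regression from observations to actions. A minimal sketch, under the simplifying assumption of a linear policy pi_theta(o) = theta^T o trained by gradient descent on the mean-squared imitation loss (the shapes and hyperparameters are illustrative only):

```python
import numpy as np

def train_bc_policy(observations, actions, lr=0.1, epochs=500):
    """Behavior cloning: fit pi_theta(o) ~= a on (observation, action)
    pairs from demonstrations by minimizing the MSE imitation loss."""
    n = observations.shape[0]
    theta = np.zeros((observations.shape[1], actions.shape[1]))
    for _ in range(epochs):
        pred = observations @ theta                          # pi_theta(o)
        theta -= lr * (observations.T @ (pred - actions)) / n  # MSE gradient step
    return theta
```

Any richer function class (an MLP, a transformer) slots into the same recipe; only the loss and the dataset of demonstrations define the method.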
[00:57:01] It differs from typical supervised learning in the computer vision domain in that your errors can accumulate and be amplified over time. Say you make a very small error at the very beginning. That small error can lead you to a state that deviates slightly from the distribution of data used to train your model. That leads the policy to make an even larger error, and this error gets amplified over the temporal horizon, producing a trajectory that deviates quite a lot from the demonstration trajectories. That is the typical failure mode of behavior cloning. So when people try to make imitation learning work, they often follow a pipeline where, at the top, we have demonstrations collected by experts.
[00:57:48] Then we use that as training data for supervised learning to train the policy, roll out the policy in the real environment, and observe the failure cases. We then either collect additional data or provide corrective behaviors, so the dataset contains not only the initial demonstrations but also corrections that steer the policy's errors back to the canonical trajectory, or at least back to a trajectory that can still successfully accomplish the task. This is the typical life cycle when we develop any imitation learning agent or algorithm in the real physical world. And along these lines: when people do this kind of imitation learning, there is no very explicit definition of what the task actually is; the task is implicitly hidden within the demonstrations.
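The rollout-observe-correct cycle just described is the idea behind DAgger-style data aggregation. A schematic sketch, where `train`, `rollout`, and `expert_correct` are hypothetical stand-ins for the trainer, the real-world rollout, and the human corrective labeling:

```python
def imitation_loop(train, rollout, expert_correct, demos, rounds=3):
    """Iteratively: train on the dataset, roll out the policy, have an
    expert relabel the states the policy actually visits, and aggregate
    those corrections into the dataset (DAgger-style aggregation)."""
    data = list(demos)                 # initial expert demonstrations
    policy = train(data)
    for _ in range(rounds):
        visited = rollout(policy)      # states reached by the learner itself
        data += [(s, expert_correct(s)) for s in visited]
        policy = train(data)           # retrain on demos + corrections
    return policy
```

The key point is that corrections are collected on the learner's own state distribution, which is exactly what plain behavior cloning never sees.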
[00:58:42] So there is a class of algorithms called inverse reinforcement learning. On the left is how people typically think about reinforcement learning, whereas on the right, inverse reinforcement learning is used to summarize a reward from your demonstrations, and that reward can then be used in ordinary reinforcement learning to learn a policy. Some of the earliest success examples were actually developed here at Stanford by Pieter Abbeel and Andrew Ng, which allowed them to control helicopters to perform some very, very aggressive maneuvers. This is actually quite old work, and being able to achieve this kind of agile and effective behavior on a real physical helicopter was very impressive at the time.
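A tiny sketch of the inverse-RL idea: recover a linear reward r(s) = w . phi(s) under which the expert looks best, by pushing the reward weights toward the expert's average features and away from the current policy's. This is a bare-bones update in the spirit of apprenticeship learning, not Abbeel and Ng's exact algorithm, and every name here is illustrative:

```python
import numpy as np

def fit_linear_reward(expert_feats, policy_feats_fn, lr=0.1, iters=50):
    """Adjust reward weights w so the expert's feature expectations score
    higher than the current policy's under r(s) = w . phi(s)."""
    w = np.zeros_like(expert_feats)
    for _ in range(iters):
        w += lr * (expert_feats - policy_feats_fn(w))  # ascend (mu_E - mu_pi) . w
        n = np.linalg.norm(w)
        if n > 1.0:
            w /= n                                     # keep rewards bounded
    return w
```

In a full system, `policy_feats_fn` would run RL against the current reward and return that policy's feature expectations; the learned reward then drives ordinary reinforcement learning, as the lecture describes.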
[00:59:31] So this is the power of learning from demonstrations: using the demonstrations to summarize a reward and, in connection with reinforcement learning, this is what we are able to achieve. Obviously, over the years people have made imitation learning algorithms more and more effective, especially by connecting them with, for example, energy-based models. The explicit policy shown on the left directly maps from observation o to actions.
[00:59:58] If you instead come up with an implicit policy, taking ideas from energy-based models, the model takes the observation and a candidate action and predicts a score, and doing inference with this energy-based model produces the predicted action. That allows robots to handle demonstrations that are highly multimodal, and to handle scenarios where the optimization landscape may not be very smooth, distilling policies from demonstrations for these kinds of contact-rich manipulation tasks. I should also say that some of the very recent success of robot learning as a whole is the result of a work called diffusion policy, which again takes advantage of advances in the generative modeling community.
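The implicit-policy inference step can be sketched very simply: score (observation, action) pairs with an energy and return the lowest-energy action. This is a derivative-free caricature of implicit behavior cloning (real systems optimize the energy more carefully), with hypothetical names and a 2-D action space for illustration:

```python
import numpy as np

def implicit_policy_inference(energy, obs, num_samples=256, seed=0):
    """Implicit policy: instead of a = pi(o), sample candidate actions
    and return argmin_a E(o, a), the best-scoring action for this obs."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(num_samples, 2))  # 2-D actions
    scores = np.array([energy(obs, a) for a in candidates])
    return candidates[scores.argmin()]
```

Note why this helps with multimodality: if the energy has two equally deep minima, an explicit regressor would average them into an invalid action, while the argmin commits to one mode.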
[01:00:52] For the implicit behavior cloning work, people drew inspiration from the development of energy-based models, a type of generative model developed in the deep learning community. There is another class of more powerful models in the deep learning community called diffusion models, and people have also tried to use diffusion models as the policy function class, allowing the agent to inherit the benefits and properties of those diffusion models. This work was originally done at Columbia, which is where I am right now, and the lead PI of this work has since come to Stanford.
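The core training objective behind a diffusion policy can be sketched compactly: corrupt a demonstrated action with noise at a random diffusion step and train a network to predict that noise, conditioned on the observation. This is a simplified DDPM-style objective with an illustrative cosine schedule, not the diffusion policy paper's exact recipe; `noise_pred` is a hypothetical network:

```python
import numpy as np

def diffusion_policy_training_step(noise_pred, obs, action, T=100, seed=0):
    """One denoising-objective step: noise the demonstrated action at a
    random timestep t, then score how well noise_pred(obs, noisy, t)
    recovers the injected noise."""
    rng = np.random.default_rng(seed)
    t = int(rng.integers(1, T))
    alpha_bar = np.cos(0.5 * np.pi * t / T) ** 2   # illustrative noise schedule
    eps = rng.normal(size=action.shape)
    noisy = np.sqrt(alpha_bar) * action + np.sqrt(1 - alpha_bar) * eps
    return np.mean((noise_pred(obs, noisy, t) - eps) ** 2)
```

At inference time the trained network is run in reverse, iteratively denoising random noise into an action conditioned on the observation, which is what gives the policy its multimodal, smooth action distributions.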
[01:01:34] You can see that many of the works I selected have roots here at Stanford; she is currently in the EE department at Stanford. This policy really shows a very diverse set of capabilities, allowing robots to do not just planar pushing but many fine-grained manipulation tasks: not only pick-and-place but, for example, spreading butter on bread, scrambling eggs, peeling potatoes, and sliding books. It really shows that with this type of recipe, where you collect a bunch of demonstrations and use the best policy-learning mechanisms, you can get a policy that works in the real physical world in a very, very efficient manner.
[01:02:21] Meaning: you collect the data in the morning, you train the policy at noon, and in the afternoon you can have a working policy in the real physical world. Obviously there are a lot of caveats in how reliable your policy is, how generalizable it is, and how diverse the initial configurations can be while the policy still works robustly. But imitation learning is still the most efficient way to get a policy that can do something interesting in the real physical world. And for the policy to be effective and robust to real-world variations, this type of iterative data collection needs to be in place for the policy to cover unexpected or deviating behaviors. So that's imitation learning. Any questions? Okay.
[01:03:15] So if there are no more questions, I will use the remaining time to discuss some of the recent developments that drive all the excitement about robot learning, which is robotic foundation models. Of course, this is a very involved domain; for each one of these items you could build an entire course around it. So for today's lecture I am just skimming through them very quickly, and I will only tell you the gist, the high-level knowledge you need when you see these terms. A robotic foundation model is a type of model that is very similar to reinforcement learning or imitation learning in its function class: there is no explicit representation of, for example, the states, and no model. A robotic foundation model does not learn a model of the environment.
[01:04:07] It is still a policy that maps from the observation and goal to the actions, and it can still be very nicely represented using these figures: you have an agent, which is a policy taking the current state and also the goal as inputs, trying to generate actions that can be executed in the real physical world. But you might say this is very similar to imitation learning and reinforcement learning, so what is special about robotic foundation models?
[01:04:37] This is all rooted in the developments within the foundation model domain, especially language foundation models and vision-language foundation models. Meaning: it is a policy, but it needs to generalize much better than a policy that just works for one specific task. Here is actually my definition, drawing an analogy from the current development of vision-language models: their outputs may not always be perfect, but the promise of a foundation model is that it always generates something reasonable. So what we hope to achieve with a robotic foundation model is that the synthesized action may not always be the optimal action conditioned on the observation and the task, but the generated trajectory will always be beautiful and reasonable to execute in the real physical world.
[01:05:28] Beautiful meaning it should not be any jiggling motion; it should be smooth and continuous. Reasonable meaning it should listen to the language instructions you give to the robot. Obviously there are also many different names describing exactly the same thing. Some people call them vision-language-action models, or VLAs; some people call them large behavior models. But in essence they are all describing the same thing: a policy that takes the observation and a language instruction, or whatever the task specification is, and tries to generate actions that generalize widely across a wide range of scenarios. Now, this area is actually quite noisy. Noisy meaning it is very, very hard to quantify the progress of different robotic foundation models, because you are calling it a foundation model; what does that mean?
[01:06:17] That means you expect the model to generalize very broadly over a wide range of scenarios, and if that is your expectation, you actually need significant evidence to show it really does generalize broadly. That is why evaluation and quantitative measurement of progress is very challenging. But still, by looking at the empirical videos, you can see a lot of very interesting and concrete progress over the past few years.
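Pulling the definition above together, the interface of a vision-language-action policy can be sketched as follows. This is an interface sketch only: the internals are a placeholder, and a real VLA would put a large pretrained vision-language backbone behind this signature; all names, dimensions, and the action-chunk convention are assumptions for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    image: np.ndarray     # camera frame, e.g. (H, W, 3)
    proprio: np.ndarray   # robot joint state

class VLAPolicy:
    """One model maps (observation, language instruction) to a short
    chunk of future actions; that signature is the whole point."""
    def __init__(self, action_dim=7, horizon=8):
        self.action_dim = action_dim
        self.horizon = horizon

    def act(self, obs: Observation, instruction: str) -> np.ndarray:
        # Placeholder internals: return a zero action chunk of shape
        # (horizon, action_dim) so the interface is runnable.
        return np.zeros((self.horizon, self.action_dim))
```

Every model named below, whatever its internals, exposes essentially this contract.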
[01:06:44] A lot of the earlier investigation started with RT-1, which was released in December 2022, and since then, roughly every half year, there has been a new model: RT-2, RT-X, OpenVLA, and some recent ones like π0, each making concrete progress along this line of developing more and more generalizable robotic foundation models. And this year there has been a huge burst of foundation models: Helix, Hi Robot, Gemini Robotics, π0.5, etc. So there is a lot of investigation and also investment in this domain, not only capital investment but talent investment, in developing better and more generalizable robotic foundation models. Due to time, I clearly cannot go into the details of all these models.
[01:07:38] So if you are interested: I actually gave a tutorial two months ago at AAAI specifically describing and discussing some of the models along this axis; please go and watch it. For today, I will mostly give you a high-level overview of what is actually essential for this kind of foundation model and what it looks like, with π0 as an example. π0 was first released in October 2024. I think this is the work that convinced me that this type of robotic foundation model can do some very reliable, dexterous manipulation in real-world environments. It can handle cloth folding, box folding, and many other types of manipulation tasks in a very reliable manner. And here is how the framework looks at a high level. On the left are the datasets.
[01:08:35] So for any model to be called a foundation model, it needs fuel, and that fuel is data. They aggregate a lot of data, both from academia and data collected by themselves, across many different embodiments, where the robots are doing interesting and useful tasks in real-world environments, and they use this data to do pre-training. One important caveat of this pre-training is that it starts with a pretrained vision-language model, one already trained on vast amounts of vision-language data, so it can naturally adopt the semantics and knowledge from those models. Together, by doing what they call co-fine-tuning, using both an objective for action prediction and an objective adapted from vision-question-answering-style tasks, you will be
able to preserve the semantic knowledge within the model while at the same time predicting robot actions. That is the pre-training stage. [01:09:32] A very important design element for many of the existing robotic foundation models is called post-training, which is also inspired by developments in the large-language-model community: you have a base model, and the base model can give you reasonable baseline performance, but if you really want the performance to be very good on a specific task, you actually have to collect task-specific data and fine-tune the model, doing post-training on the data for that specific task, for the performance to be satisfactory. [01:10:02] So they evaluate their whole system over three different categories.
[01:10:08] For the first, you can directly use their base model, and the base model can already be good enough for some very simple in-distribution tasks, that is, tasks that may already have been encountered during the pre-training stage. For in-distribution tasks that are slightly more complicated, you can do post-training to allow the base model to further improve on those tasks. And for unseen tasks, you typically have to do post-training by collecting task-specific data and fine-tuning your pre-trained model on those tasks for it to be performant. [01:10:43] This π0 model is actually open-sourced, and you can just download the checkpoints. The students in my lab have already started playing with their models and trying to do post-training, and we are starting to see some very promising results.
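The three usage tiers just described (base model for simple in-distribution tasks, post-training for harder in-distribution tasks, post-training on new data for unseen tasks) can be summarized in a small decision sketch. This is purely illustrative; the function name and tier labels are my own, not part of any real π0 API.

```python
# Illustrative sketch of the three usage tiers for a pre-trained robot policy.
# The function and the returned labels are hypothetical, for exposition only.

def adaptation_strategy(in_distribution: bool, complex_task: bool) -> str:
    """Pick how to use a robotic foundation model for a target task."""
    if in_distribution and not complex_task:
        # Simple tasks likely already seen during pre-training: base model suffices.
        return "use base model directly"
    if in_distribution:
        # Harder but still in-distribution: post-train to sharpen performance.
        return "post-train on in-distribution data"
    # Unseen tasks: collect task-specific demonstrations, then fine-tune.
    return "collect task-specific data, then post-train"
```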
[01:10:57] So if you are interested, you are highly encouraged to try it. [Audience question] That is a very good question. You are essentially asking about the efficiency of existing robotic foundation models. There are a lot of reasons why the policy is actually slower than humans. One of the major reasons comes from how the demonstration data was collected. Typically, in many of these scenarios, the demonstration data was collected by a human teleoperating the robot, on that exact same robot, to do the data collection, for example folding this box. And human teleoperation is actually slower than a human just using their hands to do the task, even if you have given them hours of training. This is because you are using a different embodiment and environment than the ones a person is most familiar with.
And also, at the same time, because the robot arms are a certain distance away from you, there will be occlusion. Sometimes you have to look very closely and carefully, moving your head and changing the viewing angle, in order to really understand whether it is time to progress to the next task stage or not. There are a lot of caveats and inefficiencies in the current data-collection regime. That is why a policy trained directly on those data turns out to be slower than human speed. So there is a lot of investigation into how we can make this kind of data collection even more efficient, at human speed; that is actually a very active research direction. [Audience question] So this is a very good question. For this box-folding task, I would argue this is already a very long-horizon task.
So I was very impressed by how well this one single policy is able to handle this long-horizon task. But you could argue that if you really want this policy to be useful at a larger scale, in wider scenarios in your home, you not only want the robot to fold a box; you want it to fold shirts and make the beds and clean all the messes on the floor. In those types of scenarios, currently, I personally don't believe one gigantic policy is able to adapt. Some higher-level abstractions, some kind of scene graph or symbolic representation, needs to be in place as a condition for these vision-language-action models, for those policies to be most effective and useful, to steer them toward different types of tasks and scale to larger environments and more complicated tasks.
[01:13:21] They started with a pre-trained vision-language model, so there is already a lot of semantic knowledge that was learned through large-scale pre-training on vision-language data. That is why some of the generalization comes for free, meaning the base model can actually have surprisingly good levels of generalization at the semantic level. It's just that you have to fine-tune this model with robot data to make sure it can also generalize not only at the semantic level but also at the action level.
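The co-fine-tuning idea mentioned earlier, jointly optimizing an action-prediction objective and a vision-language objective so the backbone keeps its semantic knowledge, can be sketched as a weighted loss. This is a toy illustration under my own assumptions: the function names and the simple MSE action loss are stand-ins, not the actual π0 training code (π0 itself uses a flow-matching action objective).

```python
import math

# Toy sketch of "co-fine-tuning": a single scalar loss mixing an
# action-prediction term (robot data) with a language/VQA term
# (vision-language data), so that fine-tuning for actions does not
# erase the backbone's semantic knowledge. Illustrative only.

def action_loss(pred, target):
    """Mean squared error over a predicted action chunk (stand-in objective)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def vqa_loss(answer_probs, correct_idx):
    """Negative log-likelihood of the correct answer token."""
    return -math.log(answer_probs[correct_idx])

def co_finetune_loss(pred, target, answer_probs, correct_idx, lam=0.5):
    """Weighted sum; lam trades off action accuracy vs. retained semantics."""
    return action_loss(pred, target) + lam * vqa_loss(answer_probs, correct_idx)
```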
[01:13:48] So maybe we can take further questions afterwards, because we're already about out of time. In the last two or three minutes, I'll discuss some of the remaining challenges, especially along the development of robot learning models. One of the major challenges the whole community recognizes is evaluation. Evaluation is currently primarily done in the real world. For example, this is a picture from Google robotics: they have a grid of these teleoperated ALOHA systems with which they do data collection and also evaluation. And real-world evaluation is both costly and noisy. Their exact words to me were that for evaluation, they have a large enough budget that they can still make progress. Those were their exact words.
Meaning, if you were to do the evaluation, or I were to do the evaluation, the results can be very different from each other, depending on how we specify the initial configuration and how the lighting conditions change. Even the friction parameters from the manufacturer can make a huge difference in how robust your downstream policy is. So this is very costly, and they have to wait two days for the results to come back. And currently there is very weak correlation between the training loss and real-world success rates. This is another very important caveat, and a difference between supervised learning and this kind of sequential decision-making, this kind of policy learning: for supervised learning, your training loss directly measures how good your model is.
But for this kind of policy learning, your training loss measures how good the one-step prediction is, which sometimes may not be, and actually often is not, indicative of the performance of the policy over a long task horizon. Even if your loss is low, for long-horizon task execution your policy can actually be worse. The mismatch between the training objective and the task-specific metrics, between training and test horizons, is part of why it is very hard to come up with even approximate or proxy metrics to measure the performance of the policy, and people have to rely on real-world evaluation. [01:15:46] So then the question is, what about doing the evaluation in simulated environments? There has been a lot of investigation here, for example BEHAVIOR, which is done in Fei-Fei's lab here at Stanford, and also Habitat 3.0 from Meta.
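The point above about one-step loss versus long-horizon success can be illustrated with a toy compounding-error model. All numbers here are invented for illustration: a constant per-step error that looks negligible to a supervised loss can still drift the rollout out of the success region over a long horizon.

```python
# Toy illustration of why low one-step loss need not imply long-horizon
# success: a small constant per-step error produces a tiny supervised loss,
# but the error compounds over the rollout. Numbers are purely illustrative.

def one_step_loss(step_error: float) -> float:
    """What supervised training sees: the squared error of a single step."""
    return step_error ** 2

def rollout_succeeds(step_error: float, horizon: int, tol: float = 1.0) -> bool:
    """What we actually care about: does accumulated drift stay within tol?"""
    return abs(step_error) * horizon < tol

# A 0.05 per-step error yields a tiny loss of 0.0025, yet over a 40-step
# horizon the accumulated drift (~2.0) exceeds the tolerance: the task fails.
```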
People are trying to come up with these extensive simulated environments to do evaluation and measurement of robot policies, and obviously they have their own issues, especially with regard to the sim-to-real gap: how can you do simulation of rigid bodies, deformable objects, and cloth accurately enough that it correlates well with real-world performance? Assets are also another major issue, where large-scale generation of those assets is a huge pain (I can elaborate, but maybe after the lecture). How to digitize the real world, and how to do procedural generation of realistic and diverse scenes, are all issues with using simulation to do evaluation for robot learning policies. And really, we want to find a correlation between sim and real; it's calling for an ImageNet moment in embodied AI, because the reason ImageNet was
successful is that, at least for a few years, any progress on ImageNet meant progress in deep learning and computer vision. We want the same thing: we want a platform such that any progress on that benchmark or platform means progress in robot learning. That's something we really want. [01:17:09] And I'll maybe skip through this: we talked about how to build these foundational policies; there can also be investigation into how to build foundational world models. Especially now, people are collecting large-scale action-conditioned robot interaction data to train these foundation policies, and there is a lot of dynamics knowledge embedded in those data; if we only use those data to do policy learning, that would be such a waste.
So we are also thinking about how we can use this large-scale action-conditioned robot interaction data, already collected to train those foundational policies, to train foundational world models, and how the two can interplay with each other. There are some existing works thinking along this direction of building foundational world models, and there are some very interesting design questions you might think about: do you want it to be 3D, do you want structural priors, how much learning versus how much physics, and how you can correlate it with the real physical world. [01:18:04] And actually, I think we are about out of time, so I will end here. This is the future we hope to achieve: to really build foundational robotic models that can work very widely and generalize very well in the unstructured environments around us.
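On the evaluation challenge raised above, wanting sim performance to correlate with real performance, one minimal, concrete check is the correlation between per-policy success rates measured in simulation and in real-world trials. A sketch follows; the success-rate numbers are made up, and Pearson correlation is just one possible choice of agreement metric.

```python
import math

# Sketch: quantify sim-to-real agreement as the Pearson correlation between
# each policy's success rate in simulation and in real-world trials.
# The success-rate numbers below are invented for illustration.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Success rates for four hypothetical policies, measured in sim vs. real:
sim_rates = [0.9, 0.7, 0.5, 0.2]
real_rates = [0.8, 0.6, 0.5, 0.1]
# A correlation near 1.0 suggests sim rankings transfer to the real world.
```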
[01:18:21] And the next lecture will be on human-centered AI. And that will be the end of today's lecture. Thank you so much.

================================================================================ LECTURE 018 ================================================================================

Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 18: Human-Centered AI
Source: https://www.youtube.com/watch?v=g8UaBfj6Sh8

--- Transcript

[00:00:05] Welcome to the last lecture of the quarter for CS231N. It was great to see you guys at the beginning and now at the end. This lecture is a little bit of a departure: we're not going to teach any new material in terms of algorithms. It's more a talk that I'd like to give to students to offer a perspective, both on a longer-term research evolution and on another dimension that is important to today's AI, which we would call the human perspective.
[00:00:46] For completeness of the material, there is a little bit of overlap that you might see with other parts of this course, but hopefully it makes sense in a fuller way. The title of this lecture is "What We See and What We Value: AI with a Human Perspective." I know that some of you have already heard about this: the beginning, the origin of vision, both in terms of evolution and in terms of our technology. We did talk about the first light that came to the animal world, back 540 million years ago. That was when animals, or trilobites to be specific, developed photosensitive cells to glean what the outer world is about.
[00:01:47] According to zoologists like Andrew Parker, what happened is that the onset of vision set off an evolutionary arms race where animals either evolved or died. That arms race gave rise to the explosive speciation of animals, which zoologists now call the Cambrian explosion, or the big bang of evolution. And of course, you wouldn't be surprised that vision is still, to this day, a primary sensory and intelligence system in many, many animals. Not all animals use vision, admittedly, but many do, and it is also one of the primary sensory systems for humans. We use vision to do everything from survival to work to entertainment to socialization to learning and development and many other things. So that's the recap, or summary, of evolution.
[00:03:05] We also briefly talked about computer vision being a summer vision project back in the 1960s, an attempt to use a couple of undergrads to construct a significant portion of the visual system. That was very in line with the history of AI, where we tend to have clarity about the north star but underestimate how long it would take. We are probably still experiencing that today. But a lot has happened, right? You don't need me to tell you that from empowering self-driving cars to understanding images to the generative AI revolution, we're seeing vision play a huge role, and in many parts lead the wave.
[00:03:58] So maybe it's time to take a different look at this, both historically and going toward the future: where have we come from, and where are we going? This is an important topic to discuss, because a lot of what has happened will inform what will happen. I'm organizing this talk in three chunks. First, building AI to see what humans see; that's where we came from, that we were so inspired by human capability that we wanted to make machines that do the same. Then we'll talk about building AI to see what humans don't see, and we'll finish with building AI to see what humans would like to see. Let's just start with the first one: building AI to see what humans see. Again, just a little bit of review. Humans are so good at seeing. We know this.
[00:04:58] This is a half-century-old experiment showing us that even when watching a video you've never seen, played at 10 hertz, which means every frame is on the screen for only about 100 milliseconds, it is still no problem for human eyes to detect a target, in this case a person, in a complex scene where you have no a priori knowledge about who this person is. It really underscores the superb ability of human visual understanding, especially object-focused understanding. We also briefly mentioned that around the turn of the century, neurophysiologists were measuring the speed of vision, in terms of humans seeing complex objects, in the form of brain electrical signals measured from EEG caps.
[00:06:10] And we see that differentiating, or categorizing, animals versus non-animals is a very complex task. Yet humans are capable of doing that at 150 milliseconds after the onset of the stimulus. And this is remarkable speed given the wetware we have under our skulls. Neurophysiologists have also taught us that objects are a very important functionality in human visual intelligence. So important that there are neural correlates, brain areas dedicated to object understanding, such as face areas, place areas, or body-part areas. This shows that evolution has really spent time honing our visual intelligence skills when it comes to object recognition.
[00:07:10] So all this built up the history for the field of computer vision: a few decades ago, object recognition became a fundamental building block for visual intelligence, and we wanted to empower machines with that. And in order to do that, we defined the problem, or at least the original problem, as: given an image, how do we enable a computer to call out what the objects in the image are? That's such an effortless task for humans. But if you think about it, now that you've learned enough computer vision, mathematically there are infinite possibilities to recognize any object, because of different lighting, texture, background, occlusion, viewing angle, scaling, and whatever else you can name. So this is actually fundamentally a difficult task.
[00:08:15] The history pre-deep learning is also very interesting. There were some pretty heroic attempts at solving the problem of generic, generalizable object recognition, and the first wave of attempts was actually very much inspired by psychology itself. We self-introspect, sometimes even to the detriment of over-self-introspection. We think that humans compose parts, right? We look at objects, we can see geometric parts, and then we can compose them into different objects. And that idea of using pre-designated parts or shapes and composing them in specific ways was the first wave of object recognition. These are different works and models coming from the 70s, 80s, and even going all the way into the 90s, using different parts and configurations to recognize objects. Of course, it didn't really work.
[00:09:23] It's mathematically beautiful and simple, but it didn't work. So the second wave of object recognition pre-deep learning was a really important era in the field of AI: the beginning of statistical machine learning. It was the marriage between computer programming and statistical modeling. And with that marriage, we started to realize the world is so complex. For these intelligence problems, whether visual intelligence, language intelligence, or other kinds of intelligence, in order to generalize, we need to learn the parameters. It's very hard to use hand-tuned models to get good learning.
[00:10:18] We now know we need data, even though at that time we didn't know how much data, but we also knew that we need to design, or architect, statistical models so that they have the capability of learning through different learning rules. And because of that, we saw a blossoming of models in that era: random fields, Bayes nets, support vector machines, and all that. And in fact, a lot of progress was made. By the time we were in the first decade of the 21st century, we even had international object recognition benchmarks, with a small number of object classes, to encourage everybody to compare their algorithms. So we were inching forward together. The last unlock for object recognition, as we have learned, again goes back to cognitive science.
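To make the contrast with hand-tuned models concrete, here is a minimal sketch of learning parameters from examples via a gradient-descent learning rule. The toy data and the plain logistic classifier are my own illustrative choices, not any specific model from that era:

```python
import math
import random

def train_logistic(data, lr=0.5, epochs=300):
    """Fit weights w and bias b by gradient descent on the logistic loss.
    Nothing is hand-tuned: the decision boundary is learned from the data."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
            for i in range(dim):                # the "learning rule":
                w[i] -= lr * (p - y) * x[i]     # gradient of the log loss
            b -= lr * (p - y)
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy 2D data: label is 1 exactly when x0 + x1 > 1.
random.seed(0)
data = [((x0, x1), 1 if x0 + x1 > 1 else 0)
        for x0, x1 in ((random.random(), random.random()) for _ in range(200))]
w, b = train_logistic(data)
accuracy = sum(predict(w, b, x) == y for x, y in data) / len(data)
```

The same learn-from-data recipe, with different losses and model families, underlies the SVMs and Bayes nets of that era.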
[00:11:19] So this particular psychologist, Irv Biederman, had long conjectured that humans can recognize a huge number of objects, and this is intuitive from our common knowledge. But he actually put a number on it. I personally call it the Biederman number: he conjectured that by age six or seven, children are able to recognize about 30,000 to 100,000 different visual categories. Where did he come up with this number? From a combination of looking at the number of nouns in the dictionary and visual studies of how kids recognize different objects. But it's a number that's pretty daunting, and pretty sobering, for the field of computer vision, because up until then, the middle of the first decade of the 21st century, we were working with a tiny number of object categories and a tiny number of images compared to what humans experience.
[00:12:25] And this was, as you know, the onset of, the motivation for, the ImageNet project, which took this Biederman number really seriously. We constructed a dataset that is on par with what the psychologist Biederman conjectured: around 22,000 object classes over 15 million images. And of course, that's where this class begins: because of the large data provided by ImageNet, we started to see powerful algorithms like neural networks (at the beginning it was convolutional neural networks; of course now we use transformers and all that) really show their power through big data. And this is a generic slide; for those people who didn't learn about this, I'm going to skip it, because you all know this.
[00:13:30] So the quick history is: as soon as we had ImageNet, and as soon as we used convolutional neural networks, a few years after the beginning of ImageNet, we saw the door blasted open in terms of solving the problem of object recognition. Now we have algorithms we can take to any picture in the world and recognize the objects in it, big or small, in any kind of orientation. Is it 100% solved? No. There are always long-tail problems to solve. But as far as industrial application goes, this has come a long way and has really become a matured problem.
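For readers who want the named ingredient spelled out, the core operation of a convolutional neural network can be sketched in a few lines. This is a toy, valid-mode 2D convolution in plain Python; the edge-detecting kernel is an illustrative choice, not from any slide:

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution (really cross-correlation, as in most
    deep learning libraries): slide the kernel over the image and take a
    dot product at each position. Stacks of these form a ConvNet layer."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]

# A 2x2 vertical-edge kernel responding to a dark-to-bright boundary.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge_kernel = [[-1, 1],
               [-1, 1]]
feature_map = conv2d(img, edge_kernel)  # peaks where the edge sits
# feature_map == [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

A real ConvNet stacks many such filters with learned values, nonlinearities, and pooling; frameworks simply do this at scale on GPUs.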
[00:14:17] And of course, as all of you know, this all came at the convergence point, the year 2012, where the ImageNet challenge provided the data for the convolutional neural network, and they used two GPUs at that time. The three ingredients came together and brought the moment, the birth, of deep learning. And in this class we also talked a little bit about the various architectures that the ImageNet challenge engendered throughout the past decade or so: convolutional neural networks, ResNet, and so on. So that's really the beginning of the deep learning revolution. And of course, in terms of the quest for visual intelligence, we're not going to stop at just being able to label objects in a scene. For example, take these two scenes, right?
[00:15:19] If you just label objects, you'll think it's just a llama and a person. But if I show you the second scene with the llama and a person, the story is completely different. Even though you have the same objects, you have a very different relationship. So cognitive scientists, once again, were ahead of computer scientists, and inspired us to think about visual intelligence beyond just naming or categorizing objects. In this particular paper, Jeremy Wolfe, who is a pretty prominent psychologist, wrote a beautiful paper that called out that relationships between objects must be encoded as part of our understanding of complex natural scenes. And inspired by that work, the field of computer vision started to look at how we understand relationships. And this is early work.
[00:16:22] You guys got a lecture from Ranjay last week. This was his PhD thesis: learning object relationships using the scene graph as a representation. In this case, a scene graph is defined by entity nodes, which are objects; their relationships are defined by the connectivity between the nodes, and sometimes the nodes carry attributes that describe the particular objects. Even a scene as simple as this one, with mostly just two people, one feeding a cake to the other, can form a very dense scene graph because of the richness of the visual scene. And this was Ranjay's thesis. After the ImageNet object recognition era, we built a dataset called Visual Genome, where we tried to put together objects, object relationships, and also story descriptions.
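The scene graph described here can be sketched as a small data structure. The object and predicate names below are illustrative, not Visual Genome's actual annotations: objects become nodes, relationships become labeled (subject, predicate, object) edges, and attributes hang off the nodes.

```python
class SceneGraph:
    """Objects as nodes, (subject, predicate, object) triples as edges,
    with optional per-object attributes."""

    def __init__(self):
        self.objects = set()
        self.attributes = {}    # object name -> set of attribute strings
        self.relations = set()  # (subject, predicate, object) triples

    def add_object(self, name, *attrs):
        self.objects.add(name)
        self.attributes.setdefault(name, set()).update(attrs)

    def add_relation(self, subj, pred, obj):
        self.relations.add((subj, pred, obj))

    def relations_of(self, name):
        """All triples in which an object participates."""
        return {r for r in self.relations if name in (r[0], r[2])}

# Roughly encoding the cake-feeding scene from the lecture.
g = SceneGraph()
g.add_object("man", "smiling")
g.add_object("woman")
g.add_object("cake", "chocolate")
g.add_relation("woman", "feeding", "man")
g.add_relation("woman", "holding", "cake")
```

Even this tiny scene yields several nodes, edges, and attributes, which is why real images produce such dense graphs.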
[00:17:28] One piece of work Ranjay did that I thought was really fun was zero-shot learning of unusual object relationships. For example, it's not unusual to see a person riding a horse, and it's not unusual to see a person wearing a hat, but it's unusual, in general, to see a horse wearing a hat. And in the era of big-data training, it's hard to get this kind of data repeatedly, because you just don't have many examples. But using this compositional scene graph representation, we're able to learn the more common relationships and then derive the uncommon relationships from that representation.
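A toy sketch of the compositional idea (a simple count-based plausibility score of my own devising, not the actual model from the paper): learn per-slot statistics from common triples, then score an unseen combination like (horse, wearing, hat) by recombining the pieces.

```python
from collections import Counter

# Illustrative "common" training triples.
train = [
    ("person", "riding", "horse"),
    ("person", "riding", "bike"),
    ("person", "wearing", "hat"),
    ("person", "wearing", "shirt"),
    ("dog", "wearing", "hat"),
]

subj_pred = Counter((s, p) for s, p, _ in train)  # e.g. ("person", "riding")
pred_obj = Counter((p, o) for _, p, o in train)   # e.g. ("wearing", "hat")

def plausibility(s, p, o):
    """Back-off score: the full triple may be unseen, but its
    (subject, predicate) and (predicate, object) parts can still be known."""
    return subj_pred[(s, p)] + pred_obj[(p, o)]

# "horse wearing hat" never appears in training, yet its parts compose:
score = plausibility("horse", "wearing", "hat")     # "wearing hat" is known
nonsense = plausibility("hat", "riding", "person")  # no part is known
```

The real model composes learned visual and language components rather than counts, but the benefit is the same: unseen combinations inherit evidence from their familiar parts.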
[00:18:10] And again, this is another example of zero-shot learning: person sitting on a chair, or a fire hydrant on the lawn, are common relationships, but person sitting on a fire hydrant is one for which it's hard to get data, and we were able to make that work. And this is just a figure from the paper showing that Ranjay's work at the time achieved state-of-the-art recognition rates compared to many other methods. But relationships are not enough, right? The ability to actually tell a story that is a lot richer, using natural language, is the next big goal. So around the year 2014 we started working on that problem. And think about it: that's just two years after the ImageNet AlexNet moment. But the field was starting to evolve so fast.
[00:19:18] We were so inspired by what we could do using a combination of a convolutional neural network and a language model called an LSTM. This was the thesis of Andrej Karpathy; we were one of the first teams to show how to do image captioning, or storytelling, as well as dense captioning, which was also part of the work that Justin Johnson did, and I know he's one of the co-instructors of this course. Between roughly 2015 and 2018, a lot of work happened to solve this problem. Of course, today, using multimodal LLMs, we have taken the solution to this problem to yet another notch. But this was the beginning of that line of work, and frankly I myself, as a computer vision scientist who entered the field at the beginning of the century, was very surprised by how fast our field was able to solve this problem, as soon as we had the data and the neural network algorithms.
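The CNN-plus-LSTM captioning recipe mentioned here, in the spirit of that era's models, trains a decoder to maximize the likelihood of the caption's words given image features. In standard notation (my formulation, not the lecture's):

```latex
p(w_1,\dots,w_T \mid I) \;=\; \prod_{t=1}^{T} p\!\left(w_t \,\middle|\, w_1,\dots,w_{t-1},\, \mathrm{CNN}(I)\right),
\qquad
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_\theta\!\left(w_t \mid w_{<t},\, \mathrm{CNN}(I)\right)
```

The LSTM carries the $w_{<t}$ conditioning in its hidden state, seeded by the CNN's feature vector; dense captioning applies the same objective per image region.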
[00:20:30] But a much harder problem is actually in dynamic scenes. In dynamic scenes, we tend to have much more complex relationships and much more complex movements; the camera can move, and the actors within the scene can do a lot of different things. So in this work, a collaboration with Ehsan and a bunch of students in our lab, we call it multi-object, multi-actor activity understanding. This is much newer work; we only published it a couple of years ago.
[00:21:13] Capturing the relationships between these actors and their activities in a dynamic scene is still, I would say, an unsolved problem, and this will have profound implications. You know you're in Silicon Valley, so you're hearing so much excitement about robots, for example. If we ever dream of having everyday robots that work amongst us, robots have to solve this problem: understand how complex the scene is, what people are doing, who is doing what, and what comes next. This is an unsolved problem. Also, in addition to what I have shown you, there are related computer vision problems you've learned a little bit about in this class but that we didn't have time to elaborate on, for example 3D computer vision, human pose understanding, and of course generative AI and generative models.
[00:22:18] So this is just to show you that the field of computer vision, since the rebirth of modern AI, has been moving extraordinarily fast. But the take-home message of this section, for me, is two things. One is that data, compute, and neural network algorithms truly converged about 13 years ago, and that was the moment the modern AI, or deep learning, revolution happened. But the history of that, and so much of the problems we have been working on, was truly inspired by cognitive science, psychology, and neuroscience. And that, to me, is going to continue: we will continue to be inspired by what the brain can do and how the brain does things, and we'll also continue to use AI to help our brain research.
[00:23:19] So there is a very intimate relationship between today's AI and cognitive science, neuroscience, brain science, and all that. So that's the first section, and of course a lot of people, students and collaborators, have contributed to what I have just presented. Now let's talk about going beyond: building AI to see what humans don't see. This is about pushing AI beyond the capability of humans; you can call it superhuman. For example, most people don't recognize a ton of dinosaurs. You can probably name a few; some kids really can name a lot. Let alone thousands or tens of thousands of bird species, or tens of thousands of car categories. So this is the line of work that I call fine-grained object categorization. Humans are just not that good at it. And this is still a problem that I don't think we've fully solved yet, to be honest.
[00:24:31] In this generative AI era especially, where we're talking a lot about multimodal LLMs, this problem has been somewhat neglected, or it just is not a mainstream problem, but it really will still come to play an important role. So in this early work of fine-grained bird species recognition, we put together, actually we used, a data set of 4,000 birds. And as you can see, as we go up the tree of the species toward more generalizable, more general names, the error decreases, which is a convoluted way of saying that by the time you're at the fine-grained level, we still make a lot of errors; the algorithms are still not totally ready. Another work that I find fascinating is that a few years ago a group of students in my lab trained
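The error-versus-taxonomy-level observation above can be made concrete with a toy sketch: score the same predictions at increasingly coarse label levels and watch the error fall. The four-species "taxonomy" and the predictions below are invented for illustration; the study described used a tree over a 4,000-bird data set and a learned classifier.

```python
# Toy illustration: classification error shrinks as labels are coarsened
# up a taxonomy. Species names and the tiny taxonomy are made up.

# Map each fine-grained (leaf) label to its coarser parents, leaf upward.
TAXONOMY = {
    "house_sparrow":  ["sparrow", "songbird", "bird"],
    "tree_sparrow":   ["sparrow", "songbird", "bird"],
    "american_robin": ["thrush",  "songbird", "bird"],
    "mallard":        ["duck",    "waterfowl", "bird"],
}

def labels_at_level(leaf, level):
    """Label for a leaf at a given coarseness level (0 = the leaf itself)."""
    return leaf if level == 0 else TAXONOMY[leaf][level - 1]

def error_at_level(pairs, level):
    """Fraction of (predicted, true) pairs that disagree at a taxonomy level."""
    wrong = sum(
        labels_at_level(pred, level) != labels_at_level(true, level)
        for pred, true in pairs
    )
    return wrong / len(pairs)

# Predictions that confuse similar species but rarely cross coarse groups.
preds = [
    ("house_sparrow", "tree_sparrow"),    # wrong species, right genus
    ("tree_sparrow", "tree_sparrow"),     # correct
    ("american_robin", "house_sparrow"),  # wrong species and genus
    ("mallard", "mallard"),               # correct
]

for level, name in enumerate(["species", "genus", "family", "class"]):
    print(name, error_at_level(preds, level))
```

Here the error is 0.5 at the species level but drops to zero by the family level, mirroring the pattern described in the lecture.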
a fine-grained car classifier in terms of make, model, and year. [00:26:00] It turns out that after the 1970s there are thousands of car models defined by different make, model, and year. And then we took Google Street View images from 100 or 200 major cities across the country, used the fine-grained car detectors to detect which cars are on the streets of these cities, and used that as a lens to study social patterns. For example, what is the pattern here? I showed education patterns: car models and education patterns are highly correlated, or income patterns, highly correlated. In that paper we showed voting patterns highly correlated, or even environmental patterns highly correlated.
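The "cars as a demographic lens" idea boils down to correlating per-city detection statistics with per-city census variables. Here is a minimal sketch with a hand-rolled Pearson correlation; the city-level numbers are invented, and the feature (fraction of newer sedans) is a hypothetical stand-in for the study's actual car attributes.

```python
# Toy sketch of correlating per-city car detections with demographics.
# All numbers are invented; the real study detected fine-grained car
# makes/models/years in Street View imagery across many US cities.
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-city features: fraction of detected cars that are
# sedans newer than 2010, and a demographic rate (e.g., college degrees).
new_sedan_fraction = [0.22, 0.35, 0.41, 0.18, 0.30]
college_rate       = [0.25, 0.38, 0.45, 0.20, 0.33]

r = pearson(new_sedan_fraction, college_rate)
print(f"correlation: {r:.3f}")  # strongly positive for these toy numbers
```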
[00:27:03] So it's a really interesting way of using computer vision as a lens to study our society, and no individual human, not even a collection of humans, can do this easily at all. So AI is really pushing the boundary of what humans can see. To drive home this idea, let's do a couple of tests. Humans actually have our limitations, right? I just talked about celebrating humans' ability to see, but we also have our limitations. This is a very famous visual illusion called the Stroop test. The idea is that you all can read the words, but if I ask you to read the color of each word as fast as possible, going left to right and top to bottom, you find it's not that easy, right? Try to read it: red, yellow, green, purple, blue, black, orange. It's fighting with you.
[00:28:10] This is the fight between visual attention and all that. Here's another example. There are two alternating images of the same picture, and there's one change, a pretty big change, happening between the two alternating pictures. I don't know if you spot the change. Do you spot it? The engine. Yes, it's the engine, right? It takes a while to spot it. This is a very famous psychology experiment called change blindness. Now, all this is fun. The Stroop test is fun; this is fun. But this is not fun: human attention is limited, and in some situations in our work and life, that kind of attention limit can be dire. For example, medical errors are the third leading cause of death in America's health care system. And of course, leaving a pair of scissors in the body of the patient is kind of the iconic image of medical errors.
[00:29:18] But there are so many medical errors: pharmaceutical errors, procedure errors, clerical errors, diagnostic errors. So one has to be very careful. For example, in surgery rooms, honestly, scissors don't get left in bodies typically, but much smaller things do, like suture needles or a piece of gauze. And today most of this is still tracked by hand, right? We have these checklists to track items in the surgery rooms. If something is missing, the surgery has to be paused. On average, that pause is close to an hour. And think about the danger for the patient, the exposure to bacteria and the bleeding and all that, just because we have to search for that item.
[00:30:19] So if there is a way to use AI to help our doctors and surgeons track items, that would be so powerful. This is just a demo, not a deployed system; we're not there in terms of fidelity, but it shows that we can use AI to count, in this case, gauze and all that. And this is just an example of pushing AI to see what humans don't see. Here's another example that is really fun. I don't know if I've shown this before, but this is one of my favorite visual illusions, and I'm just giving you the answer. If you look at the two squares A and B on the checkerboard on top, it is so hard to believe they have the same grayscale, or luminance. And then you look at the bottom, and you're like, "Ah, of course they do." But why? Even with the bottom picture in front of you, seeing the top still gives you the illusion.
[00:31:29] Why? Because evolution has pre-wired us to conjecture about, or understand, our world in its common way, with the common physics of the shapes of objects, lighting sources, how shadows are made, and all that. This is so deep in our evolution, in our visual development, that it's hard for us to see it another way. So what I'm trying to get at is that there is bias in our human visual system. The bias might come from evolutionary constructs. The bias can come from our social experience. The bias can come from the data we're exposed to. But some of these biases can be harmful, right? When bias happens, it becomes unfair to a group of people, a community, and we have to be aware of this. A few years ago, face recognition algorithms were not good, and they tended to recognize certain skin colors, and even genders, better than others, and it has consequences.
[00:32:45] Think about self-driving cars; think about many other medical use cases. So we have to be vigilant about this. I do believe AI bias is a problem that people now are caring about. A few years ago this problem was so new that many people were not even paying attention. Fast forward to 2025: I'm not saying we have solved this problem, but I'm personally a lot happier to see that so many people are paying attention to it, not only in academia but also in industry. And then there's another kind of not seeing, and this is interesting. Sometimes not seeing is exactly what we want, because you want to respect privacy. So how do you create AI that helps people to see, yet you still want it not to see what people don't want it to see? This is very deep. It's a technical problem as well as a human problem.
[00:33:55] So from a technical point of view, there are many ways to think about machine learning privacy; I'm just listing a few here from a visual point of view. A few years ago our lab wrote a paper about using smart cameras in patient rooms or patient homes to help doctors see better, but even there we have to recognize issues like faces, or just full-body information, and even homes. And this is a list of potential solutions. For example, you can do blurring, or you can do masking; you can do dimensionality reduction; but you can also try different approaches, for example federated learning, so that you don't send all the data to the server, or encryption, and other things. So I'm not going to belabor this, but there's one work I want to show you. It's not even my work, but I really like this work.
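Two of the simpler visual privacy transforms listed above, masking and resolution reduction (a crude stand-in for blurring), can be sketched in a few lines. This is a toy illustration on a small grid of grayscale values, not the method from the paper mentioned; the region coordinates and values are invented.

```python
# Toy sketches of visual privacy transforms: masking a sensitive region
# and destroying fine detail by block-averaging (crude "blur").

def mask_region(img, top, left, height, width, fill=0):
    """Overwrite a rectangular region (e.g., a detected face box)."""
    out = [row[:] for row in img]  # copy so the input stays untouched
    for r in range(top, top + height):
        for c in range(left, left + width):
            out[r][c] = fill
    return out

def downsample(img, factor):
    """Average non-overlapping factor x factor blocks, losing fine detail."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(0, h, factor):
        row = []
        for c in range(0, w, factor):
            block = [img[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

img = [[10, 20, 30, 40],
       [50, 60, 70, 80],
       [90, 100, 110, 120],
       [130, 140, 150, 160]]

masked = mask_region(img, 0, 0, 2, 2)  # hide the top-left 2x2 region
low_res = downsample(img, 2)           # 4x4 -> 2x2, detail destroyed
```

The trade-off the lecture raises is visible even here: both transforms remove identifying detail, but `downsample` also removes the very information (what the person is doing) that some applications need.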
[00:34:57] It's a work about taking videos of people and trying to recognize their actions while still respecting their privacy. How do you do that? For example, in this case you want to take a video of this kid moving in the scene. There are ways to do this: if you blur it, or defocus it, or do some of these things, yes, you can protect privacy, but you also lose enough information that you might not even know what this person is doing, and for many applications the whole goal is to know what this person is doing. So in this particular work, led by Hong Koen's students, they did a combination of a hardware and a software approach, where they handcrafted a lens that actually filters visual data in a particular way,
so particular that if you look at the top row, [00:36:10] what the lens captures into the camera protects privacy a lot. You don't see the person's face, you don't see the body, and so on. But because it's a lens that's specifically designed in connection with a piece of software, it can help back out the movement information, the human activity information, without backing out face information. So that's a really interesting approach: a hybrid between hardware and software, aimed at important applications where you want to see people in order to protect them, but you don't want to see too much, because you want to respect privacy. So that's a work I really like; I really like the spirit of that work. Okay. So in this part of the lecture I shared with you a number of considerations for building AI to see what humans don't see.
[00:37:13] Sometimes we're pushing AI, as in fine-grained recognition of birds, to go beyond human ability; those are superhuman abilities. Sometimes we know humans are not good; we have bias, or we have attention issues, and then we want to use AI to help us. And sometimes we genuinely have situations where we don't want anyone to see, and then how do you use AI to continue to help without violating those privacy concerns? So you can see that AI is a very interesting, powerful tool. It can help us, but it can also amplify us, and if we have bias, if we have issues, AI can amplify those too. So when we build AI, it is so important not only to take the technology perspective but also to take the human perspective: to commit to study, forecast, and guide AI, to understand its human impact, and to respect human values.
[00:38:14] So that's the second take-home message, and again a number of collaborators and students participated in this work. Okay, now let's talk about building AI to see what humans want to see. And in fact, we're going to go beyond seeing; we're going to connect seeing and doing together. So if you think about today's societal anxiety about AI, one of the biggest anxieties is labor. A lot of headline news will say labor is under threat, robots are taking over jobs. The truth is, the picture is complex. Denying job change is wrong: every technological shift in human history has caused labor market change, and some of them are very painful; some of them can even lead to civil wars and wars.
[00:39:16] But also, that change sometimes is inevitable. And, a tiny digression: a lot of the labor-threat rhetoric we have been hearing is about physical labor, but if you look at generative AI's impact over the past two years, it is white-collar jobs that are being drastically impacted, especially software engineering and analytical work in offices. So there's definitely labor change, but in the meantime we also need to recognize that AI can be helpful. We actually have fundamental human labor shortages in many situations, especially in elderly care as well as health care. First of all, as modern medicine improves, human life expectancy increases, and that inevitably pushes society toward longer living, and that's a good thing. But in the meantime, we have labor shortages.
[00:40:28] Young people need to work, and that's what keeps our society vibrant, our economy vibrant. But who is taking care of our elderly? Who is taking care of our chronically ill? Even in America's hospitals, we have such attrition of health care workers, especially nurses, that we don't have enough hands, ears, and eyes to help our patients. So instead of thinking about the word "replace," we can think about AI augmenting, and you got a glimpse of that in my surgery room example. Indeed, there are so many spaces in our health care system where we don't have enough pairs of eyes. That's what I call the dark spaces of health care: from the surgery room to the patient room to pharmacies to homes and so on. So how do we make AI help? This is something that Ehsan has been leading a ton of work on, also with Zing.
[00:41:37] We have been looking at this problem of ambient intelligence for healthcare, where we combine smart sensors with machine learning algorithms to glean health-critical insights from these healthcare settings, so that we can alert patients, family members, or doctors in time to help patients. And again, the fuller picture is in a paper we published a couple of years ago. Let me just give you a couple of examples. One example is this hand hygiene project, which actually started way before COVID. Hand hygiene turns out to be really important for keeping hospital infections low. Hospital-acquired infection is actually one of the leading causes of patient fatality in American hospitals; it kills more than three times as many people per year as car accidents nationwide. And it is really hard to control.
[00:42:44] Most of these germs are passed from patient room to patient room, and then they kind of just brew together. So what do we do? Hospitals try to use human auditors, but we just talked about how we don't even have enough nurses, let alone auditors; you cannot hire enough of them. There's human fatigue; we just talked about the human attention problem. So this is a pretty prohibitive solution. There were some technological solutions like RFID: put on a badge, and if the badge, or the person wearing the badge, is close to the sink or the hand sanitizer dispenser, it gives you a hint that the person, most likely a doctor or nurse, is washing their hands. But that's very non-specific. You cannot guarantee it, and hospital rooms are pretty small.
[00:43:40] Corridors are small, and just standing next to something doesn't mean you're doing it. So a few years ago we did this project where we put in smart sensors that protect privacy by gleaning only depth information, like the blue video there. Then we use a computer vision algorithm to classify actions: is the person washing their hands or not? And the result is that if you compare ground truth with the algorithm's output versus human detection results, you can see the algorithm is much better and more consistent than humans. You have to show the same video to almost four humans to get almost as good as the AI, and that is just not feasible. If it's one person, you can see how sparse the detection is, and that's not good. So this is one application. Another application we worked on is ICUs.
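As an aside on the hand hygiene result above: the auditor-versus-algorithm comparison is essentially a question of event-detection coverage. Here is a minimal sketch, with made-up timestamps rather than the study's data, of scoring recall against ground-truth hand-washing events:

```python
def recall(truth, detections, tol=5.0):
    """Fraction of ground-truth events matched by some detection within `tol` seconds."""
    hits = sum(1 for t in truth if any(abs(t - d) <= tol for d in detections))
    return hits / len(truth) if truth else 0.0

truth = [10, 40, 75, 120, 180, 240]   # ground-truth hand-wash events (seconds)
algo  = [11, 39, 77, 118, 182, 238]   # an always-on algorithm: dense coverage
human = [12, 121]                     # a single fatigued auditor: sparse coverage

print(recall(truth, algo))    # 1.0
print(recall(truth, human))   # ~0.33
```

A consistent sensor catches every event, while one human auditor's sparse observations miss most of them, which is the gap the lecture describes.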
[00:44:48] The ICU is where patients fight for their lives. The ICU is also where 1% of US GDP is spent. So making the ICU as effective and as safe as possible is a top priority. The goal of the ICU is to get patients safely out of the ICU and into step-down units, or even home. One of the most important things people have learned in the ICU is to help patients move; proper movement, which we call mobilization, is actually important for recovery. But this is a very dicey situation. You have to get nurses to help. Doctors have to give orders.
[00:45:42] Patients have to move properly, it has to be at designated times, and you have to assess the movement, and all this is not easy, right? So we collaborated with Stanford as well as Utah's Intermountain hospital to put these smart sensors in ICU units and help doctors monitor patient movement; in this particular case, four different kinds of movement: getting out of bed, getting into bed, getting out of a chair, getting into a chair. These things are so important for ICU patients. I know that for us it's a no-brainer, but this really is critical, and you can see that AI can help do the kind of detection and prediction that is so helpful for doctors, especially when there is a labor shortage. The last example is aging in place. And this is just so important for many, many reasons.
[00:46:42] Seniors want to live at home independently and healthily. And remember, at the beginning of COVID, when we had so much fatality among aging seniors, a lot of it had to do with hospital overrun and an overtaxed hospital system. So keeping seniors safe and well in their homes is really critical, and using smart sensors we can help with early detection of infection (especially using thermal cameras), or with mobility (as we just discussed for the ICU), or with understanding sleep patterns or dietary patterns. All of these are realms of possibility with AI and smart sensors. And then, last but not least: what if there's still a labor shortage even after smart sensors? The thing about smart sensors is that they are information-gathering systems; they cannot go over and help turn a patient, or bring water and medicine to the elderly.
[00:47:53] So this brings us to the last technical topic, which is embodied AI; a large part of embodied AI is robotics. This is where I find it extremely exciting, because it closes the loop between perception and action. Think about the Cambrian explosion in evolution: with the onset of eyes, animals started to move. So the area of robotics is where we can close the loop between seeing and doing. But it's not easy, right? Robots, as much as we're very excited by them, are still very, very slow. They are very, very clumsy. It's very hard for them to adapt and generalize to new situations.
[00:48:49] In today's robotics research, we as a field have made a ton of progress, and Stanford is definitely one of the centers of robot learning, but still, most of this work is constrained in its setup: short-horizon tasks like pick-and-place, with anecdotal setups and a lack of standard benchmarks. So let me just share a couple of works from our lab. One work, from a few years ago, looks at how to bring robots into the wild. If we have to pre-designate the set of tasks, it's kind of unsatisfying. On the other hand, if you look at today's LLMs, they're totally in the wild; you can talk about anything. So my student Wong and a few other students wanted to close this gap.
[00:49:52] So the idea is: how do we give an open instruction to a robot, any instruction, without pre-training everything in a closed world, and have the robot do the task? Let's say your training set is "open a drawer" like that; in the wild you have doors like that. So how do you make progress on that problem? The goal is in-the-wild generalization, and here's the overall algorithm. I don't know if this is glitchy, but what we're saying is: we want to tell this robot arm to open a drawer by planning a motion path that avoids knocking down that flower, and none of these instructions were pre-trained.
[00:50:59] So what we do is borrow the latest advances in LLMs as well as visual language models. The idea is that we use an LLM to give us an instruction set, and then we use a visual language model to help us recognize and understand the environment, and then we turn that into a motion planning map that the robotic arm can execute. Because we're using LLMs as well as VLMs, we get rid of the problem of training the robot in a closed world, and bring it to a more generalizable, in-the-wild setting. In detail: the instruction "open top drawer" comes in.
[00:51:51] The LLM turns this into, literally, code, and then, because of terms in these instructions like "drawer" or "handle", we send this information to a VLM, and that model detects the drawer and the handle in the scene. Because of that, it updates its information and then updates a motion map, presented as a heat map showing where the robot arm should focus and where it should not. With that, you then give it another instruction: but watch out for the vase. Again, it goes through the same thing: the LLM generates the code, sends it through the VLM, the VLM detects the object and then updates the motion planning map. In this case it's negative, not positive, because you want to avoid it. And then, combining with the previous map, you get a heat map of where to avoid and where to go.
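The pipeline just described (instruction → LLM-generated code → VLM detections → composed heat maps) can be sketched in miniature. Everything below is a toy stand-in, not the lab's system: `vlm_detect` fakes a detector, and Gaussian bumps on a small grid play the role of the positive (reach) and negative (avoid) value maps that the LLM-generated code would compose:

```python
import numpy as np

GRID = 10  # toy 10x10 top-down workspace

def vlm_detect(obj):
    """Hypothetical VLM stand-in: returns the grid cell where an object is 'detected'."""
    return {"drawer handle": (2, 7), "vase": (5, 5)}[obj]

def value_map(cell, sign, sigma=1.5):
    """Gaussian bump: positive = target to reach, negative = region to avoid."""
    ys, xs = np.mgrid[0:GRID, 0:GRID]
    d2 = (ys - cell[0]) ** 2 + (xs - cell[1]) ** 2
    return sign * np.exp(-d2 / (2 * sigma ** 2))

# Instruction 1: "open the top drawer" -> attract toward the handle
heat = value_map(vlm_detect("drawer handle"), +1.0)
# Instruction 2: "watch out for the vase" -> repel from the vase
heat += value_map(vlm_detect("vase"), -1.0)

# The motion planner would steer toward high-value cells
goal = tuple(int(i) for i in np.unravel_index(np.argmax(heat), heat.shape))
print(goal)  # (2, 7): the handle's cell, with the vase region suppressed
```

A real system would do this in 3D and also compose maps for rotation and gripper velocity, as the lecture mentions next; the point here is only how untrained instructions become a spatial objective.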
[00:52:58] And eventually, we do this for the motion planning map, and we do it for rotation and gripper velocity. And then this is the result. Actually, let me just show you this; this is the actual result on the robot. And then we do this for many different tasks, right? We can do it for articulated object manipulation. Here are many different examples: napkins, sweeping the floor, what is this, getting toast, setting up a table, and also dealing with online disturbances, and so on. So this is one work. Another work I want to quickly show you: overall, robotics research is still lacking good benchmarks.
[00:54:08] While we're still experimenting in the labs, we know the real world is so much more complex, so much more uncertain, has large variability, is so interactive and social, and involves a lot of multitasking. And we know that both natural language processing and computer vision have benefited a lot from large-scale datasets for both training and benchmarking. So in our lab we have been working on a project towards ecological robot learning: building an ecological robot learning environment and encouraging researchers to benchmark against a large and diverse set of activities. That's the BEHAVIOR benchmark: a benchmark for everyday household activities in virtual, interactive, and ecological environments.
[00:55:10] Now here's a question, because this lecture has a lot to do with human values: who is to say which tasks robots should do? I know that every graduate student working on robotics wants just two tasks: one is laundry, the other is the dishwasher. That's great. But moving beyond grad school, what are the tasks we should get robots to do for us? So instead of us coming up with this task list, we actually did a human-centered survey to ask robots, sorry, to ask humans: what would you like robots to help you with? Let me test this. Would you like a robot to help you clean the kitchen floor? Say yes or no. Okay, good. Normal people would say yes. Shoveling snow? Okay. Folding laundry? Okay, good. Cooking breakfast? See, we're getting mixed answers, right? What about opening Christmas gifts? Right. Exactly. People are different.
[00:56:24] I actually think a robot could do this pretty well, but we don't want it. One of the tasks we even asked about was buying wedding rings; can you imagine that? So what we did is, we wanted to respect human preference. We took a bunch of government surveys from labor offices in the US and Europe and so on, cleaned them, and put together thousands of everyday activity tasks, and then we went online to find people. We wanted to be as diverse as possible, though I think we have room to improve. We found 1,400 people to go through these tasks and tell us which ones they want robots to help with, and then we ranked them. And you can see that, just like grad students, people want robots to help with cleaning; a lot of cleaning: toilet cleaning, floor cleaning.
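The ranking step is simple preference aggregation. A toy sketch, with invented responses rather than the actual 1,400-person survey data, of ordering tasks by the fraction of respondents who want a robot's help:

```python
from collections import Counter

# Each respondent lists the tasks they would want a robot to help with (made-up data).
responses = [
    {"clean toilet", "clean floor", "shovel snow"},
    {"clean floor", "fold laundry"},
    {"clean toilet", "clean floor"},
    {"clean floor", "cook breakfast"},
]

votes = Counter(task for r in responses for task in r)
n = len(responses)
ranking = sorted(votes, key=lambda t: votes[t] / n, reverse=True)
print(ranking[0])  # "clean floor": wanted by all 4 respondents
```

Tasks like "buy a wedding ring" would sit at the bottom of such a ranking, which is exactly how the project filtered down to the tasks people actually want automated.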
[00:57:32] But people don't want robots to play squash for them, or to buy a wedding ring, or even to mix baby cereal. There are a lot of tasks that matter to us as humans, emotionally or socially or whatever. So first, we now have a principled way to decide which thousand tasks we want to train robots for: the tasks humans prefer to get help with. And with that in mind, we had to actually build virtual environments. We scanned, or otherwise acquired, 3D scenes from 50 different real-world environments, from restaurants to apartments to grocery stores to offices and so on. And then we acquired (this number is actually outdated) more than 10,000 3D object assets with many properties: articulation, deformability, and so on.
[00:58:45] And then we had to build a simulation environment. A lot of people have built simulation environments; let me just fast forward. Our particular simulation environment was a collaboration with Nvidia's Omniverse group, and we were going for a physically, perceptually, and interactively high-quality simulation environment, especially accounting for physical effects like thermal behavior, transparency, deformability, and so on. We also tested our BEHAVIOR environment against other environments in terms of perceptual realism, via a human user study. Here are some examples of physical interaction, such as cloth or liquids.
[00:59:43] So there's a lot of nuance that has gone into this work; let me just fast forward. These are some benchmarks we ran compared to other work. Okay, let me just fast forward. This is ongoing work in our lab, and because of it we are using BEHAVIOR to help us learn robotics, to push us to gather more interesting data, and even to use it for cognitive studies. Let me just fast forward. One thing I want to share with you, let me just share these numbers, is that today's algorithms still cannot do BEHAVIOR tasks. Of all these rows, the top row is what we wish robots could do: give them no privileged information; they have to be dropped into the environment and do these tasks.
[01:00:51] We benchmarked three BEHAVIOR tasks using today's robotic algorithms, and the performance is just zero. Once you start to give more privileged information, or make assumptions that simplify the task, like magic motion or perfect memory, things start to get better. So if you only look at the top row, you get pretty depressed about today's robots, but as a grad student, I hope you're inspired, because that means we have a lot of room to grow. Okay, these are just different papers from our lab. I'm going to fast forward, because I think we've talked enough about this.
[01:01:48] By the way, we're also building a digital twin of BEHAVIOR, in the digital environment as well as in the real-world environment, and that's a great way of testing real-to-sim transfer. Again, this is an unsolved problem, and there's a long way to go. In this particular case, we're showing you this robot, without speeding up the video (you can see how slow it is), trying to clean up this room. And, okay, hooray. These are some of the mistakes this robot makes: for example, it cannot pick up the bottle, or earlier it just went the wrong way and placed an object in the wrong place. So there are still a lot of mistakes. Okay, let me fast forward.
[01:02:54] We're also using this environment to study visually impaired patients, and it's a great way of putting patients in a safe environment to study. One last thing I want to show you is really super cool, and this is the last technical work I want to show: we are now also collaborating with psychologists and doctors to study how we can use brain waves to control robots. What you're seeing here is a demo where a grad student, I think, is wearing an EEG cap that is sending instructions, and the robotic arm is cooking a Japanese meal purely from thoughts. There are no invasive brain implants; this is from electrical signals. What we had to do is pre-train on these thoughts: you have to pre-train the robotic arm with, say, "lift" or "place" or "drop" or whatever. And once you do that, this is an entire meal cooked based on the brain waves.
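The lecture doesn't detail the decoding method, but one common non-invasive approach, assumed here purely for illustration, is to map a window of EEG features to the nearest pre-trained command pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
COMMANDS = ["lift", "place", "drop"]  # hypothetical pre-trained thought commands

# Calibration phase: per-command mean feature vectors ("centroids")
# that would be learned from labeled EEG windows.
centroids = {c: rng.normal(loc=i, scale=0.1, size=8) for i, c in enumerate(COMMANDS)}

def decode(window):
    """Nearest-centroid decoding of one EEG feature window into a robot command."""
    return min(COMMANDS, key=lambda c: np.linalg.norm(window - centroids[c]))

# A new window resembling the "place" pattern plus small sensor noise
window = centroids["place"] + rng.normal(scale=0.05, size=8)
print(decode(window))  # "place"
```

Each decoded command would then trigger one of the robot arm's pre-trained motion primitives, which matches the lecture's point that both the thoughts and the arm motions had to be trained in advance.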
[01:04:07] This is really sci-fi, and this happened last year. So I'm pretty excited by where all this is going: combining vision and perception and robotics, and also helping people in clinical settings. The future of this is helping severely paralyzed patients. [01:04:29] Okay. So the BEHAVIOR project is really aimed at augmenting people. It's a large-scale and diverse benchmark, and it has realistic and ecological physics and perception. [01:04:48] And the last take-home message is that we not only want to build AI to just do things or see things; we really want to build it to help people. AI being an augmentation tool, an enhancing tool for humanity, is very important, instead of a tool that replaces us.
================================================================================ LECTURE INDEX.md ================================================================================ CS231n – Deep Learning for Computer Vision Playlist: https://youtube.com/playlist?list=PLoROMvodv4rOmsNzYBMe0gJY2XS8AQg16 Total Videos: 18 Transcripts Downloaded: 18 Failed/No Captions: 0 --- Lectures 1. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 1: Introduction - Video: [https://www.youtube.com/watch?v=2fq9wYslV0A](https://www.youtube.com/watch?v=2fq9wYslV0A) - Transcript: [001_2fq9wYslV0A.md](001_2fq9wYslV0A.md) 2. Stanford CS231N | Spring 2025 | Lecture 2: Image Classification with Linear Classifiers - Video: [https://www.youtube.com/watch?v=pdqofxJeBN8](https://www.youtube.com/watch?v=pdqofxJeBN8) - Transcript: [002_pdqofxJeBN8.md](002_pdqofxJeBN8.md) 3. Stanford CS231N | Spring 2025 | Lecture 3: Regularization and Optimization - Video: [https://www.youtube.com/watch?v=dyNGd06MWn4](https://www.youtube.com/watch?v=dyNGd06MWn4) - Transcript: [003_dyNGd06MWn4.md](003_dyNGd06MWn4.md) 4. Stanford CS231N | Spring 2025 | Lecture 4: Neural Networks and Backpropagation - Video: [https://www.youtube.com/watch?v=25zD5qJHYsk](https://www.youtube.com/watch?v=25zD5qJHYsk) - Transcript: [004_25zD5qJHYsk.md](004_25zD5qJHYsk.md) 5. Stanford CS231N | Spring 2025 | Lecture 5: Image Classification with CNNs - Video: [https://www.youtube.com/watch?v=f3g1zGdxptI](https://www.youtube.com/watch?v=f3g1zGdxptI) - Transcript: [005_f3g1zGdxptI.md](005_f3g1zGdxptI.md) 6. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 6: CNN Architectures - Video: [https://www.youtube.com/watch?v=aVJy4O5TOk8](https://www.youtube.com/watch?v=aVJy4O5TOk8) - Transcript: [006_aVJy4O5TOk8.md](006_aVJy4O5TOk8.md) 7. 
Stanford CS231N | Spring 2025 | Lecture 7: Recurrent Neural Networks - Video: [https://www.youtube.com/watch?v=kG2lAPBF7zA](https://www.youtube.com/watch?v=kG2lAPBF7zA) - Transcript: [007_kG2lAPBF7zA.md](007_kG2lAPBF7zA.md) 8. Stanford CS231N | Spring 2025 | Lecture 8: Attention and Transformers - Video: [https://www.youtube.com/watch?v=RQowiOF_FvQ](https://www.youtube.com/watch?v=RQowiOF_FvQ) - Transcript: [008_RQowiOF_FvQ.md](008_RQowiOF_FvQ.md) 9. Stanford CS231N | Spring 2025 | Lecture 9: Object Detection, Image Segmentation, Visualizing - Video: [https://www.youtube.com/watch?v=PTypu6GqEd4](https://www.youtube.com/watch?v=PTypu6GqEd4) - Transcript: [009_PTypu6GqEd4.md](009_PTypu6GqEd4.md) 10. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 10: Video Understanding - Video: [https://www.youtube.com/watch?v=wElqklprhPE](https://www.youtube.com/watch?v=wElqklprhPE) - Transcript: [010_wElqklprhPE.md](010_wElqklprhPE.md) 11. Stanford CS231N | Spring 2025 | Lecture 11: Large Scale Distributed Training - Video: [https://www.youtube.com/watch?v=9MvD-XsowsE](https://www.youtube.com/watch?v=9MvD-XsowsE) - Transcript: [011_9MvD-XsowsE.md](011_9MvD-XsowsE.md) 12. Stanford CS231N | Spring 2025 | Lecture 12: Self-Supervised Learning - Video: [https://www.youtube.com/watch?v=4howBU7THbM](https://www.youtube.com/watch?v=4howBU7THbM) - Transcript: [012_4howBU7THbM.md](012_4howBU7THbM.md) 13. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 13: Generative Models 1 - Video: [https://www.youtube.com/watch?v=zbHXQRUNlH0](https://www.youtube.com/watch?v=zbHXQRUNlH0) - Transcript: [013_zbHXQRUNlH0.md](013_zbHXQRUNlH0.md) 14. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 14: Generative Models 2 - Video: [https://www.youtube.com/watch?v=Edr4uZFh4EE](https://www.youtube.com/watch?v=Edr4uZFh4EE) - Transcript: [014_Edr4uZFh4EE.md](014_Edr4uZFh4EE.md) 15.
Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 15: 3D Vision - Video: [https://www.youtube.com/watch?v=7lxrKDKtykM](https://www.youtube.com/watch?v=7lxrKDKtykM) - Transcript: [015_7lxrKDKtykM.md](015_7lxrKDKtykM.md) 16. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 16: Vision and Language - Video: [https://www.youtube.com/watch?v=mQOK0Mfyrkk](https://www.youtube.com/watch?v=mQOK0Mfyrkk) - Transcript: [016_mQOK0Mfyrkk.md](016_mQOK0Mfyrkk.md) 17. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 17: Robot Learning - Video: [https://www.youtube.com/watch?v=XSfmOH_xVSU](https://www.youtube.com/watch?v=XSfmOH_xVSU) - Transcript: [017_XSfmOH_xVSU.md](017_XSfmOH_xVSU.md) 18. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 18: Human-Centered AI - Video: [https://www.youtube.com/watch?v=g8UaBfj6Sh8](https://www.youtube.com/watch?v=g8UaBfj6Sh8) - Transcript: [018_g8UaBfj6Sh8.md](018_g8UaBfj6Sh8.md)