================================================================================
LECTURE 001
================================================================================
Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 1: Introduction
Source: https://www.youtube.com/watch?v=2fq9wYslV0A

--- Transcript

[00:00:05] This is CS231N, and I'm Professor Fei-Fei Li from the Computer Science department. I will be co-teaching this quarter with Professor Ehsan Adeli and my graduate student Zay; you'll meet them, as well as our wonderful TA team, later. So I just want to get started. This is what excites me: AI has become such an interdisciplinary field. What you're going to learn in this class is of course very technical, it's about computer vision and deep learning, but I really do hope that you take it to whichever discipline you work in and are passionate about, and apply it. We hear a lot about the field of AI. So how do we position computer vision and the scope of this class?
[00:01:00] If you consider AI as this big bubble, computer vision is very much an integral part of AI. Some of you have heard me say that not only is vision part of intelligence, it's a cornerstone of intelligence. Unlocking the mystery of visual intelligence is unlocking the mystery of intelligence. But one of the most important mathematical tools for solving AI is machine learning, or what some people call statistical machine learning, and that is exactly what we will be talking about. Within the field of machine learning, in the past ten-plus years we have seen a major revolution called deep learning, and I'll explain a little bit of what deep learning is. Deep learning is a set of algorithmic techniques built around a family of algorithms called neural networks.
[00:02:01] So if you ask me to pinpoint the scope of this class: we'll not be able to cover the entirety of computer vision, and we'll not be able to cover the entirety of machine learning or deep learning, but we're going to cover the core intersection of these two fields. And of course, just like the entirety of AI, computer vision is becoming more and more an interdisciplinary field. A lot of the techniques we use, as well as the problems we work on, intersect with many other fields, like natural language processing, speech recognition, and robotics. And AI as a whole is a field that intersects with mathematics, neuroscience, computer science, psychology, physics, and biology, and with many application areas, from medicine to law to education and business, and so on.
[00:02:54] So here is what you will get in this first lecture: I'll give a very brief history of computer vision and deep learning, and then Professor Adeli will go over the overview of this course, lay the groundwork for how the course is set up, and explain what our expectations are. So, you know, the history of vision did not begin when you were born, or when humanity was born. The history of vision began 540 million years ago. You might ask: what happened 540 million years ago? Why are we pinpointing a relatively specific date in evolution? Well, it's because many fossil studies have shown us there is a mysterious period called the Cambrian explosion, spanning about 10 million years of evolution. During that time, which is a very short period for evolution, we see an explosion of animal species in the fossil record.
[00:04:01] Which means that before the Cambrian explosion, life on Earth was pretty chill. It was actually all in the water: there were no animals on land yet, and animals just floated around. So what caused this explosion in animal speciation? There were many theories, from climate to the chemical composition of the ocean water. But one of the most compelling theories was the onset of eyes: the first animals, trilobites, gained photosensitive cells. The eyes we're talking about were not sophisticated lenses and retinas and nerve cells; it was literally a very simple pinhole, and that pinhole collected light. Once you collect light, life is completely different. Without senses, life is just metabolism: it's very passive, and you come and go.
[00:05:05] With senses, you become an integral part of the environment, one you might want to change, one you want to survive in. Some animals or plants become your dinner, and you become someone else's dinner. So evolutionary forces drive intelligence to evolve, because of the onset of senses: the onset of vision, along with haptics, or tactile sensing, which are the two oldest senses for animals. So that entire course of 540 million years of evolution of vision is the evolution of intelligence. Vision, as one of the primary senses of animals, drove the development of the nervous system and the development of intelligence. Almost all animals on Earth today that we know of have vision, or use vision as one of their primary senses. Humans are especially visual animals.
[00:06:11] More than half of our cortical cells are involved in visual processing, and we have a very complex and convoluted visual system. So this is what excited me to enter the field of vision, and I hope it excites you. Now let's fast forward from the Cambrian explosion to human civilization. Humans do innovate: not only do we see, we want to build machines that see. Here are a couple of drawings by, of course, Leonardo da Vinci, who was just forever curious about everything. He studied the camera obscura, thinking about how to make seeing machines. In fact, even way before him, in ancient Greece and in ancient China, we have documents of thinkers and philosophers thinking about how to project objects through pinholes to create images of those objects.
[00:07:24] And of course, in our modern life, cameras have truly exploded. But cameras are not enough for seeing, just like eyes are not enough for seeing. These are apparatus. We need to understand how visual intelligence happens, and that's really the crux of this course. So let's talk a little bit about the history that brought us to this intersection of deep learning and computer vision. Let me go back to the 1950s. In the 1950s, a set of critically important experiments happened in neuroscience: the study of the visual pathways of mammals, especially the seminal work by Hubel and Wiesel. They placed electrodes into live, anesthetized cats and studied the receptive fields of neurons in the primary visual cortex. What they learned, to their surprise, were two very important things.
[00:08:37] One is that the neurons responsible for seeing in the primary visual cortex have their own individual receptive fields. A receptive field means that for every neuron, there is a part of space that it actually sees. It's not all of space, and it's not very big; it tends to be a very confined patch of space, and within that patch the neuron sees specialized, simple patterns. When you're measuring from the early part of the visual pathway, and by and large in the primary visual cortex, which is around here in the back of the head, not near your eyes, those patterns are oriented edges, or moving oriented edges. So some neurons will be seeing an edge like this, some will be seeing an edge like this, or like this. And that's how vision, the computation in the brain, begins. The second thing they learned is that the visual pathway is hierarchical.
[00:09:44] As you move along the visual pathway, the neurons feed into other neurons, and the neurons in the higher layers, or deeper layers, of the visual hierarchy have more complex receptive fields. So if you begin with oriented edges, you might feed into a corner receptor, and that might feed into an object receptor. I'm overly simplifying, but the concept is that neurons feed into each other and create this big network of computation. Of course, most of you sitting here are already thinking that the way I've been describing this will have a profound impact on modeling, on the neural network modeling of visual algorithms. Let's keep going. That's the year 1959; these are very early studies of seeing.
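[Editor's note] The oriented-edge receptive fields described above can be sketched as tiny convolution filters, loosely analogous to what the first layer of a modern vision network learns. This is a minimal illustration, not anything from the lecture: the Sobel kernels are a classic 1980s-style edge detector chosen here as a stand-in, and the toy image is made up.

```python
# Sketch: oriented "edge detectors" as small convolution kernels
# (Sobel operators, used here as an illustrative assumption).
import numpy as np

# Each kernel responds strongly to edges of one orientation,
# loosely like one of the oriented receptive fields described above.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)  # vertical edges
sobel_y = sobel_x.T                            # horizontal edges

def convolve2d(image, kernel):
    """Valid-mode 2D convolution (strictly, cross-correlation): slide
    the kernel over the image and take elementwise products, no padding."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 image: dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 5))
img[:, 3:] = 1.0

gx = convolve2d(img, sobel_x)  # strong response along the vertical edge
gy = convolve2d(img, sobel_y)  # zero everywhere: no horizontal edges here
```

A bank of such kernels at different rotations is a crude model of a population of orientation-tuned neurons; feeding their outputs into further layers is what gives the more complex receptive fields of the deeper stages of the hierarchy.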
[00:10:50] By the way, about twenty-something years later, Hubel and Wiesel won the Nobel Prize in Medicine for uncovering these principles of visual processing. Another milestone in the early history of computer vision was the first PhD thesis in computer vision. Most people attribute it to Larry Roberts, who in 1963 wrote the first such thesis, studying shape. Shape is a very characteristic representation of the world, and the idea is: can we take a shape like this and understand its surfaces, its corners, and its features, the way humans intuitively do?
[00:12:39] So an entire PhD thesis was devoted to this, and that's the beginning of computer vision. Around that time, in 1966, an MIT professor created a summer project at MIT and asked to hire a few undergrads, very smart ones, to study vision. The goal was pretty much to solve computer vision, to solve vision, in one summer. Of course, just like in the rest of the history of AI, we tend to be over-optimistic about what we can do in a short period of time, so vision did not get solved that summer. In fact, it has blossomed into an incredible computer science field: our annual conferences now have more than 10,000 attendees every year. But the 1960s, between Larry Roberts's PhD thesis and this summer project, is what we in our field consider the beginning of the field of computer vision.
[00:12:54] A seminal book was written in the 1970s by David Marr, who unfortunately died too early. He wanted to study vision systematically and to consider how visual processing happens. Even though it is not explicitly stated, there is a lot of inspiration in it from neuroscience and cognitive science. He was thinking: if you take an input image, how do we visually process and understand that image? Maybe the first layer is more like edges, just like we saw; he calls that the primal sketch. Then there is a 2½-D sketch, which separates the different depths of the objects in the image. So the ball is the foreground object, and the floor here is the background.
[00:13:54] So he builds this 2½-D sketch, and then finally, David Marr believes, the grand holy grail, the victory of solving vision, is to know the entire, full 3D representation. And that is actually the hardest thing about vision. Let me digress for twenty seconds, because if you think about vision, for all animals it's an ill-posed problem. Ever since the early trilobites that collected light underwater, light, the world as photons, has been projected onto a more or less 2D surface. At that time it was just some patch in the animal; right now, for us, it's a retina. But the actual world is 3D. So recovering 3D information, the entire 3D world, from 2D images is the fundamental problem that nature had to solve and that computer vision has to solve, and mathematically it's an ill-posed problem. So what did nature do?
[00:15:17] Anybody have a wild guess? Yes, nature. The trick that nature used is to develop multiple eyes, mostly two; some animals have more than two. And then you triangulate information. But two eyes are not enough: you actually have to understand correspondences and all that. We'll touch on some of these topics, but there are other computer vision classes that Stanford offers that specifically cover 3D vision. The point is, it's a very hard problem, and we have to solve it. Nature has solved it; humans have solved it, but not to extreme precision. In fact, humans are not that precise: I roughly know the 3D shapes around me, but I don't have geometric precision for all of them. So that's one thing to consider, to appreciate how hard this problem is.
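[Editor's note] The two-eye triangulation trick can be made concrete with the textbook rectified-stereo geometry: a nearby point shifts more between the two views than a distant one, and that shift (the disparity) determines depth. The focal length, baseline, and disparity values below are made-up numbers for illustration, not from the lecture.

```python
# Sketch: depth from two views by triangulation (rectified stereo).
# For two parallel cameras, Z = f * B / d, where f is the focal length
# in pixels, B the distance between the cameras in meters, and d the
# horizontal shift (disparity, in pixels) of the point between images.

def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth (meters) of a point seen by both cameras."""
    if disparity_px <= 0:
        raise ValueError("zero disparity means the point is at infinity")
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: a roughly eye-like 6 cm baseline.
near = depth_from_disparity(focal_px=700.0, baseline_m=0.06, disparity_px=42.0)  # about 1 m
far = depth_from_disparity(focal_px=700.0, baseline_m=0.06, disparity_px=3.0)    # about 14 m
```

This is only the easy part of the problem: before you can apply the formula, you have to know which pixel in the left image corresponds to which pixel in the right image, and that correspondence problem is where most of the difficulty lives.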
[00:16:13] Another thing that is very different between computer vision and language is actually something philosophically subtle. Language doesn't exist in nature: you cannot point to something and say "there's language." Language is a purely generated thing, and I don't even know what word to use; it comes out of our brains. It's generated. It's 1D. It's sequential. This actually has a profound implication for the latest wave of generative AI algorithms: this is why LLMs, which are outside the scope of this class, are so powerful, because we can model language that way. But vision is not generated. There is actually a physical world out there, respecting the laws of physics and materials and all that. So vision has very different tasks.
[00:17:14] So I just want you to appreciate the difference between language and vision, and, frankly, to appreciate how nature solved this problem. Okay, let's keep going: the 1970s. The early pioneers of computer vision, without data, without much in the way of powerful computers, and without the mathematical advances we have today, were already beginning to attack some of the harder problems of computer vision, for example, the recognition of objects. Here at Stanford, one of the pioneering works is called generalized cylinders, by Rodney Brooks and Tom Binford. And ironically, Rodney Brooks is on campus today, somewhere over there, giving a talk at a robotics conference. He went on to become one of the greatest roboticists of our time and co-founded the company behind the Roomba and many other robots.
[00:18:16] And then, not very far from us, in another part of Palo Alto, researchers worked on similarly compositional models of the human body and of objects. Then, in the 1980s, digital photos started to appear, or at least photos that people could digitize a little, and there was some great work on edge detection. You look at all this and it probably feels a little disappointing, right? It seems kind of trivial to get some sketches and edges, and not really going anywhere, if that's how vision worked at the time. And in fact, you're not so wrong. That was around the time, before many of you were born, that we entered the AI winter. The field entered an AI winter because the enthusiasm, and hence the funding, for AI research had really dwindled. A lot of things didn't deliver.
Computer vision didn't deliver, expert systems didn't deliver, robotics didn't deliver. But under the hood of this winter, a lot of research started to grow in fields like computer vision, NLP, and robotics. [00:19:39] So let's also look at another strand of research that had profound implications for computer vision: cognitive science and neuroscience continued to blossom. What is really important, especially for the field of computer vision, is that cognitive science and neuroscience started to point us to the north-star problems we should work on. For example, psychologists told us there is something special about seeing nature, seeing the real world. This is a study by Biederman, who showed that the detection of bicycles in two images differs depending on whether the images are scrambled or not. Think about it from the photons' point of view.
These two bicycles land in the same location on your retina, but somehow the rest of the image affects whether the viewer sees the target object. So something is telling us that seeing the entire forest, the entire world, shapes the way we see objects. It also tells us that visual processing is very fast. [00:20:50] Here's another, more direct measure of how fast we detect objects. This is an early-1970s experiment in which people are shown a video, and the subject's task is to detect the human in one of the frames. I suppose every one of you has seen that human in one of the frames. But think about how remarkable your eyes, or your brain, are: you had never seen this video, and I didn't tell you in which frame the target object would appear.
I did not tell you what the target object would look like, where it would be, its pose, or anything like that. Yet you have no problem detecting the humans. On top of that, these frames are played at 10 hertz, which means you see every frame for only 100 milliseconds. That is how remarkable our visual system is. [00:21:48] In fact, Simon Thorpe, another cognitive neuroscientist, measured this speed. You hook people up with EEG caps, show them hundreds of complex natural images, ask the subjects to categorize them as containing animals versus not containing animals, and measure the brain waves. It turned out that within 150 milliseconds of seeing a photo, your brain already carries a differential signal that categorizes the image.
You might not be so impressed, because compared to today's GPUs and modern chips, 150 milliseconds is orders of magnitude slower. But you have to admire our wetware: our brains and neurons don't work as fast as transistors, and 150 milliseconds is actually really fast. It's only a few hops through the brain in terms of neural processing. So yet again, this tells us humans are really good at seeing objects and categorizing them. In fact, not only are we good at seeing and categorizing objects, we even developed specialized brain areas with expert ability at recognizing faces, or places, or body parts. [00:23:15] These are discoveries by MIT neurophysiologists in the 1990s and the early 21st century. So all these studies tell us that we should not just be studying character shapes or the sketches of images.
We really should go after important, fundamental problems that drive visual intelligence. And one of those problems, the one everything has been pointing us to, is object recognition: object recognition in natural settings. There are a lot of objects out there in the world, and studying this was going to be part of unlocking visual intelligence. And that's what we did as a field. [00:24:03] We started by looking at how we can separate foreground objects from background, which was called recognition by grouping, in the 1990s. Keep in mind, we were still in the AI winter, but research was happening and progressing. Then there were studies of features, some of you might still remember SIFT features and matching, and when I entered grad school, the most exciting thing was face detection.
I remember that in my first year of grad school this paper was published, and five years later the first digital cameras used this paper's algorithm to deliver automatic face focus, thanks to face detection. So things started to work and to be taken up by industry. [00:24:58] And then, around the early 21st century, a very important thing happened: the internet happened. When the internet happened, data started to proliferate, and the combination of digital cameras and the internet began to give the field of computer vision some data to work with. In those early days we were working with thousands, or tens of thousands, of images to study the visual recognition problem, the object recognition problem. So you've got datasets like the PASCAL Visual Object Classes challenge or Caltech 101. I'm going to pause here.
This is where the first thread of computer vision started to progress, and you might be wondering why I'm pausing: because I'm going to come back and talk about deep learning. [00:26:00] So while the field of vision was progressing, from neurophysiology to computer vision, to cognitive neuroscience, to computer vision again, a separate effort was going on in parallel that eventually became deep learning. It started from early studies of neural networks, things like the perceptron, and people like Rumelhart started to work on them; and of course Jeff Hinton, in his early days, started to work with small numbers of artificial neurons, looking at how they could process information and learn.
And you've heard of great minds like Marvin Minsky and his colleagues working on different aspects of these perceptrons. But Marvin Minsky also said that perceptrons cannot learn the XOR logic function, and that caused a bit of a setback for neural networks. Well, things continued to progress despite the setback. [00:27:17] One of the most important works before the first inflection point is the Neocognitron, by Fukushima in Japan. Fukushima hand-designed a neural network that looks like this. It has about five or six layers, and he designed the different functions across the layers, which you will learn more about, more or less inspired by the visual pathway I was describing.
Remember the cat experiment, going from simple receptive fields to more complicated receptive fields? He was doing that here: the early layers have simple functions, and the later layers have more complex functions. For the simple ones he used the convolution function, and the more complex ones pooled the information from the convolution layers. So the Neocognitron was really an engineering feat, because every parameter was hand-designed; there are hundreds of parameters he had to meticulously put together so that this small neural network could recognize digits or letters. [00:28:36] The real breakthrough came around that time, in 1986: a learning rule. That learning rule is called backpropagation.
It's going to be one of our first classes. Rumelhart, Jeff Hinton, and their colleagues took the neural network architecture and introduced an error-correcting objective function: if you put in some input and know what the correct output is, you take the difference between what the network outputs and the actual correct answer, then propagate that information back so that you can improve the parameters along the network. That propagation from the output back through the entire network is called backpropagation. It follows basic calculus chain rules, and it was a watershed moment for neural network algorithms.
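The error-correcting loop just described can be sketched in a few lines of NumPy. This is my own illustrative code, not anything from the lecture: a tiny two-layer network trained by backpropagation, and, as a nod to Minsky's objection above, the target is XOR, which a single-layer perceptron cannot learn but one hidden layer can.

```python
import numpy as np

# Illustrative sketch (not from the lecture): backpropagation on a tiny
# two-layer network. The target is XOR, the function Minsky noted a
# single-layer perceptron cannot learn; a hidden layer makes it learnable.

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)   # input  -> hidden
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(10_000):
    # Forward pass: compute the network's output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Error-correcting objective: the difference between the output and the
    # correct answer, propagated backward with the chain rule.
    d_out = (out - y) * out * (1 - out)          # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)           # gradient at the hidden layer
    # Improve every parameter along the network.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print((out > 0.5).astype(int).ravel())           # thresholded XOR predictions
```

The hidden-unit count, learning rate, and squared-error objective here are arbitrary choices for the sketch; the point is only the forward pass, the error at the output, and the chain-rule propagation back to every weight.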
Of course, we were still smack in the middle of the AI winter; all this work was happening without public fanfare, but in the world of research these were very important milestones. [00:30:01] One of the earliest applications of neural networks with backpropagation is Yann LeCun's convolutional neural network, built in the 1990s when he was working at Bell Labs. What he did was create a slightly bigger network, about seven layers, and make it good enough, with great engineering, to recognize letters; it was actually shipped to some US post offices and banks to read digits and letters. So that was an application of early neural networks. Jeff Hinton and Yann LeCun continued to work on neural networks.
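To make the two layer types behind the Neocognitron and LeCun's network concrete, here is a rough sketch of my own (made-up image and filter, max pooling rather than any specific historical subsampling scheme): a convolution pass that slides a small filter over an image, followed by a pooling pass that summarizes each local neighborhood.

```python
import numpy as np

# Rough sketch (my own illustration) of the two layer types described above:
# a convolution layer slides a small filter over the image, and a pooling
# layer summarizes each local neighborhood of the resulting feature map.

def conv2d(image, kernel):
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * size:(i + 1) * size,
                             j * size:(j + 1) * size].max()
    return out

image = np.zeros((8, 8))
image[:, 4] = 1.0                          # a vertical stroke
edge_filter = np.array([[-1.0, 1.0]])      # made-up filter that fires on vertical edges
features = max_pool(conv2d(image, edge_filter))
print(features.shape)                      # a smaller, more abstract map: (4, 3)
```

Stacking such stages, simple filters early, pooled combinations later, is the "simple to complex receptive field" idea from the visual pathway, realized in code.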
It didn't go very far, because despite these improvements and tweaks, neural networks more or less stalled. They collected a big dataset of digits and letters, and digit and letter recognition was kind of quasi-solved. But if you put the system to work on the kind of digital photos the neuroscientists were using, recognizing cats and dogs and microwaves and chairs and flowers, it just didn't work. [00:31:18] A huge part of the problem was the lack of data. And lack of data is not just an inconvenience; it's actually a mathematical problem, because these are high-capacity algorithms that need to be driven by lots of data in order to learn to generalize. There are deep mathematical principles behind these rules of generalization and model overfitting.
And data was underappreciated, overlooked, because most people were just looking at the architectures. They did not realize that data is a first-class citizen of machine learning and deep learning. [00:32:05] So this is part of the work that my lab, my students and I, did in the early 2000s: we recognized this importance of data. We hypothesized that the whole field was missing this, underappreciating the importance of data. So we went out and collected a huge dataset called ImageNet, which has 15 million images after cleaning a billion images, and these 15 million images were sorted across 22,000 categories of objects.
We actually studied a lot of the cognitive science and psychology literature to appreciate that 22,000 categories is roughly on the order of the number of categories humans learn to recognize in the early years of their lives. [00:33:01] We then open-sourced this dataset and created an ImageNet challenge called the Large Scale Visual Recognition Challenge. We curated a subset of ImageNet, a million-plus images and a thousand object classes, and ran an international object recognition challenge for many years. We asked researchers to participate, and their goal was to create algorithms, it didn't matter which kind of algorithms.
Then we would test each algorithm's ability to recognize photos and see if it could call out these thousand object classes as correctly as possible. And here are the errors: the first year we ran this competition, the best-performing algorithm's error was nearly 30%, which is pretty abysmal, because humans can perform at, say, under 3% error. 2011 wasn't that exciting either, but something happened in 2012, the most exciting year. [00:34:14] That year, Jeff Hinton and his students participated in this challenge using a convolutional neural network, and they reduced the error by almost half, truly showing the power of deep learning algorithms. The participating algorithm in the 2012 ImageNet challenge was called AlexNet.
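For concreteness, the headline metric of the ImageNet challenge was top-5 error: an image counts as correct when the true class appears among the model's five highest-scoring classes. Here is a small sketch of that scoring (my own code, with random made-up scores, not the challenge's evaluation server):

```python
import numpy as np

# Sketch (my own) of the ImageNet challenge's headline metric, top-5 error:
# an image is correct if the true class is among the model's five
# highest-scoring classes. The scores below are random noise, for illustration.

def top5_error(scores, labels):
    """scores: (n_images, n_classes); labels: (n_images,) true class ids."""
    top5 = np.argsort(scores, axis=1)[:, -5:]       # five best guesses per image
    hits = (top5 == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

rng = np.random.default_rng(0)
scores = rng.normal(size=(2000, 1000))              # 2000 images, 1000 classes
labels = rng.integers(0, 1000, size=2000)
print(top5_error(scores, labels))                   # random guessing: about 0.995
```

With 1,000 classes, random guessing lands the true class in the top five about 0.5% of the time, which is why the drop from roughly 30% error to roughly half of that in 2012 was such a striking signal.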
And the funny thing is, if you look at AlexNet, it's not that different from Fukushima's Neocognitron of 32 years earlier. But two major things happened in between. One is that backpropagation happened: a principled, mathematically rigorous learning rule, so that you never have to hand-tune parameters, and that was a major theoretical breakthrough. [00:35:10] The other breakthrough was data: the recognition of data, and the understanding that data drives these high-capacity models, which would eventually have trillions of parameters but at that time had millions, was critical for setting off deep learning, for making this work.
And really, many people consider the year 2012, and the AlexNet algorithm that won the ImageNet challenge, the historical moment of the birth, or rebirth, of modern AI, the birth of the deep learning revolution. [00:35:56] And of course, the reason many of you are here is that since then we have been in an era of deep learning explosion. If you look at computer vision's main annual research conference, CVPR, the number of papers has exploded; arXiv papers have exploded; and many new algorithms have since been invented to participate in the ImageNet challenge in the following years. We're going to study some of these algorithms, but the point is that some of them, beyond AlexNet, have had a profound impact on the progress of the field of computer vision and on its applications.
So a lot of things have happened, and we're going to cover some of them. Not only did the field of computer vision make major progress in creating algorithms to recognize everyday objects like cats, dogs, and chairs; very quickly after the 2012 ImageNet moment, we got algorithms that can recognize much more complicated images, retrieve images, do multiple-object detection, and do image segmentation. These are all different tasks in visual recognition that you'll find yourself getting familiar with throughout this course, because vision is not just calling out cats and dogs; there is so much in the nuanced ability of visual recognition. [00:37:51] And of course, vision is not just static images, so there is work in video classification and human activity recognition.
[00:38:02] I'm showing you this overview; you will learn some of these. You don't have to understand exactly what's going on here, but I want you to appreciate the variety of vision tasks. Medical imaging: those of you who come from a medical field know that medicine, whether radiology, pathology, or other specialties, is deeply visual, and this has a profound impact. Scientific discovery, too: even the seminal picture you probably remember, the first photograph of a black hole, uses a lot of computer vision and computational photography techniques.
[00:38:50] Of course, computer vision has also contributed a lot to applications in sustainability and the environment. And we have made a lot of progress in image captioning right after the 2012 ImageNet moment; this is actually work by Andrej Karpathy when he was my student, his thesis work. Then we also worked on relationship understanding: visual intelligence is not only about seeing what's in the pixels, you also see what's beyond the pixels, including relationships between objects. And also style transfer: Justin Johnson, who will come to guest lecture in this course, will tell you all about his seminal work in style transfer.
[00:39:50] And of course, in the generative AI era we get these really incredible results, like face generation, and the very early days of image generation with DALL-E. I think this is the early DALL-E; of course, Midjourney and everything since have gone beyond these avocado and peach chairs. But really, we are squarely in the most exciting modern era of the AI explosion. The three converging forces of computation, algorithms, and data have taken this field to a whole different level, where we're now totally out of the AI winter. I would say we're in an AI global warming period, and I don't see any of this slowing down.
[00:40:50] For both good and bad reasons. And just a word, because we are in Silicon Valley, in the very Huang building, in the NVIDIA lecture hall: we also cannot ignore the progress of hardware and the role it has played. Here is the FLOPS-per-dollar graph for NVIDIA's GPUs. Before 2020 the progress was steady, but as soon as deep learning started to drive these GPUs and chips, the gigaflops just completely took off. By any measure, we are on an accelerating curve of lots of compute as well as lots of AI. And these are different graphs showing conference attendance, startups, and enterprise applications in AI; across not just computer vision but also NLP and other fields, they have just exploded. Okay.
[00:42:06] So quickly, last but not least: it's been exciting, and there have been a lot of successes, but there is still a lot to be done in computer vision. This problem is still not totally solved, and with great tools come great consequences as well, right? Computer vision can do a lot of good, but it can also do harm. For example, human bias. Every large AI algorithm today is driven by data, and data is an artifact of human activities on earth and in history. A lot of that data carries our biases, and this gets carried into AI systems. We have seen a lot of face recognition algorithms exhibit the same kinds of bias that humans have. And we have to recognize that we can also use AI to impact human lives: some for the good, think about medical imaging, but some are questionable.
[00:43:10] What if AI were solely behind deciding your job, or deciding your financial loans? So again: is it totally bad? Is it totally good? These are very complicated issues. This is also why I always get so excited when students from H&S or the law school, the education school, or the business school attend my class, because not all AI issues are engineering issues; we have a lot of human factors and societal issues to solve. I'm also particularly excited by AI's uses in medicine and healthcare. This is something really dear to my heart: Professor Adeli and Zay, who are also co-instructors of this course, and I work on AI for the aging population as well as for patients, trying to use computer vision to deliver care to people. So this is a good use. And also, even in terms of technology, human vision is remarkable.
[00:44:11] I want you to come out of not only today's class but this entire course appreciating that, despite how much computer vision can do, there is just so much more nuance, subtlety, richness, complexity, and also emotion in human vision. Look at these kids studying whatever their curiosity leads them to, or the humor in this image. There is still a lot that computer vision cannot do, and I hope that continues to entice you to study computer vision. At this point, I'm going to give the podium to Professor Adeli to go over the rest of the class. Thank you.

[00:44:50] Awesome. Thank you, Fei-Fei. A great start to the quarter, and I hope my microphone is working right now. Okay, good, I'm seeing some nodding of heads. All right.
[00:45:14] So, very excited to be here with you all, and I'm hoping that you will have a fun and challenging course with the amazing list of co-instructors that we have, and great TAs. In this class we are going to cover a wide variety of topics around computer vision and the use of deep learning in this space, organized into four different topics. We'll start with deep learning basics, and let's actually start with a simple question: what is computer vision, really? At its core, it's about enabling machines to see and understand images. The most fundamental task in this space is image classification: you give the model an image, say of a cat, and the model should output the label "cat", and that's it.
[00:46:28] But this deceptively simple task is the foundation for much more complex applications, from self-driving to medical diagnosis and so on. So how do we teach a machine to do this? One of the simplest approaches is to use linear classification, as you can see in this slide. Imagine each image in our dataset is shown as a dot in that space, and each axis shows some feature derived from the image itself. Here we are showing a 2D space for simplicity, but the task of a linear classifier is to find the hyperplane, the linear function, that separates the two classes, say cats from dogs. But we all know that these linear models often only go so far; they struggle when the data isn't cleanly separable by a straight line. So the question is: what's next?
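As an aside, the linear classifier just described fits in a few lines. This is only an illustrative sketch, not the course's assignment code: the 2D "cat"/"dog" features below are made up, and the perceptron-style update is just one of several ways to fit the separating hyperplane.

```python
# Sketch of a 2D linear classifier trained with perceptron-style updates.
# Toy data and feature values are invented for illustration.

def train_linear(data, epochs=20, lr=0.1):
    """data: list of ((x1, x2), label) pairs with label in {-1, +1}."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            score = w[0] * x1 + w[1] * x2 + b
            if y * score <= 0:          # misclassified: nudge the hyperplane
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

# Toy linearly separable "cats" (+1) vs "dogs" (-1) in a 2D feature space.
toy = [((2.0, 2.5), 1), ((1.5, 3.0), 1), ((3.0, 2.0), 1),
       ((-2.0, -1.5), -1), ((-1.0, -2.5), -1), ((-2.5, -2.0), -1)]
w, b = train_linear(toy)
```

On this toy data the learned hyperplane separates the two clusters cleanly; when the data is not linearly separable, updates like this never settle, which is exactly the limitation just described.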
[00:47:42] We'll get into the topics of how to model more complex patterns, and when we do so, we often face the challenges of overfitting and underfitting, which are topics we will cover in the early lectures of the class. To strike the right balance, we use techniques like regularization, to control model complexity, and optimization, to find the best-fitting parameters. These are the nuts and bolts of deep learning: training models that not only fit the data but also generalize to unseen, new data. And now comes the fun part: neural networks. We've been talking about them quite a lot.
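As a tiny preview of those networks: here is a hand-built two-layer model. This is an illustrative sketch, not course code, and the weights are hand-picked rather than trained; the point is that a nonlinearity like ReLU lets stacked layers compute something no single linear function can, in this case XOR.

```python
# Sketch: a two-layer network with hand-picked (untrained) weights.
# Without the ReLU nonlinearity, the two layers would collapse into one
# linear map; with it, the network computes XOR, which no linear
# classifier can represent.

def relu(x):
    return max(0.0, x)

def tiny_net(x1, x2):
    h1 = relu(x1 + x2)        # hidden unit 1
    h2 = relu(x1 + x2 - 1.0)  # hidden unit 2
    return h1 - 2.0 * h2      # output score

# tiny_net(0, 0) and tiny_net(1, 1) give 0; tiny_net(0, 1) and tiny_net(1, 0) give 1.
```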
[00:48:44] What neural networks do, unlike linear classifiers, is stack multiple layers of operations to model nonlinear functions, to solve the same problem of image classification and so on. These are the models powering everything from Google Photos to, as everybody is now familiar with, ChatGPT's vision models. In this course we will go deep into the details of how they work and how they are trained, and we will look into how to debug and improve them. After the deep learning basics, we will cover the topics of perceiving and understanding the visual world, which is a complex process that involves interpreting a vast array of visual information. To do so, we often first define tasks, which refer to the specific challenges or problems we aim to solve.
[00:50:01] Some examples are object detection, scene understanding, motion detection, and so on. To solve these tasks, we use different models: computational and theoretical frameworks we develop to mimic or explain how our visual system accomplishes these tasks. One example of these types of models is neural networks. By aligning models with tasks, we can create systems that can see and interpret the world around us. Speaking of tasks, let's go back to the topic of image classification: predicting a single label for an entire image. But we know that real-world computer vision is much richer than this, so let's walk through some of the tasks that go beyond classification. First, semantic segmentation, where we are not just labeling one object or the entire image as cat or tree or whatever.
[00:51:22] Here we are looking for a label for every single pixel in the image, so every pixel is grass, cat, tree, or sky; but we don't distinguish between individual objects. Next we have object detection, where we now want not only to say what is in the image but also to pinpoint the locations; that's why we create bounding boxes around the objects and associate them with specific labels. And finally we have instance segmentation, which we will also go into; it is the most granular of them all. It combines the ideas of detection and segmentation, and every object instance gets its own mask. These tasks require much deeper spatial understanding of images, and they push models to do more than just recognizing categories.
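To make the detection task concrete: predicted boxes are usually compared with ground truth using intersection over union (IoU). The sketch below is illustrative only; boxes are (x1, y1, x2, y2) corner coordinates, and the example values are made up.

```python
# Sketch: intersection over union (IoU), the standard overlap score for
# evaluating object-detection bounding boxes. Boxes are (x1, y1, x2, y2).

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area (0 if none)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

An IoU of 1 means a perfect match and 0 means no overlap; detection benchmarks typically count a prediction as correct above a threshold such as 0.5.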
[00:52:29] The complexity doesn't stop with static images; let's look at some temporal dimensions. There's the task of video classification, as Fei-Fei talked about, where we want to understand what's happening in the video: is someone running, jumping, or dancing? There is the topic of multimodal video understanding, which combines vision with sound and other modalities. For example, in this clip the person is playing a vibraphone; to really understand what's happening, we have to create a blend of visual features and audio features. And finally there is the topic of visualization and understanding, which we will be covering in this class, where we want to interpret what's being learned by the models and see an attention frame, or attention map, of what the model is
attending to in order to produce a correct classification, and so on.

[00:53:37] And then, beyond tasks, we have models. The very first topic I'll introduce that we'll be covering is convolutional neural networks, or CNNs. There are a number of operations whose details we will go over in the class: starting from an image, a number of convolution, sampling, and fully connected operations, and finally producing the output. Beyond convolutional neural networks, we will study recurrent neural networks for sequential data, and even newer architectures such as transformers and attention-based frameworks. Next, we will be covering some large-scale distributed training topics, which is new this quarter. I'm sure you've all heard about large language models, large vision models, and so on.
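Looking back at the convolution operation just mentioned: at its core, it is a small filter slid across the image, multiplying and summing at each position. Below is a deliberately naive pure-Python sketch (valid mode only: no padding, stride, or channels), for illustration; real CNN layers apply many learned filters at once.

```python
# Sketch: 2D convolution as used in CNNs, written out naively.
# image and kernel are lists of lists (rows of numbers).

def conv2d(image, kernel):
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):            # slide vertically
        row = []
        for j in range(iw - kw + 1):        # slide horizontally
            s = 0.0
            for di in range(kh):            # multiply-and-sum the window
                for dj in range(kw):
                    s += image[i + di][j + dj] * kernel[di][dj]
            row.append(s)
        out.append(row)
    return out
```

For example, a 2x2 all-ones kernel over a 3x3 image simply sums each 2x2 neighborhood, producing a 2x2 output.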
[00:54:44] And we will briefly discuss how these large models are actually trained. We know that data and datasets are expanding, and models are becoming larger and larger. In order to train such models, there are strategies, for example data parallelism and model parallelism, that we'll cover in this class. But beyond that, there are many challenges, such as synchronization between these models and workers and so on, as well as several other aspects that we'll cover in one of the lectures this quarter; we will also go over some of the trends for training these large models. After completing this topic, what we'll do next is look into generative and interactive visual intelligence, where we will first start with self-supervised learning.
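Before that, here is the gist of data parallelism in miniature. This is an illustrative sketch, not a real distributed setup: the two "workers" run sequentially, the model is a single weight fit to a toy y = 2x dataset, and the averaging step stands in for the all-reduce synchronization real systems perform across devices.

```python
# Sketch: data-parallel training, simulated sequentially.
# Each "worker" computes a gradient on its own shard of the batch; the
# gradients are then averaged (a stand-in for all-reduce) and one shared
# weight is updated. Toy model: y = w * x, squared-error loss.

def local_gradient(w, shard):
    # Mean over the shard of d/dw (w*x - y)^2 = 2*(w*x - y)*x.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    return sum(grads) / len(grads)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
shards = [data[:2], data[2:]]        # two workers, two examples each

w = 0.0
for _ in range(100):
    grads = [local_gradient(w, s) for s in shards]
    w -= 0.05 * all_reduce_mean(grads)   # synchronized update
# w converges to 2.0, the true slope.
```

At scale, making that gradient exchange efficient is exactly the kind of synchronization challenge mentioned above.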
[00:55:54] Self-supervised learning is a branch of machine learning in which models learn to understand and represent data by getting training signals from the data itself. We will cover this topic; it is one of the approaches that has enabled the training of large-scale models using vast amounts of data that do not require labels, unlabeled data, and it has played a key role in recent breakthroughs in computer vision in general. And we will talk a little bit about generative models. They go beyond recognition; they actually generate. This is an example where the content of a Stanford campus photo is reimagined in the style of Van Gogh's Starry Night. This is known as style transfer, a classic application of neural generative techniques. Generative models can now translate language into images given a prompt.
A model like DALL-E 2 generates an entirely novel image. This showcases how generative vision models blend understanding, creativity, and control in their generations. And you've probably heard recently about diffusion models in general; that's another thing we'll be covering this quarter. They basically learn to reverse a gradual noising process to generate images. Interestingly, in assignment three you will actually implement a generative model that generates emojis from text prompts, for example "a face with a cowboy hat", denoised from pure noise. Vision-language models are the next topic of interest we will be covering.
[00:58:13] They connect text and images in a shared representation space, and given a caption or an image, the model retrieves or generates its corresponding pair, as you can see. So there are a lot of advances in this area, and we'll be covering some of the key examples. Again, this is a key task for cross-modal retrieval or understanding, visual question answering, and so on. We'll get to that in the class too. Moving beyond 2D, models can now reconstruct and generate 3D representations from images. Here you can see some voxel-based reconstructions, shape completion, and even 3D object detection from single-view images. So 3D vision enables more spatially grounded understanding, which is crucial for robotics and AR/VR applications. And finally, vision empowers embodied agents that act in the physical world.
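A toy sketch of retrieval in a shared embedding space: given already-computed embedding vectors (the 3-D vectors below are made-up stand-ins for real model outputs), cosine similarity picks the best-matching pair:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate closest to the query in embedding space."""
    sims = [cosine(query_emb, c) for c in candidate_embs]
    return int(np.argmax(sims))

# Toy 3-D embeddings standing in for real image/text encoder outputs.
caption_emb = np.array([1.0, 0.0, 0.2])
image_embs = [np.array([0.0, 1.0, 0.0]),   # unrelated image
              np.array([0.9, 0.1, 0.3])]   # matching image
best = retrieve(caption_emb, image_embs)   # -> 1, the matching image
```

Real vision-language models do the hard part, learning encoders that place matching images and captions near each other; the retrieval step itself is this simple.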
[00:59:35] These models often must perceive, plan, and execute, whether it's cleaning up a messy room or generalizing from human demonstrations. So with all of these, we will be covering different topics around generative and interactive visual intelligence, and finally we will cover some human-centered applications and implications, as was very nicely explained. Computer vision, and AI in general, have been having a lot of impact in the past years, and it's very important to understand the human-centered aspects and applications. Some of these impacts are reflected by the awards that have gone to researchers in this space. This was first recognized by the 2018 Turing Award, which is the most prestigious technical award, given for major contributions of lasting importance to computing.
[01:00:50] Geoffrey Hinton, Yoshua Bengio, and Yann LeCun received the award for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing. Beyond that, last year, in 2024, Geoffrey Hinton was jointly awarded the Nobel Prize in Physics alongside John Hopfield for their foundational contributions to neural networks. And finally, I want to very briefly mention the learning objectives for this class: formalizing computer vision applications into tasks, as you can see in some of the details here; developing and training vision models, models that operate on images and visual data, images, videos, and so on; and gaining an understanding of where the field is and where it is headed. That's why we have some new topics covered specifically this year.
[01:02:00] So for the four topics that I mentioned earlier, we will be going over the basics in the very first few weeks. Bear with us, because these are important topics and you need to understand the details first, how to build the models from scratch, and then we'll get to the more interesting, exciting topics of the day in computer vision. And finally, we will have one big lecture on human-centered AI and computer vision. I want to just leave you with what we'll be covering next session: that's going to be image classification and linear classifiers, which will get us started with the world of CS231N. Thank you.

================================================================================
LECTURE 002
================================================================================
Stanford CS231N | Spring 2025 | Lecture 2: Image Classification with Linear Classifiers
Source: https://www.youtube.com/watch?v=pdqofxJeBN8
---
Transcript

[00:00:05] We will be talking today about image classification.
[00:00:14] Basically, continuing our discussion on the topic of image classification from the last lecture, we'll get a little bit into some topics that get us closer to neural networks and ultimately convolutional neural networks and so on. We'll start with linear classifiers. Moving to the next slide: this was the syllabus that we talked about in the previous lecture, where we did talk about three major categories of topics: deep learning basics, perceiving and understanding the visual world, and reconstructing and interacting with the visual world, as the three major topics, along with some subtopics that we will be covering in the class. At the end, we will have some discussions around the human-centered AI aspects. And today the goal is to cover the first three items: data-driven approaches.
[00:01:32] I will try to tell you what this means, and linear classification, as well as the k-nearest-neighbor algorithm. So, like the previous lecture, let's start with our core task of image classification. Again, it's a core task in computer vision, and we actually come back to this task quite often throughout the quarter, because it's a very good benchmark and we have some examples to show how the algorithms work. So this is one of the items that we come back to quite often. We want to define the image classification task today and then introduce two of the data-driven approaches for image classification: one of them the nearest neighbor, and the other one the linear classifier. There are some other approaches, which we have listed in our backup slides, and you're welcome to look at them after the class.
[00:02:50] But this is what we will be covering. So what is image classification? Given an image and a set of predefined labels, such as in this example, dog, cat, truck, plane, and so on, the job of the system is to assign one of those labels to this image. To us this is actually a very easy task, because our brains, our cognitive system, are wired to get a holistic understanding of this image and assign a label to it. But when it comes to coding this, and looking at how a computer can make sense of this image, that's a completely different story, and we want to see how machines can make sense of such data.
[00:03:54] So images are often defined by matrices of data, or more generally, tensors of data, and often each of the pixel values is between 0 and 255, which is an 8-bit data structure. And since this is a colored image, assuming that it has a resolution of 800 by 600, since it's an RGB image it has three channels: red, green, and blue. And therefore it's a tensor of 800 by 600 by 3, as you can see on the slide. So, as you can probably guess, this is the semantic gap between our perception of this image and how the machine perceives and sees the image, right? And in order to even understand how this could be very challenging, let's look at some challenges, some variations in this type of imaging data.
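A minimal NumPy illustration of the representation just described (note that array libraries typically store images height-first, so an 800-by-600 photo becomes a `(600, 800, 3)` array):

```python
import numpy as np

# An "800 by 600" RGB image: stored height-first as 600 x 800 x 3,
# with 8-bit channel values in [0, 255].
img = np.zeros((600, 800, 3), dtype=np.uint8)

shape = img.shape      # (600, 800, 3): height, width, RGB channels
n_values = img.size    # 600 * 800 * 3 = 1,440,000 raw numbers per image
```

Those 1.44 million raw numbers are all the machine sees; the semantic gap is the distance between that grid and the concept "cat".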
[00:05:15] So let's assume, as one example, that we move the camera. If the camera is moved, for example panning the camera around, even if the cat sits completely and perfectly still, all of those pixel values, every single pixel value of 800 by 600 by 3, will be changed. So all these pixels will have a new value. Again, for us humans, it's the same object; there's absolutely no difference. But from a computer's perspective, it's a completely new data point. So this is one of the challenges, but there are quite a few others as well. For example, illumination is another challenge.
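The camera-pan point can be made concrete: shift a toy image by a few pixels, as a stand-in for panning, and nearly every raw value changes even though the content is identical. A sketch with random pixel data:

```python
import numpy as np

rng = np.random.default_rng(0)
frame1 = rng.integers(0, 256, size=(600, 800, 3), dtype=np.uint8)  # toy "photo"
frame2 = np.roll(frame1, shift=(5, 5), axis=(0, 1))  # same scene, camera panned slightly

# Almost every raw pixel value now differs, though the scene content is unchanged.
fraction_changed = float(np.mean(frame1 != frame2))
```

Any classifier that compares raw pixel values directly will treat `frame2` as an almost entirely new data point, which is exactly the challenge described above.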
[00:06:15] So if you've taken courses in graphics, or maybe other vision courses, or digital image processing courses for engineering applications, you know that the RGB values of each pixel are a function of the surface material, its color, and the light source. And that's why the same cat, the same object, may look different in terms of numbers when it is pictured in different illumination conditions. With that in mind: whether the cat is in a dark room or under the sun, it's still one cat. But this is creating challenges for the machine. Can you maybe name some other challenges that may change the values of the pixels and create problems for the machine to recognize objects, other than the illumination and viewpoint changes that I mentioned?
[00:07:29] Background clutter, background objects? Yes, which is actually our next slide. Yes, background clutter is another challenge. Anything else? Zooming in and out? Yes, so the scale, basically, of the object in the image. What else? The resolution of the image? That could definitely be considered a challenge. But often with machine learning models, or any model that we want to use to recognize objects in images, since we normalize the size of the image, resolution may not be that important unless there are zooming effects on the objects. Occlusion is one of the major problems. Again, as humans it's very easy to say these are cats, even the last one, which is actually very challenging, the one on the right. You can only see a tail and a little bit of, probably, a paw on the right side.
[00:08:37] One could say, yes, that could be a tiger, or it could be, I don't know, a raccoon with a tiny tail. But because of the context, because we know this is inside a living room, on a couch, most probably it's a cat. So again, for us humans, it's not that hard. Beyond that, there are many other problems. Deformation: cats are very deformable, so they create challenges for algorithms to detect and recognize them. I mean, not today's algorithms, but generally for building step-by-step algorithms that can detect objects. So deformation is one of the other major challenges, and beyond that, intra-class variation is one more important challenge. We know that cats can come in different sizes, colors, and patterns; they even have different breeds, and all of those are still cats.
[00:09:59] But for the machines, it's not that easy to recognize the intra-class variations. One other interesting challenge is context, because if you only look at that part, that image on the right, or if an algorithm looks at this without considering the context, it's very easy to classify this as a tiger or some other animal. But because of the context, and because we know that there's the effect of shadows and so on, this could probably be classified correctly. But the thing is that the classifiers we have today can do a really good job at classifying images and identifying the objects in images.
[00:11:02] Thanks to efforts like ImageNet, and also all of the follow-up works that created larger-scale benchmarks for training larger-scale models. And in this class, what we want to do is to get to a place where we build models that can recognize objects and also other aspects within the image. For the rest of this class, we are going to be working towards building, step by step, the building blocks that are needed for building those large algorithms. And before doing so, we have to look at the most basic building block of classifying an image, and that is implementing a function like this.
[00:12:05] So if you've taken some of the computer science or engineering courses that build frameworks through algorithms, for example sorting: as a computer algorithm, it often comes with some if-then-else rules and some for loops and so on. So there's a clear flowchart of tasks, of if-then-else steps, that creates an algorithm for sorting. But when it comes to images and understanding the visual world, that is not happening; that is a challenge. There is no way to hardcode the steps for classifying images. Although there have been some efforts in this space; there are papers that have tried to come up with algorithms and steps to recognize objects. And one of those was based on edge detectors: finding the edges in the image as a first step.
And then after after creating all of these [00:13:23] then after after creating all of these patterns, look at [00:13:26] patterns, look at uh the important patterns. for example, [00:13:28] uh the important patterns. for example, corners. Extract some features that are [00:13:31] corners. Extract some features that are around the corners or count the number [00:13:34] around the corners or count the number of specific types of corners and based [00:13:37] of specific types of corners and based on those from those try to map that into [00:13:43] on those from those try to map that into the output class. So while this is been [00:13:47] the output class. So while this is been an interesting effort and and it had [00:13:50] an interesting effort and and it had some success on very limited um [00:13:54] some success on very limited um variability of type of images but this [00:13:57] variability of type of images but this is very hard to first it's very hard to [00:14:00] is very hard to first it's very hard to scale these types of algorithms. Even if [00:14:02] scale these types of algorithms. Even if it works it's very hard to scale because [00:14:07] it works it's very hard to scale because you have to create these rules and [00:14:09] you have to create these rules and everything for every single object that [00:14:11] everything for every single object that you want to recognize. and second [00:14:14] you want to recognize. and second finding the logic for each of those [00:14:17] finding the logic for each of those requires a lot of uh effort by itself as [00:14:20] requires a lot of uh effort by itself as well. So because of these challenges, I [00:14:23] well. 
[00:14:25] So because of these challenges, I think these types of algorithms, which are based on creating logic and procedures for detecting objects or classifying images, have not been quite successful, and machine learning comes in with this data-driven approach. So with this new paradigm of looking at this problem from a data-driven perspective, we define a three-step process, and the first step is to collect data sets of images and their labels. There are many different ways of doing this: if you want to recognize a specific type of object, or specific types of objects, we can look for data sets, or single data points, over the internet to create many samples of each of the examples. We used to be doing this 10, 20 years ago, using search engines, image search engines, over the internet to create these types of data sets.
[00:15:44] Now we have all of the data sets. And then the second step is using machine learning algorithms to train a classifier: basically, build a function, train, that takes the images in the training data and their associated labels and builds a model that can associate images with the labels. And then the last step would be evaluating the classifier on new images, which means implementing a function called predict that takes the model and some test images, and for those test images, which were not part of the training images, predicts the labels and returns those as the output. So, a very simple procedure, but instead of building a logic, we are building a data-driven approach. As I said, we want to talk about two popular methods and classifiers. One of them is the nearest neighbor classifier.
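The train/predict shape of the recipe can be sketched as two functions. The "learner" here is a deliberately trivial stand-in (always predict the most frequent training label), just to show the interface, not any method from the lecture:

```python
from collections import Counter

def train(images, labels):
    """Step 2: fit a model to (image, label) pairs.
    This toy 'model' just remembers the most frequent training label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return {"default_label": most_common}

def predict(model, test_images):
    """Step 3: return one predicted label per unseen test image."""
    return [model["default_label"] for _ in test_images]

model = train(["img0", "img1", "img2"], ["cat", "cat", "dog"])
preds = predict(model, ["img3", "img4"])   # -> ["cat", "cat"]
```

Every classifier in the course fits this same two-function shape; only what `train` stores and what `predict` computes changes.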
[00:17:01] This is the easiest form of classification. We specifically want to go over it because we can learn some of the concepts around building these classifiers, and it is easier to explain some of the details; then we will move on to the topic of linear classification. Okay, to build the nearest neighbor classifier, as I said, we need to implement the train and predict functions. The train function just needs to memorize all of the data and labels, so it basically does nothing other than keeping everything in memory. Then the predict function looks for the most similar training image: essentially, the classifier keeps a lookup table of all of the images and their labels.
[00:18:05] During prediction, or testing time, it tries to find the closest, most similar image, and outputs the label of that image. Let's look at an example. Assume that we have these five images in our training data (yes, you can see my cursor), and this is the query image, the input image for prediction. What we want to do is see which of these training images is the most similar to it, and for that we need a distance function. This distance function takes two images, each training image paired with the query image, and returns a value which quantifies the similarity between the two inputs. There are many different ways of doing that.
[00:19:13] One of the most popular is the L1 distance, which is defined as the sum over all absolute values of pixel differences between the two images I1 and I2. As an example, if this is a test image and we want to calculate its distance to an image in the training data, we do a pixel-wise subtraction, take the differences between the pixel values, and then sum them up, which defines this value as the distance between the two images. This is the most basic distance function, but it is actually very useful in many applications; we will be coming back to L1 and other variations of distances in the class quite often. With this very simple definition, we want to see how to get it implemented. As I said, the first step is to just memorize the training data.
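The pixel-wise subtraction just described can be written out directly. A minimal sketch on a made-up 2x2 patch; the numbers are illustrative, not the slide's example.

```python
import numpy as np

def l1_distance(i1, i2):
    # L1 distance: sum of absolute pixel-wise differences.
    # Cast to a signed type first so the subtraction cannot wrap
    # around when the inputs are uint8 images.
    return int(np.sum(np.abs(i1.astype(np.int64) - i2.astype(np.int64))))

test_patch  = np.array([[56, 32], [10, 25]], dtype=np.uint8)
train_patch = np.array([[10, 20], [24, 17]], dtype=np.uint8)
print(l1_distance(test_patch, train_patch))  # |56-10|+|32-20|+|10-24|+|25-17| = 80
```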
[00:20:26] So the train function just keeps the data in memory, and then, using Python libraries such as numpy, we can implement the predict function in just four lines: calculate the distances between each of the test samples and the training data, take the minimum for each test sample, and output the label of the training example at that minimum index. That is the implementation of the predict function. Yes, the pixel values, as I explained; in the simplest form, an image is a tensor of 800 by 600 by 3, with three channels, and these are RGB values for each of the pixel locations. I should repeat the questions for the online students too: the question was what the pixel values represent. The next question is why they are between 0 and 255.
[00:21:48] There are many different standards for storing images. The most popular one, used in almost all images that you see online and here, is RGB. RGB is a 24-bit format, sometimes 32 because there is another channel, alpha, which we won't get into. The 24-bit format means that for each of the three channels, red, green, and blue, which combine to create all color combinations, we have eight bits. That's the defined standard. There are some other formats too, but this is the most popular one. So with that, let me go back to the code and ask you a question. [Applause] I know that most of the students come with engineering backgrounds and a little bit of computer science as well.
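The memorize-then-look-up classifier described above, with its roughly four-line numpy predict loop, can be reconstructed as follows. This is a sketch in the spirit of the slide, not the exact code shown in lecture; it assumes images arrive flattened into the rows of a 2-D array, with 8-bit RGB values already converted to floats or ints.

```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # X is N x D, one flattened image per row; y holds the N labels.
        # "Training" just memorizes the data: no computation at all.
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        # For each test row: L1 distances to every training row, the
        # index of the minimum, and the label stored at that index.
        y_pred = np.zeros(X.shape[0], dtype=self.ytr.dtype)
        for i in range(X.shape[0]):
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            y_pred[i] = self.ytr[np.argmin(distances)]
        return y_pred

nn = NearestNeighbor()
nn.train(np.array([[0.0, 0.0], [10.0, 10.0]]), np.array([0, 1]))
print(nn.predict(np.array([[1.0, 1.0], [9.0, 9.0]])))  # [0 1]
```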
[00:23:00] But we want to see, with say n samples, n examples in the training data, how fast training and prediction happen. I'm hoping that you're familiar with the big O notation that we often use to represent computational, and sometimes space, complexity. If you look at the algorithm, I'll go through the training function, and then I want you to help me with the answer for prediction. For the training step, training is O(1), because we are not actually doing anything; we are not even moving any data, just keeping a copy of the data in memory. No operations, meaning with a constant number of operations, we can complete the training step. What about the prediction step? For each single example in the test data, how many operations do we need?
[00:24:19] And yes, if we have n training examples, it means that we have to calculate the distance of every single test image to all of the images in the training data, so at least on the order of n operations. This is not really good, because training does nothing, but at test time, during prediction, we spend so much time just doing comparisons between each single data point and the training examples. It would be as if, every time you asked ChatGPT a question, it compared your query against all possible answers over the internet, which would take years, and then returned your response, right? It would not scale, even for very simple problems.
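That asymmetry can be made concrete by counting the work predict does. A hedged sketch with arbitrary sizes: training stores a reference in O(1), while each query touches every number in the training set.

```python
import numpy as np

n_train, n_test, d = 1000, 5, 32
rng = np.random.default_rng(0)
Xtr = rng.random((n_train, d))

# Train: O(1). We just keep a reference; no work proportional to n_train.
model = Xtr

# Predict: each query computes a distance to every training row, so each
# of the n_test queries touches all n_train * d numbers.
ops = 0
for x in rng.random((n_test, d)):
    distances = np.sum(np.abs(model - x), axis=1)  # n_train * d subtractions
    ops += model.size

print(ops == n_test * n_train * d)  # True: prediction cost grows with n_train
```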
[00:25:25] We used to use these types of approaches. What we often want is to build classifiers that are fast during prediction; it's okay if they take a lot of time during training, because that can be done offline. With that in mind: although there have been a lot of efforts to make nearest neighbor much faster, using GPUs and so on, those are beyond the scope of this class; if you're interested, you can take a look at them. But with that, I want to look at some visualizations of how this algorithm works in general. So given this space, we have five classes: red, blue, green, purple and, sorry, yellow. Each dot represents one training sample in that class. If you partition the space for every single point, you see that we can create these five partitions.
[00:26:38] Let's say five, or in this case six, different partitions, such that if you have a test sample in a specific region, the color of that region shows what the nearest neighbor for that sample will be. This is the one-nearest-neighbor algorithm partitioning the space. But do you see a problem in this example? The yellow point is right in the middle of all the greens, which means it is probably an outlier, probably noise, and this is the case in many of the problems we have to solve. The reason there is this big yellow region in the middle is just that single point, and it happens because you're only using one nearest neighbor.
[00:27:43] To make it a little more robust, we can increase the number of nearest neighbors that we take, which turns the nearest neighbor algorithm into k-nearest neighbors. We select more than one point, or sample, and take a majority vote to identify the label of any given test sample, test image. But the problem you can see here is that now we have some white regions. Those white regions are areas where we cannot make a complete decision, because they contain an equal number of samples from the three different classes among the neighbors, and there is no way to identify the label of an example in a white region.
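Majority voting over the k closest points can be sketched like this. An illustrative helper, not course code; note that Counter.most_common breaks exact ties arbitrarily, which is precisely where the white regions in the visualization come from.

```python
import numpy as np
from collections import Counter

def knn_predict(Xtr, ytr, x, k=3):
    # L1 distance from the query to every training point, then a
    # majority vote over the labels of the k nearest ones.
    distances = np.sum(np.abs(Xtr - x), axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(ytr[i] for i in nearest)
    return votes.most_common(1)[0][0]

Xtr = np.array([[0.0], [1.0], [2.0], [10.0]])
ytr = np.array(["green", "green", "yellow", "yellow"])
print(knn_predict(Xtr, ytr, np.array([0.5]), k=3))  # "green": 2 of the 3 neighbors
```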
[00:28:47] And for you, if you create these types of spaces for your own problems, those white areas are good regions to go and collect more data for. They are unclear spaces, so this is a good way of finding regions that are important for more data collection. Okay, so we can go larger on the value of k, and one of the factors that plays an important role is the value of k. But if you remember, we had another decision to make, which was the distance function. We talked about the L1 distance: again, the sum of all absolute values of pair-wise differences of the pixels. If I visualize the L1 distance, which in some contexts we call the Manhattan distance, the distance function is visualized in this way.
[00:30:00] If I look at this square in the space, all of the points on that square have the same distance from the origin, the center point. So this is a good way of visualizing how the L1 distance function works. Another popular distance function that we use is L2, which, instead of the absolute value, calculates the square of the differences and sums them up; because of the square, we also take a square root. Visualizing that, we get the circle, where every point on the circle has the same distance from the origin. This visualization actually helps us understand the differences between these distances, and these are the most basic and simplest distance functions that we can use.
[00:31:09] There are again many more, but the reason this visualization is helpful is the following: x and y in these two visualizations are basically the features. If we have two pixel values, two features, then we have this 2D space, and x and y are those features. If I rotate these features, meaning if I use other types of features, the L1 distance will take a different value, while nothing changes for L2. That is a big difference between L1 and L2. Sometimes, if our features are very specific and meaningful and we want to preserve their information, L1 is often better, because, as you can see, it has a shape that preserves and enforces distances based on the features. But if the features are more arbitrary, then the L2 distance makes more sense.
[00:32:17] So the distances of all the points on this shape from the origin are exactly the same if I use the L1 distance, right? And for the L2 distance, the points on this circle have the same distance from the center, the origin of the space. That is basically what these two images are showing: any point on this shape, when using the L1 distance, has the same distance from the origin, and any point on the circle, if you're using the L2 distance, has the same distance from the origin. Yeah, why is it better to use L1 if we want to preserve the features? To answer that question: if I rotate the feature axes, this distance function changes completely, right? While if I do the same here, nothing changes.
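This rotation point can be checked numerically. A small sketch: rotating both points by the same angle leaves the L2 distance unchanged but, in general, changes the L1 distance.

```python
import numpy as np

def l1(a, b):
    return np.sum(np.abs(a - b))

def l2(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

theta = np.pi / 4  # rotate the feature axes by 45 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.isclose(l2(R @ a, R @ b), l2(a, b)))  # True: L2 is rotation-invariant
print(np.isclose(l1(R @ a, R @ b), l1(a, b)))  # False: L1 depends on the axes
```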
[00:33:34] It's the exact same value of the distance, sorry. So in this case L1 is very sensitive to the feature values, while L2 is not. If you select another feature in the same space that creates a different shape, then your distance function changes as well. So if I draw the lines here, and again the question for the online students is why it changes if we rotate: if I select another feature that goes from this side, right, then the lines will look different. So if you rotate, this one changes, but for that shape it's agnostic, right? With these two distance functions that we talked about, if I re-visualize the space, you can see, with k equal to one, with one nearest neighbor, the space partitionings under L1 and L2.
[00:34:43] One of the interesting things you can see here is that with the L1 distance most of the boundaries are parallel to the two axes, the two features x1 and x2, very much sensitive to the features, while with L2 we have somewhat smoother boundary separation. There is a tool online, on the lab website, that you can play around with, with different distance functions and different values of k; you can create different setups, so do play around with it. But why did we talk about nearest neighbor to begin with? Yes, it's the easiest solution, the easiest data-driven approach, and great to start with. But one of the main reasons we discuss nearest neighbor is that it lets us look into the topic of hyperparameters.
[00:35:55] Hyperparameters are variables that you have to make a decision about in order to run your algorithm. In this case, the value k, the number of nearest neighbors, is a hyperparameter. Depending on how many nearest neighbors you take, your outputs will be different. Another choice you have here is the distance function. The choice of hyperparameters is often very much data-set dependent, and sometimes problem dependent, and we need a way to identify them, to optimize them, for each single problem. That is what is often referred to as hyperparameter tuning in machine learning and deep learning algorithms. And how do we set the hyperparameters? There are different approaches.
[00:37:03] One of them is to choose the hyperparameters that work best on the training data: you have a set of images or data in your training set, and you look for the hyperparameters that give the best training accuracy, or the minimum training loss. While this works for the training data, it's not a good idea at all, because, especially with nearest neighbor, K = 1 is always the best value, right? You're memorizing the training data, so K = 1 will always give you 100% training accuracy. So we know this is not a great idea. [00:37:50] The second approach is choosing the hyperparameters that work best on a held-out testing set. While this is a little better than the first one, there is also a big problem here. Can anybody say why, exactly?
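A quick sketch of why tuning on the training set fails for nearest neighbor: with K = 1, every training point's nearest neighbor is itself (assuming no duplicate points), so training accuracy is trivially 100%. The brute-force kNN and the toy data here are my own stand-ins, not course code:

```python
import numpy as np

def knn_predict(train_X, train_y, X, k=1):
    # Brute-force kNN with L2 distance; majority vote over the k nearest.
    preds = []
    for x in X:
        dists = np.sqrt(np.sum((train_X - x) ** 2, axis=1))
        nearest = train_y[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

rng = np.random.default_rng(0)
train_X = rng.normal(size=(50, 2))
train_y = rng.integers(0, 3, size=50)

# Evaluating on the training set itself: K = 1 "wins" by memorization,
# since each point's nearest neighbor (distance 0) is the point itself.
train_acc = np.mean(knn_predict(train_X, train_y, train_X, k=1) == train_y)
print(train_acc)  # 1.0
```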
[00:38:11] Yes, it's kind of cheating, because you are trying to find the best hyperparameters for the testing data, and you don't know how the model will work on any data points that are not in the testing set. That is exactly right: it's not a good idea because we don't know how the model will generalize, and for sure never do this; as we said, it's kind of cheating. [00:38:43] A better idea is to always take some part of the training data as a validation set: train your model on the remaining portion, which we call train, then try to optimize your hyperparameters on the validation set, and after you've found the best set of hyperparameters, use them to make the predictions on the testing set.
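The split just described can be sketched like this (the kNN helper, the data, and the candidate K values are hypothetical stand-ins for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels

# Hold out part of the *training* data as a validation set;
# the test set is never touched while tuning.
idx = rng.permutation(len(X))
train_idx, val_idx = idx[:150], idx[150:]

def knn_predict(train_X, train_y, X, k):
    preds = []
    for x in X:
        dists = np.sum((train_X - x) ** 2, axis=1)
        nearest = train_y[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

# Pick the K with the best validation accuracy.
accs = {}
for k in [1, 3, 5, 7]:
    preds = knn_predict(X[train_idx], y[train_idx], X[val_idx], k)
    accs[k] = np.mean(preds == y[val_idx])
best_k = max(accs, key=accs.get)
print(best_k, accs[best_k])
```

Only after `best_k` is fixed would you run the model once on the testing set.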
[00:39:17] So this is a much better approach, although it does have some challenges itself, because sometimes the validation set you've selected may not be a good representative of the entire landscape; your validation set is almost always much smaller. That's why an even better approach is to use cross-validation for setting hyperparameters. [00:39:52] Basically, you split your training data into a number of folds, a number of partitions, in this case five, and each fold plays the role of the validation set once. You run this iteratively, five times for five-fold cross-validation: you set a value of the hyperparameter, run it on all five splits, calculate the accuracy on each validation fold, and average the accuracies.
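A minimal five-fold cross-validation loop matching that description (function names, data, and candidate K values are my own illustrative choices):

```python
import numpy as np

def knn_predict(train_X, train_y, X, k):
    preds = []
    for x in X:
        dists = np.sum((train_X - x) ** 2, axis=1)
        nearest = train_y[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

def cross_val_accuracy(X, y, k, num_folds=5):
    # Split the training data into folds; each fold serves as the
    # validation set exactly once, and the accuracies are averaged.
    folds_X = np.array_split(X, num_folds)
    folds_y = np.array_split(y, num_folds)
    accs = []
    for i in range(num_folds):
        tr_X = np.concatenate([f for j, f in enumerate(folds_X) if j != i])
        tr_y = np.concatenate([f for j, f in enumerate(folds_y) if j != i])
        preds = knn_predict(tr_X, tr_y, folds_X[i], k)
        accs.append(np.mean(preds == folds_y[i]))
    return np.mean(accs)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
# Average validation accuracy for each candidate K, as in the lecture.
scores = {k: cross_val_accuracy(X, y, k) for k in [1, 3, 5]}
print(scores)
```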
[00:40:31] Then you repeat this for multiple hyperparameter settings to find the best one, and once you've found it, you apply it to the testing set. This is a little more reliable and generates much better results. In larger-scale deep learning it is less practiced, though, because repeating everything five times with huge datasets is very hard, so we often use intuition for setting hyperparameters, and a single validation set is sometimes the approach we go with. But outside computer vision and outside large-scale datasets, this is very much advised: research papers often require these kinds of cross-validation and statistical frameworks to make sure your results are reproducible on a testing set. [00:41:29] Anyway, so there are different approaches.
[00:41:31] Let's finalize and wrap up the topic of nearest neighbor and look at some examples and results. [00:41:45] Let me introduce you to the CIFAR-10 dataset. It's one of the datasets you're going to be using quite often in your assignments. It has 10 classes, with 50,000 training images and 10,000 testing images; some examples of the 10 classes are shown here. [00:42:04] For each of the testing images, if we run nearest neighbor and select the top 10 nearest neighbors, they are all visualized there. [00:42:23] As you can imagine and guess, one of the first questions to answer is: what should the value of K be? How many nearest neighbors should we take? A quick experiment with five-fold cross-validation, where each point is the accuracy of one fold for a given value of K, shows the different values here.
[00:42:53] As you can probably see, K = 7 generates the best results in terms of accuracy, which is close to 28%. That's actually not too bad, because this is a 10-class classification problem, and with 10 classes a random guess gets you about 10% accuracy. So this is much better than a random guess; it's working, it's doing something, but there's a lot of room to improve. [00:43:28] If we go back and look at the examples, we can actually see there are many mistakes, especially with the closest neighbor. For example, look at the fourth row: the test image is a frog, but the first retrieved example seems to be a cat; sorry, a dog. You can guess why this is happening: the distance is being applied on pixels, and pixel-wise the images look like each other. They have the same kinds of colors in most pixels, so they end up much closer.
[00:44:08] This example, and many others, shows that distances applied to raw pixel values are not the best choice; in practice we never use them. There are much better approaches that we'll discuss in future lectures. [00:44:28] Just to wrap up the topic, here is another example. If you look at this original image, those three images look very different in terms of color, or occlusion; and the third one from the left is the same image with every pixel shifted one position to the right. Although from a human perspective there is absolutely no difference, the distance between it and the original image is the same as for the other two examples you see here.
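The shift problem can be sketched on a synthetic "image" (random values, not the lecture's figure): shifting every pixel one position produces a large pixel-wise L2 distance even though a human would see the same picture.

```python
import numpy as np

rng = np.random.default_rng(3)
img = rng.integers(0, 256, size=(32, 32)).astype(float)

# Same image content, shifted one pixel to the right (wrapping around).
shifted = np.roll(img, shift=1, axis=1)

# Pixel-wise L2 distance treats this as a big change,
# even though the content is "the same" to a human eye.
dist = np.sqrt(np.sum((img - shifted) ** 2))
print(dist)
```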
[00:45:10] I'll stop for a couple of questions; this is the summary of what we've discussed. So the question is how we make a decision in case of a tie, right? In those cases you often randomly select one of the top classes. [00:45:22] As for collecting more data: if you're solving a problem in genetics, say, or in medical imaging, when you visualize your examples, your features, in this nearest-neighbor space and you see pockets of the space where you don't have any good samples, or where there is ambiguity, then you often try to go and find more samples that lie in that same area of the space. Okay.
[00:46:02] So, summarizing what we've talked about with K-nearest neighbor: it was mostly about understanding the easiest data-driven algorithm, and then talking a little bit about hyperparameter tuning and about how the distance metric and the value of K play a very important role. [00:46:26] Moving on to the next topic, which is linear classifiers. We have about twenty-five minutes, and I want to spend the remaining time of this lecture on this very important topic: it is the most important building block for almost all of deep learning. [00:46:57] First we want to see how this approach is different from nearest neighbor. It is a parametric approach, meaning that now we are learning: we are finding some parameters W, some weights, that map the input image to the output classes, the output numbers.
[00:47:22] In this case, when we create this function f that maps input to output, those outputs are essentially membership scores of the image for each of the 10 output class labels. With this setup, a linear classifier uses the parameters W to map each input x to an output value y. [00:48:01] How this is done is very simple. This image is basically an array of, say, 32 × 32 × 3, so 3,072 numbers, and this defines our x, which is a 3072 × 1 vector. We know we have 10 output classes, so we need 10 different scores, and the output will be a 10 × 1 vector. This means we have to find a weight matrix W of size 10 × 3072 that maps x to the output scores.
[00:48:50] Just to complete this linear function, we often use a bias term as well. It's an input-independent value that actually has a lot of different uses; I can talk about it when I do some geometric visualizations, but it creates a shift for the different class scores and helps with much better separation of each class. [00:49:21] As I said, these linear functions are building blocks for neural networks: each of these linear classifiers, these linear functions, when put together one after the other, creates these large neural networks. There are a lot of other things that need to be added, but this is one of the most important components. If we look at some of the popular neural networks, we can see that linear functions are everywhere in the architectures.
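With the CIFAR-10 shapes above, the score function f(x, W) = Wx + b is a single matrix-vector product. The values here are random, just to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.random(3072)             # flattened 32x32x3 image, shape (3072,)
W = rng.normal(size=(10, 3072))  # one row per class
b = rng.normal(size=10)          # input-independent bias, one per class

scores = W @ x + b               # shape (10,): one score per class
print(scores.shape)  # (10,)
predicted_class = int(np.argmax(scores))
```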
[00:50:06] To better understand what this mapping, this function, is doing, let's go back to our CIFAR-10 example with its training and testing samples, and make it even a little simpler. Instead of looking at large 32 × 32 images, let's look at 2 × 2 images: an input image that has four pixels. The input image is flattened into a vector, and as you can see here, we have to find W and the values of b so that the input image is mapped to some scores as the output. [00:50:51] This is how the linear function looks from an algebraic viewpoint. For the output scores, here we are considering three classes: cat, dog, and ship. As you can see, this function maps the image, the vector representing the image, to those scores. That is the algebraic viewpoint of linear classification.
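The 2 × 2 toy setup in numbers (the specific pixel values, weights, and biases below are illustrative, not necessarily the slide's):

```python
import numpy as np

# A 2x2 "image" flattened into a 4-vector of pixel values.
x = np.array([56.0, 231.0, 24.0, 2.0])

# One row of W per class: cat, dog, ship. Values are made up.
W = np.array([
    [0.2, -0.5, 0.1, 2.0],   # cat template
    [1.5, 1.3, 2.1, 0.0],    # dog template
    [0.0, 0.25, 0.2, -0.3],  # ship template
])
b = np.array([1.1, 3.2, -1.2])  # per-class bias

scores = W @ x + b  # one score per class
for label, s in zip(["cat", "dog", "ship"], scores):
    print(label, s)
```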
[00:51:24] Now let's look at a visual perspective of this linear classifier. As we talked about, for each of the classes we have a row of the matrix W, right? And that row is a kind of template for that specific class. If I separate it like this, the image is multiplied by W, plus b, and each row of W is the template for one of the three classes of cat, dog, and ship. [00:52:08] After training, after building the model on the CIFAR-10 dataset, if I take this visual viewpoint of the linear classifier and look at the templates that are learned for each of the 10 classes, you can see these templates.
[00:52:30] It's very interesting that in some of these cases, for example for car, you do see something template-ish of the front of a car, even though this is all done with just one linear classifier. So that is the visual viewpoint of the linear classifier. [00:52:51] There is also a geometric viewpoint. What the linear classifier does, if we are in a 2D space, is find the lines that separate each class from the others. As you can see here, red, blue, and green define different classes. In a higher-dimensional space, instead of lines we get hyperplanes, as you can see in this example on the left.
[00:53:29] You can also see the use of the bias term here, because if we didn't have the bias, all of these lines would have to pass through the origin, the center of the space, which doesn't really make sense; with the bias we can create more reliable decision boundaries. [00:53:55] So a linear function, a linear classifier, is very useful for many applications, as we talked about, and it's a building block of more complex neural networks, but it does have its own challenges, because there are many arrangements of data it cannot classify. For example, if class one occupies the first and third quadrants and class two occupies the second and fourth, there is no way to linearly separate them.
[00:54:30] Another example is a separation between class one and class two where points whose distance from the origin is between one and two are class one, and everything else is class two. Similarly, if there are three modes, three areas in the space, that belong to one class, and the second class is everything else. In all of these cases it's actually very hard to do the separation. [00:55:01] So we've talked about linear classifiers and how they can map input images into labels in the output. What remains now is how to choose the values W that, for each of these images, map the image to a score for each single class as the output.
[00:55:27] In order to do that, we need to define a loss function, sometimes referred to as an objective function, that quantifies how bad the classifier, how bad the model, is doing: the level of unhappiness with respect to the scores on the training data. [00:55:49] After defining that, we need a way to efficiently change the values of W to minimize that unhappiness, that is, to minimize the loss function. This is the optimization process, the topic of the next lecture. [00:56:15] Again, for simplicity, let's look at an even easier example: a linear function, as you can see here, and the three classes of cat, car, and frog. We need a loss function that tells us how good our current classifier is.
[00:56:40] To do that, we need to parameterize the problem: x_i and y_i define the input images and the corresponding labels. Then we need a loss function, a kind of distance function, that looks at how bad the predicted scores f(x_i, W) are compared to the ground-truth values y_i that are already given. We often normalize by the number of samples as well, so the total loss is L = (1/N) Σ_i L_i(f(x_i, W), y_i). This defines the loss function, the objective function. [00:57:29] So how can we do the optimization and really find the W's? There are different ways of defining this per-example loss L_i, and right now I want to talk about the softmax classifier as an example. [00:57:56] For that cat, if you remember, the scores that were given were 3.2, 5.1, and -1.7.
These are the scores that are the output of the function we discussed, f(x_i, W). [00:58:16] And these scores are unbounded, and the values are often not very controllable, because this is just a linear function, right? In order to turn these into proper scoring functions, the best possible way is to turn them into probabilities, which define the probability of the class being class k for each input image x_i. Right? [00:58:50] And in order to do that, this is the function that we use, the softmax function. We first exponentiate the values of the scores to create these numbers. When we use exp on these numbers, the outputs will always be positive, right? And we need to make sure that the probabilities are always positive.
[00:59:18] And after creating these numbers, what we can do is just normalize them. So: exponentiate, and then normalize by the sum over all of the classes. This creates a very good set of values that defines a probability function, a distribution; they sum to one. [00:59:46] And if I want to interpret this, it's very simple to say that this set of parameters W thinks that this image is a cat with a probability of 13%, or 0.13, right? And obviously it is making a mistake in this example, because this W is not a good setting; we should optimize it and change it. So these probabilities are the counterparts of the unnormalized log probabilities, which are often referred to as logits.
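The two steps just described, exponentiate then normalize, can be sketched in a few lines (assuming NumPy; the max-subtraction is a standard numerical-stability trick, not part of the lecture's definition, and it doesn't change the result):

```python
import numpy as np

def softmax(scores):
    """Turn unbounded class scores (logits) into probabilities that sum to one."""
    shifted = scores - np.max(scores)   # stability trick: exp of large scores would overflow
    exps = np.exp(shifted)              # exponentiate: every output is positive
    return exps / np.sum(exps)          # normalize by the sum over the classes

# The cat example: scores (cat, car, frog) = (3.2, 5.1, -1.7)
probs = softmax(np.array([3.2, 5.1, -1.7]))  # cat probability comes out around 0.13
```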
[01:00:30] So if you've taken other machine learning courses, or I'm sure in other fields, you've used logistic regression. This is the exact same framework as logistic regression, and since we have multiple classes here, it's multinomial logistic regression. [01:00:54] How do we define the function L? I told you that there are different ways of defining the function L. We want to define a loss function, so what's the objective here? We want to maximize the probability of the sample belonging to the correct class, right? So we want to maximize that value, the 0.13, and right now we have other, larger values in that set. [01:01:30] So if we want to maximize this, this is a maximization problem, right? But all of the objectives that we define, we try to build as minimization objective functions.
The first step is just to negate the value, right? We negate it, so the maximization problem turns into a minimization problem. And then we also take the log of the value, just to make the numbers a little bit more manageable. So the negative log of that value defines the objective function, the loss function, for solving this problem. Very simple. [01:02:09] That's the objective, or the loss function, for softmax, for this logistic regression function. And if you've taken other classes, as I said, like CS229, it's often referred to as maximum likelihood estimation as well; it's the same algorithm. [01:02:34] So with that in mind, I want to say that, as we discussed, it's the negative of the log of that probability of the correct class which defines the objective function, the loss function. And that's basically that simple, but there are other types of
interpreting this framework as well. [01:03:02] So one way of redefining this loss function is to say that we have some estimated probabilities, and we also have a probability function that defines the correct probabilities. What we want to do is match these two probability functions, right? And in order to do that, we want to minimize the KL divergence, the Kullback-Leibler divergence. This is an information-theoretic perspective on this loss function. [01:03:42] And again, those are exactly the same: this KL divergence, in this setting, simplifies into the same negative log function that we defined. And even going further, this is exactly the cross-entropy function, because cross-entropy is defined as the entropy of P, the entropy of the correct probabilities, plus the same KL divergence.
[01:04:19] Again, this simplifies into the same negative log function, and that's because when we use a one-hot encoding setting for the classes, the entropy is zero. So that's one of the reasons that we call this function the cross-entropy, or binary cross-entropy, function in all of deep learning. If you've used any of the neural network frameworks, you've probably heard about BCE, binary cross-entropy, or you will be hearing about it a lot. [01:04:46] So this is the same framework. We started very simple, but we got to the similarities and differences between each of those. So the loss function was defined as the negative log of this probability, and the probability was defined by the softmax, which we talked about. And then optimizing for this, which is the topic of the next session, will give us the right W's.
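The chain of equivalences above, negative log probability, KL divergence against a one-hot target, and cross-entropy, can be checked numerically. A sketch (the rounded probabilities below are assumed from the cat example, not taken from the slides):

```python
import numpy as np

def softmax_loss(scores, correct_class):
    """L_i = -log(softmax(scores)[correct_class]): negative log probability of the correct class."""
    shifted = scores - np.max(scores)                 # subtract the max for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[correct_class])

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k, treating 0 * log 0 as 0."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """KL(p || q) = sum_k p_k log(p_k / q_k)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Cat example: scores (cat, car, frog), correct class is cat (index 0)
loss = softmax_loss(np.array([3.2, 5.1, -1.7]), 0)    # roughly -log(0.13)

# With a one-hot ground truth p, H(p) = 0, so H(p, q) = KL(p || q) = -log q[correct]
p = np.array([1.0, 0.0, 0.0])
q = np.array([0.130, 0.869, 0.001])                   # softmax probabilities, rounded
```

With one-hot labels the entropy term H(p) vanishes, which is why minimizing cross-entropy, minimizing KL divergence, and minimizing the negative log likelihood all coincide here.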
[01:05:22] But before I end, I want to ask a couple of questions with this definition that you see here. What is the minimum and maximum value that you can see for the loss function L_i? [01:05:33] Yes, the minimum is zero. And the log of zero turns into minus infinity, but we have a negation there, so the maximum would be infinity. That is correct. [01:05:51] And let me actually look at a second question. Yes, this one. [01:06:00] So when we initialize all of the scores, so basically the W's, in the beginning, it's almost random. So the probabilities of each of the classes become mostly equal. What is the softmax L_i, assuming we have C classes, and especially if C is 10?
[01:06:34] So because the probabilities are equal, it means that all of the probabilities are around 1/C, right? And then the loss will be log of C. If we have 10 classes, then the log, or ln, of 10 is about 2.3, which is the value we would expect to see.

================================================================================
LECTURE 003
================================================================================
Stanford CS231N | Spring 2025 | Lecture 3: Regularization and Optimization
Source: https://www.youtube.com/watch?v=dyNGd06MWn4
---
Transcript

[00:00:05] Today's lecture topic will be about regularization and optimization, which are two very important concepts more broadly in deep learning and machine learning, but especially important for computer vision. And we're going to start with a recap from last week and discuss some of the topics that we discussed last time. [00:00:24] So we really honed in on this idea of image classification as a core task in computer vision. And what this task is
is: given an image as input, you try to map this image to a label inside of a set of labels. So here we have five different labels: cat, dog, bird, deer, and truck. And the goal is to assign the correct label to the input image. You're creating some model, or some function, that takes an image as input and outputs the specific label here. [00:00:55] And we also talked about a lot of the challenges for classification. So one of the main challenges is shown in the top left here, and it's this idea of the semantic gap between what we as humans perceive in the image, which is the cat, and what it's actually represented as in the computer, which is this grid of pixel values, where you have this multi-dimensional array, or tensor, and you have discrete values for each of the pixels. This is
very different from how we're perceiving the image and deciding that this is a cat. So being able to map from this complex numeric representation into one that we humans understand is the core challenge here. [00:01:37] But there are also challenges surrounding the images themselves. So if you look at something like the illumination of the scene: here you'll have different pixel intensities based on where the lighting is in the scene, and you could have certain parts of your object in the shade and harder to see. [00:01:55] Cats by nature are very deformable. So talking about deformable objects: they can move around and twist and bend in different ways, so they won't always have the same shape, and this can prove challenging if you're trying to design an algorithm to detect objects. There's also the challenge of occlusion.
[00:02:09] So you could have a cat that's hiding underneath the couch cushions here, but we as humans can clearly tell this is a cat because of the tail, how it's sort of sticking out at the end here, and, knowing the way that cats behave, we can infer that this is a cat. [00:02:22] You'll also have things like background clutter, where the object could blend into the background, so we need to account for this somehow as well. [00:02:30] And finally, there's this idea of intra-class variation, where different objects in the same category can look very different from each other, but we still need to group them all into the same category. So these are a lot of the challenges of recognition, and why it isn't such a simple problem where you can just write if-else rules to account for everything and use simple logic to classify.
[00:02:51] So if logic's sort of thrown out the window, and you can't just create these logic rules, how do you actually create a classifier? Here's where we talked about data-driven approaches. [00:03:00] And we talked about basically the simplest machine learning model, which is this k-nearest neighbors model. And the idea is that, for a given data point, you look at the existing data points in your training set that are very close in distance to your new data point coming in. [00:03:20] For the one-nearest-neighbor case, this just results in: you find the closest data point, and you assign it that class label. And you can also look at multiple nearest neighbors, where you're assigning the most common class label among those nearest neighbors. So we talked about these two different approaches.
We talked about how you ideally don't want to split your data set into just train and test; you can do train, validation, and test, so that you can use this validation set to actually help you choose your hyperparameters. [00:03:51] So the main hyperparameter for k-nearest neighbors is this k, one or five in these examples. And what we showed is an example where you're plotting your accuracy on the validation set over the different k values, and you would choose the one that has the highest accuracy. [00:04:08] So this is how you'd use the validation set, and then you would reserve the test set for: okay, how does your model do on completely new data it's never seen before? That would be the purpose of the test set. This is all just recap. [00:04:19] There was a bit of confusion about distance metrics.
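A minimal k-nearest-neighbors predictor along these lines, assuming L2 distance and NumPy arrays; this is an illustrative sketch, not the course assignment's implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=1):
    """Predict the most common label among the k training points closest to x_new."""
    dists = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))  # L2 distance to every training point
    nearest = np.argsort(dists)[:k]                          # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                         # majority vote (k=1: the single neighbor)
```

The k here is exactly the hyperparameter you would tune on the validation set.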
We put a post on Ed that explains this in more detail. But we've talked about two different distance metrics, the two most commonly used ones in machine learning, which are the Manhattan distance, or L1 distance, and the L2 distance, or Euclidean distance. [00:04:36] L2 distance is, if you imagine, just the straight-line distance, sort of how we think of distance in everyday usage of the word, geometrically. And then Manhattan distance is this idea where you can only traverse left and right and up and down in this diagram; you can't move diagonally. [00:04:55] So looking at just one quick example here: the reason why all these points on the line are the same distance from the origin is because you can't move diagonally. So you have to move, in this case, up 0.5 and to the right 0.5.
[00:05:08] So the total distance is one, whereas here you're just going in a straight line, but it's one also, the same distance. In the L2 distance, all the points equidistant from the origin form a circle, because you can just go in the direct line here. So this is maybe a brief explanation. [00:05:27] The final thing we honed in on last time was this idea of a linear classifier. So the basic idea, in the basic setting that we did, is we have an image which is, say, width 32 and height 32, and there are three pixel values for each of the spatial locations in our image, representing the red, green, and blue intensities forming the color. [00:05:53] And the idea is we take this array of numbers for our image and we flatten it out into a vector of 3,072 numbers (32 × 32 × 3). And then we're multiplying this vector by our weight matrix W.
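The two metrics, applied to the up-0.5, right-0.5 example from the slide (a sketch assuming NumPy):

```python
import numpy as np

def l1_distance(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences; no diagonal moves."""
    return np.sum(np.abs(a - b))

def l2_distance(a, b):
    """Euclidean (L2) distance: the everyday straight-line distance."""
    return np.sqrt(np.sum((a - b) ** 2))

# The example from the slide: a point 0.5 up and 0.5 right of the origin
origin = np.array([0.0, 0.0])
point = np.array([0.5, 0.5])
# L1: 0.5 + 0.5 = 1; L2: sqrt(0.5^2 + 0.5^2), which is shorter
```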
[00:06:08] And the basic idea is, if we have a weight matrix W that has a height here of 10, and the width is 3,072, we're multiplying each of these rows by our input sample x, and this will give us 10 resulting class scores. [00:06:28] Oftentimes we'll add a bias term as well, which would just be one bias term for each class, so this would be a size-10 vector here. [00:06:36] And we also talked about three different ways you can view or think about these linear models. One is the algebraic viewpoint, which I described here, where each row is represented sort of independently, representing the class, and you multiply it by the input vector x, you get your score, and you add the bias to get your final score. You do each row sort of independently in this sense.
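The algebraic viewpoint can be sketched as follows; the random W, b, and x are placeholders with the CIFAR-10-style shapes from the lecture (10 classes, 3,072 = 32 × 32 × 3 inputs):

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.standard_normal((10, 3072)) * 0.01  # weight matrix: one row per class
b = np.zeros(10)                            # bias: one term per class
x = rng.standard_normal(3072)               # a flattened 32x32x3 input image

scores = W @ x + b                          # f(x, W) = Wx + b: a vector of 10 class scores
# Each score is one row of W dotted with x, plus that class's bias, computed independently
```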
[00:07:02] You can also view these learned class weights as templates, where if we then reshape the vector back into the original shape of the image, we can plot the intensities here and understand what the template per class is, which is what this visualization represents. [00:07:24] And then the final way you can think about it is a sort of geometric viewpoint, where each of these rows in our weight matrix is represented by one of these lines here in our input space, and specifically the line is where we set this equation to zero, which is the decision boundary. So this forms the point where, above the line, you could have a positive score, and below the line you would have a negative score for the class. So these are sort of the different viewpoints for how you can view these linear models. They're all doing the
[00:08:02] They're all doing the same thing. And one nice thing about the geometric viewpoint is that if you visualize your data, say you want to classify blue versus red here, it's very easy to tell that you can't draw a line that perfectly separates the data. So it's a nice way to gain intuition about what is possible for a linear model to do. Okay, I think that's the high-level recap of what we discussed last time. I'll actually be going into a bit more detail on the new content for this lecture now, but I just wanted to pause briefly: if anyone has any questions about what we discussed last time, or at the beginning of this lecture, feel free to ask. Yeah.
[00:08:46] So the question, for those online, is: for this visual viewpoint, is this the same as running k-nearest neighbors, where this would maybe be one of the neighbors you're comparing against; are they mathematically equivalent? No, they're not the same, because these templates are formed from this line, so it's not one specific data point. If we look at this diagram, there's the line pointing in the direction of the class, so the template would be representing something more like this point here. Yeah. So the question is, how did we get this 3,072 number?
[00:09:27] So the idea is that if the height of our image is 32 pixels and the width is 32 pixels, and each location in the image is represented by three values, the red, green, and blue pixel intensities, then we get 32 × 32 × 3 total values to represent the entire image, and that's how we get this 3,072 number. So here's a very specific example of a linear model. When we multiply our input x by our weight matrix W, we get the resulting scores for these different classes. You can see that for cat it's not doing so well, because car has a higher score, and we want the highest score for the correct class. The second example does pretty well, because it gets it right. But then in the frog example it gets it completely wrong: frog is by far the lowest score of the three.
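As a sketch, here is the 32 × 32 × 3 flattening just described, plus the "highest score wins" prediction rule (the image and weights are random placeholder data):

```python
import numpy as np

rng = np.random.default_rng(1)

image = rng.integers(0, 256, size=(32, 32, 3))  # height x width x RGB
x = image.reshape(-1).astype(float)             # flatten to one long vector
print(x.shape)  # (3072,)

# With some (placeholder) weights, the prediction is the highest-scoring class.
W = rng.standard_normal((10, 3072)) * 0.001
scores = W @ x
predicted_class = int(np.argmax(scores))
```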
[00:10:19] So intuitively we can tell that these scores are not very good. But how do we mathematically formalize this intuition, and how do we determine how good a given model is? This is the idea of a loss function, which tells you how good, or more specifically how bad, a classifier is. So given a dataset of examples, where we index by the letter i, x_i is each of the training examples and y_i is each of the training labels. We can compute the loss over our entire dataset by calculating the loss for each training example, sending it through our model, f(x_i, W), comparing the prediction to the ground-truth label y_i, and then just taking the average over the whole dataset. So that's how we do this.
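The averaging just described can be sketched as follows; `per_example_loss` stands in for any per-example loss L_i (for instance the softmax loss), and the function and argument names here are made up for illustration:

```python
import numpy as np

def total_loss(X, y, W, per_example_loss):
    """Average the per-example loss over the whole dataset.

    X: (N, D) training examples, y: (N,) labels, W: (C, D) weight matrix.
    per_example_loss(scores, y_i) computes L_i from the class scores.
    """
    losses = [per_example_loss(W @ X[i], y[i]) for i in range(X.shape[0])]
    return float(np.mean(losses))
```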
[00:11:11] We talked about in the last lecture the softmax loss, or cross-entropy loss, which is the most commonly used loss for classification. I won't discuss that again in as much detail here, but basically it's a very high loss when you predict a low probability for the correct class, and a very low loss when you predict the correct class with very high probability. Everything I just explained is contained within what we call the data loss. This is a loss that tells you how well the model predictions match our training data. Obviously we want this to be very low, and if it's very low, it means our model is fitting our training data well. But there's a second component, which I'll discuss today, which is the regularization term of the loss function.
[00:12:03] What this does is, it's intended to prevent the model from doing too well on the training data. So it actually does worse on the training data, but the goal is to make it do better on new test data, or unseen data. Worse on training, but better on a test set: that's the point of regularization. We'll go over a lot of the intuition for how to think about it in the next slides, but the high-level goal is to do worse on the training data but then better on the test data, or just unseen data. Yeah. So we're computing the loss on each of the i training examples. Yeah, the loss of the i-th example uses x_i and y_i. Does that make sense? I mean, you could not have an i here, but this is just denoting the i-th loss term. Yeah.
[00:12:50] You normally don't have a different loss for each i, if that's what you're asking. We describe L_i as the loss for the i-th training example; that's just the notation we're using here. But yeah, it could be. So for regularization, people usually have this intuition when thinking about it. This is a toy example, and the idea is we want to fit some function to these points, where our input is x and our output is y, and say you have two different types of models, f_1 and f_2, and you're trying to decide which of these is better. f_1 goes through all of our data points, so the training loss, the data loss, will be very low, because it's fitting them basically perfectly. Whereas f_2 doesn't go through every point perfectly, but intuitively it feels like f_2 is probably a better model when we're now testing on new data we've never seen before.
[00:13:44] So regularization captures this intuition that you don't want to overfit your data so hard; you might actually be better off with a model that fits the data less well but is either simpler or has some other properties that make it a better choice. And so if we ask how these models are going to do on new data that's within the same distribution, you'll find that f_2 does a much better job of modeling: it's doing better on the unseen data.
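This toy picture is easy to reproduce with numpy polynomial fitting; the specific data and degrees below are invented for illustration. A degree-9 polynomial plays the role of an f_1 that passes through every training point, and a straight line plays the role of f_2:

```python
import numpy as np

# Points from a simple linear trend, with one noisy training point.
x_train = np.arange(10, dtype=float)
y_train = 2 * x_train + 1
y_train[4] += 5.0  # the "noise" that f_1 will chase

f1 = np.polyfit(x_train, y_train, deg=9)  # hits every training point
f2 = np.polyfit(x_train, y_train, deg=1)  # simple line, misses the noisy point

# Held-out points from the same underlying trend.
x_test = x_train[:-1] + 0.5
y_test = 2 * x_test + 1

def mse(coeffs):
    return float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
# f1 has near-zero training loss but a far larger test error than f2.
```

The interpolating polynomial oscillates between the training points to hit the noisy one, which is exactly the overfitting the slide illustrates.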
[00:14:15] I think there's also an intuition demonstrated very well in this previous example, where we're preferring simpler models. It's like Occam's razor, which is this idea in philosophy and scientific discovery that if you have multiple competing hypotheses, you should go with the simplest one first, and only if you know for sure that it's wrong should you start trying out more complicated ones. That's maybe also some intuition you can have for why regularization can be useful. Okay. And then one final thing about this equation that I didn't touch on yet is this lambda parameter here. This is the regularization strength, which is another hyperparameter, so we might use training and validation sets to set the optimal lambda as well.
[00:14:57] But the basic idea is, we can set this to a floating-point value between zero and infinity, where zero means there is basically no regularization, and as you go up toward infinity you get progressively stronger regularization. So it's very much a tunable knob you have for determining how much you want to prevent the model from fitting to your training data. And I'll go through some simple examples of regularization now. So here we have L2 regularization: you take your weight matrix, square each of the terms in it, and sum them all together. That gives you the value that you then multiply by lambda and add to your total loss. That's L2 regularization. L1 regularization is very similar, but instead of squaring, you take the absolute value.
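A direct numpy translation of those two definitions (a sketch; `lam` here stands for the lambda knob just described):

```python
import numpy as np

def l2_penalty(W):
    # Square every entry of the weight matrix and sum them all.
    return float(np.sum(W ** 2))

def l1_penalty(W):
    # Same idea, but with absolute values instead of squares.
    return float(np.sum(np.abs(W)))

# The full objective is then data_loss + lam * penalty, for example:
# loss = data_loss + lam * l2_penalty(W)
```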
[00:15:51] So in practice there are some differences between how these two regularizers behave when you're training models. One thing that happens with L2 regularization, because you're squaring each of the values: when you have a really small value, say 0.001, squaring it makes it even smaller. So L2 regularization allows for these really small values close to zero, because once you square them they become even smaller, and your penalty on them is very, very low. Whereas with L1 you're not squaring, so the penalty is just whatever the baseline magnitude was; it doesn't get smaller before you compute the regularization term.
[00:16:30] So in practice, what this leads to is that with L1 regularization you get a lot more values in your weight matrix that are actually zero, or very close to zero, whereas with L2 the weights are generally more spread out, with values that are small but nonzero, because the penalty becomes so small. So the question is: it seems pretty clear why L2 prefers spread-out weights that are all small, but why does L1 prefer sparse vectors? I think the way to think of it is that if a value can be zero and your performance is roughly the same, then L1 pushes you toward zeroing that value, whereas with L2 the value might just become very small but nonzero, because of the squaring. And the next question is, can we talk about what pushing toward a zero value means? We're going to talk more about how we use this loss term, but the basic idea is we're trying to minimize it: we're trying to minimize the loss, or minimize the error, of our model. And if we have a term here that's producing positive values without affecting the model's performance on the data loss, we will remove those through the optimization procedure. It's a trade-off: you're trying to optimize the joint sum of the regularization term and the data loss term. So if your data loss isn't changing much but you're able to go lower on the regularization term, you'll get a more optimal model; it will be preferred, based on trying to minimize the overall objective. I think we'll also touch later in the course on much more complex forms of regularization. They're all doing this same basic idea of doing worse on the training data to do better on the test data.
[00:18:14] But some of them will even change the layers of your model, so they actually get pretty complicated. This is an ongoing research area, how to regularize models; there are new papers each year. So there's lots of material here, and we'll only cover a small subset in this course. So, to summarize: why do we regularize models? The first reason is that it allows us to express some sort of preference over weights. If for some reason, in our problem, we think the solution should be spread out, or should contain a lot of sparsity, where a lot of the values in the weight matrix are zero, we might prefer one kind of regularization, L2 versus L1, over another. It also can, depending on how we're regularizing, make the model simpler, so that it works better on test data.
[00:18:59] So it could simplify the model if, say, we're heavily regularizing the really high-degree polynomial terms in our model, as in the example I showed earlier. And something we won't touch on in too much detail is that L2 regularization especially can actually improve the optimization process, because the squared term is like a parabola: if you plot y = x^2, it's a parabola, and these are convex, so you get a lot of nice optimization properties, like having a global minimum. We won't cover that in this course, it's beyond the scope, but know that for certain types of optimization, the regularization actually helps train the model faster too. Okay. I have a question for you all, and what we'll do is, you'll hold up one if it's W1 and two, with your hand, if it's W2.
So uh which of these two um weights w1 w2 would l the [00:19:52] these two um weights w1 w2 would l the l2 regularizer prefer? So we have our [00:19:55] l2 regularizer prefer? So we have our input x. It's when you multiply it, you [00:19:58] input x. It's when you multiply it, you do the dotproduct with the weights, you [00:20:00] do the dotproduct with the weights, you get the same score. So you get a score [00:20:01] get the same score. So you get a score of one either way. And here's where the [00:20:03] of one either way. And here's where the data loss would be the same. And we're [00:20:05] data loss would be the same. And we're trying to determine which of the uh [00:20:08] trying to determine which of the uh weights would our regularizer prefer. So [00:20:10] weights would our regularizer prefer. So go one if you think it's W1 and go two [00:20:13] go one if you think it's W1 and go two if you think it's W2. [00:20:15] if you think it's W2. All right, lots of twos. Yeah, it's W2 [00:20:17] All right, lots of twos. Yeah, it's W2 because as you said, it's more spread [00:20:18] because as you said, it's more spread out. You're going to be squaring each of [00:20:20] out. You're going to be squaring each of these turns. So, it's 1/4. You square [00:20:21] these turns. So, it's 1/4. You square it, becomes 1/16th. You sum it all [00:20:23] it, becomes 1/16th. You sum it all together, it's 1/4 is the total [00:20:26] together, it's 1/4 is the total regularization term here. And then here, [00:20:28] regularization term here. And then here, it's, you know, you square it, so it's [00:20:29] it's, you know, you square it, so it's one. So, it's four times lower in terms [00:20:32] one. So, it's four times lower in terms of the regularization loss. [00:20:34] of the regularization loss. Um, and as we said, the intuition is you [00:20:36] Um, and as we said, the intuition is you like more spread out weights. Um, and [00:20:38] like more spread out weights. 
[00:20:40] And then here's another question: which one would L1 prefer now? So, one if it's weight one and two if it's weight two. Okay, we got a lot of ones. This one's actually a bit of a trick question. With L1 regularization you sum the absolute values of the terms, so they'll both sum to one. In practice, you probably would see this one, because, as we said, L1 prefers sparsity, but from a loss standpoint these two weights would actually be equivalent under L1, because one is just the sum of 0.25 four times, and the other is just one. They both sum to one, and so the actual regularization term is the same for these. Yeah. Okay, so what's an example where L1 would be preferred: if this were, say, 0.9, for example? Okay.
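Both quiz answers are easy to verify numerically. Assuming the values the arithmetic in the discussion implies (x all ones, W1 with a single 1, W2 with four 0.25 entries):

```python
import numpy as np

x  = np.ones(4)
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])

# Same dot product, so the data loss is identical either way.
print(w1 @ x, w2 @ x)  # 1.0 1.0

l2 = lambda w: float(np.sum(w ** 2))
l1 = lambda w: float(np.sum(np.abs(w)))

print(l2(w1), l2(w2))  # 1.0 0.25 -> L2 prefers the spread-out w2
print(l1(w1), l1(w2))  # 1.0 1.0  -> L1 is indifferent between them
```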
[00:21:34] So just to recap: we have a dataset of (x, y) pairs, and we have some way to calculate scores for each of the classes, which in our case is just a linear model; you're doing a matrix multiply. The loss for each of the i training examples under the softmax loss, which we discussed last time, is: you exponentiate each of your scores, and then you divide by the total sum of the exponentiated scores. So you exponentiate to make them all positive, and then you normalize by the sum to get a probability distribution. The final values all sum to one, and you have a score for each class. Then you take the minus log of the probability of the correct label, which is given here.
[00:22:15] And the full loss is: you run this over each of your training examples, calculate L_i for each of those, and then you add your regularization term here, which depends on the weights of your model. Why do we use softmax in general? Softmax is great because, as a function, it converts any set of floating-point numbers into a probability distribution that sums to one, and the value of each score translates to the relative probability of that value. So if you have one really high positive number and everything else is a very low negative number, the softmax will be nearly one for that value and almost zero for the others. It's nice because it converts any list of floating-point numbers into a list of probabilities based on the values in the list. That's the main utility of softmax.
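A minimal sketch of the property just described; the max-subtraction is a standard numerical-stability trick (it doesn't change the result, but keeps `exp` from overflowing) and is an addition here, not something from the lecture:

```python
import numpy as np

def softmax(scores):
    # Shift by the max so exp() never overflows; softmax is shift-invariant.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()          # normalize so the outputs sum to one

p = softmax(np.array([3.2, 5.1, -1.7]))
print(p, p.sum())               # a probability distribution summing to 1
# One very large score dominates: the output is nearly one-hot.
print(softmax(np.array([100.0, -100.0, -100.0])))
```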
[00:23:06] So the question is: you can view the regularization we talked about, L1 and L2, as a way of regularizing based on the magnitude of the weights, which is true, so how does that translate to simpler models? I think for L1 the explanation is actually pretty simple, because if we prefer weight vectors with a lot of zeros in them, we get a linear model with fewer coefficients. So that one is relatively straightforward. But in general, regularization is not always going to give you a simpler model; it depends on how it's used. For example, in the diagram we showed at the very beginning, you could imagine L2 or L1 regularization that penalizes the higher-degree polynomial terms of your function more heavily.
[00:23:54] So in that sense it's pretty clear how you could design regularization to prefer a simpler model. But it doesn't always need to be that way. Really what it is is this idea of doing worse on the training data in order to do better on the test data, and that's not always going to give you a simpler model. In fact, there are many types of regularization, like dropout, that actually make your model more complex but give you better performance on the test data. [00:24:23] Cool. So now that we've talked about how to calculate how good a given W is, based on the training data and this regularization term, the question is: how do we actually find the best W? That's what optimization is, which is the second half of today's lecture.
[00:24:45] I think when people describe optimization, they usually use this idea of a loss landscape, which you can think of like a normal landscape on planet Earth, where the vertical, or z-axis, direction is the loss. That's the value you're trying to minimize, and in this example you have two parameters in your model, which are the x and y directions of where you are in the landscape. The idea is that you're basically a person walking around this landscape, trying to find the lowest point in the entire landscape. I think one of the reasons this very commonly used analogy falls apart a little is that, as humans, we can just look into the distance and see the lowest point of the valley.
[00:25:26] But I think the analogy is actually pretty accurate if you think of the person as being blindfolded. They don't have access to any visual information; they can only feel the ground where they are right now and sense the slope at the point where they're standing. Viewed through a math lens, this analogy becomes extremely accurate for how we try to find the best model: we have this complex landscape of different loss values depending on the parameters of our model, and the parameters translate to the location of the person in the landscape. [00:25:55] So how can you find the best point? We could go with a really simple idea, maybe a really bad idea, but it could work.
[00:26:06] Here it's basically a for loop where we try a thousand different values of W at random and just choose the best one. Obviously not very mathematically rigorous, but you will do better than a random baseline, and if you had nothing else to go on, maybe this isn't so bad. You would get about 15.5% accuracy on the CIFAR-10 data set, the one I showed earlier with the frog and the car and so on, with its 10 different categories. But it doesn't perform very well; this data set is basically solved through modern deep learning, where the state of the art gets 99.7% accuracy. So clearly random search isn't terrible, but I wouldn't say it's particularly good either. [00:26:48] Strategy number two, which is sort of what I explained a bit earlier, is this idea of following the slope.
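The random-search for loop described above can be sketched as follows. The lecture's version would evaluate the softmax loss of a linear classifier on CIFAR-10; here a toy quadratic loss and the weight shape are stand-ins so the snippet runs on its own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the real data loss; in the lecture this would be the
# softmax loss of a linear classifier W over the training set.
def loss_fn(W):
    return float(((W - 0.5) ** 2).sum())

best_loss, best_W = float("inf"), None
for _ in range(1000):                        # try 1000 random weight matrices
    W = rng.standard_normal((10, 5)) * 0.001  # small random guess
    loss = loss_fn(W)
    if loss < best_loss:                      # keep the best one seen so far
        best_loss, best_W = loss, W

print(best_loss)
```

This is exactly "guess and check": no use of slope information at all, which is why strategy two improves on it so dramatically.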
[00:26:56] For this, you can imagine you're blindfolded on the loss landscape, feeling the ground underneath you and thinking, okay, which way is the slope of the earth pointing me? And you walk in that direction at all times. This basic idea is the fundamental way we train all the models in this course, and the way basically all deep learning models are trained: you're feeling the current location in the loss landscape and walking down the hill. That's the intuitive way to explain it; we'll now go over more of the math behind it, but this is what you should be visualizing in your head. [00:27:33] So how do you actually follow the slope?
[00:27:35] In one dimension, I'm sure you're all familiar with the idea of a derivative, which in calculus we can think of through the limit-h definition: we add a very small number h to our current location, calculate the value of the function at that new location, subtract the value at the current location, and divide by the step size. Taking the limit as h approaches zero gives us the derivative of the function at that point. That's for 1D, but in multiple dimensions you use the gradient, where you calculate essentially this limit definition for each of the values separately. You get a different derivative for each value, so you get a vector instead, and this gives you the direction along each dimension.
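In symbols, the limit definition and its vector analogue just described are (writing L for the loss and W for the d weights):

```latex
\frac{df(x)}{dx} \;=\; \lim_{h \to 0} \frac{f(x+h) - f(x)}{h},
\qquad
\nabla_W L \;=\; \left[\frac{\partial L}{\partial W_1},\; \frac{\partial L}{\partial W_2},\; \dots,\; \frac{\partial L}{\partial W_d}\right]
```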
[00:28:27] You can calculate the slope in any direction by taking the dot product of the gradient with that direction, and specifically the direction of steepest descent, down the hill, is the negative gradient. The gradient points up the hill; the negative gradient points down the hill. So that's the direction we should travel if we're trying to get to the bottom of the loss landscape. [00:28:50] So what are some ways you can calculate the derivative? A really simple one is to actually use the limit-h definition with a very small h. You add, say, 0.00001, and the last few digits of the loss change slightly. You compute the difference, divide by the step size, and get an approximation of the derivative.
[00:29:11] You could do this for each of the values in your W; you just repeat the procedure over and over. But it has a few problems. It's very slow, because you need to loop through every value. It's also approximate: you're not calculating the actual derivative, and especially with floating-point arithmetic you can get pretty significant errors. So this is not really preferred, but the basic intuition is that we could calculate the derivative this way. [00:29:40] But really, we have the loss as a function of W. We know how to calculate the scores to get our loss, which is given by the function for our model, and we can then compute the total loss with the regularization terms as well. This entire loss is a function of the W's, the x_i's, and the y_i's.
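The procedure just described (nudge one entry of W by a tiny h, recompute the loss, divide by h, repeat for every entry) can be sketched like this; the toy quadratic loss is an assumption here so the gradient can be checked against a known answer:

```python
import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    """Approximate dL/dW one coordinate at a time: slow and approximate."""
    grad = np.zeros_like(W)
    base = loss_fn(W)                    # loss at the current W
    it = np.nditer(W, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h                 # nudge one weight by a tiny h
        grad[idx] = (loss_fn(W) - base) / h
        W[idx] = old                     # restore before the next coordinate
    return grad

# Toy loss with a known gradient: L(W) = sum(W^2), so dL/dW = 2W.
W = np.array([[1.0, -2.0], [0.5, 3.0]])
g = numerical_gradient(lambda w: float((w ** 2).sum()), W)
print(g)   # each entry is close to 2*W
```

Note the two drawbacks from the lecture are visible here: one full loss evaluation per weight (slow), and a finite h instead of a limit (approximate).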
[00:30:07] So you have your W matrix, you have your x_i's and y_i's, and then you have this formula with maybe some logs and exponents, but fundamentally this is a function of W, x, and y, and we specifically want to calculate the gradient, written with the Greek letter nabla, of our loss with respect to the weights. So we imagine our x_i's and y_i's are held constant, and we calculate the derivative just with respect to the weights. [00:30:35] To do this we can just use calculus: the chain rule and the other methods we've learned for calculating derivatives of somewhat complex equations. You need some logs, exponents, and chain rules here to solve it.
[00:30:52] This will be an exercise in the homework, so I won't go through it step by step now, but it's relatively straightforward, and conceptually it should make sense to you how to do it: you assume the x's and y's are constant and solve for the derivative as you change W. So now we have a way to calculate dW, the gradient of the loss with respect to W, given our data, the current W, and whatever our loss function is, which tells us how to compute the error. [00:31:20] So here's a summary. You could use the numerical gradient, but it's approximate and slow; the nice thing is that it's very easy to write. You just add a really small h, take the difference, and divide by h.
[00:31:34] The analytic gradient is nice because it's exact and fast, but if you're writing new code to calculate a gradient from scratch, you could have an error in it. So if you're doing this, people normally run a gradient check: they compute the numerical version with a really small h and make sure it lands in the same neighborhood as the analytic gradient. That's a good way to make sure you don't have any bugs in your code. There will be gradient checks in your homework assignments to make sure your implementations are correct as well.
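A gradient check as just described, comparing a hand-written analytic gradient against the numerical one on a toy loss. The relative-error threshold of 1e-4 is a common rule of thumb, an assumption here rather than something fixed by the lecture:

```python
import numpy as np

# Toy loss L(W) = sum(W^3); its analytic gradient is 3*W^2.
def loss_fn(W):
    return float((W ** 3).sum())

def analytic_grad(W):
    return 3 * W ** 2

def numeric_grad(loss_fn, W, h=1e-5):
    # Finite-difference approximation, one coordinate at a time.
    g = np.zeros_like(W)
    for i in range(W.size):
        Wp = W.copy()
        Wp.flat[i] += h
        g.flat[i] = (loss_fn(Wp) - loss_fn(W)) / h
    return g

W = np.random.default_rng(0).standard_normal((3, 4))
ga, gn = analytic_grad(W), numeric_grad(loss_fn, W)
rel_err = np.abs(ga - gn).max() / (np.abs(ga).max() + np.abs(gn).max())
print(rel_err)          # tiny, i.e. "in the same neighborhood"
assert rel_err < 1e-4   # the gradient check passes
```

If the analytic code had a bug (say, `2 * W ** 2`), the relative error would be large and the assertion would fire, which is exactly the point of the check.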
[00:32:02] Yeah. So the question is: we often say we want a loss function that's differentiable, because then we can calculate the gradients; but if we somehow had a better loss function whose gradient we couldn't calculate analytically, could we use this numerical h method instead? I think in general it's hard to construct a better loss function that would be non-differentiable. You possibly could, though, and if there really is a loss function that is best for your case but non-differentiable, you could go with this approach and it may work.
[00:32:41] I think it would struggle if, for example, your loss is truly non-differentiable across all points and is basically a cluster of disconnected points; then moving in the direction of steepest descent wouldn't necessarily get you to your best solution, if the points aren't well connected and forming this sort of geography. So it could work, but I would think that if your loss is non-differentiable across most of the domain, you probably wouldn't be able to use these approaches to find the bottom point. [00:33:14] Yeah. So the TL;DR of the explanation is: if your function is convex, it works very well with this sort of gradient descent or steepest descent approach.
[00:33:24] But if you have a non-differentiable, non-convex function, this approach probably won't work as well, because you won't be stepping in the right direction. It's not necessarily error-prone if your code is perfectly good, but maybe you have a mistake in your code and it's hard to tell right away. The limit-h definition, on the other hand, is very easy to code up: you just set h to a very small value, run your function, and add a very small amount, so it's less error-prone to implement. Okay, not more error-prone, if it's working correctly. [00:33:58] Okay, so now I'll talk about the fundamental algorithm for optimization, called gradient descent. The basic intuition is what we already explained: we calculate the slope at each point on our loss landscape and we take a step in the direction
downwards, towards the bottom of the loss landscape. [00:34:13] So what we do is calculate the gradients of our weights given the loss function, the data, and our current weight values. This tells us how much we should change each of the weights to go down the slope. Then we need a step size: how far down the hill we step in that direction. You go down the hill, so there's the minus sign, and you move by the step size times the gradient. That's basically what gradient descent is: you calculate the gradient at each step and move a fixed amount in the direction of the negative gradient, down the hill. [00:34:51] So here's a concrete example.
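The update rule just described, a step of fixed size in the negative gradient direction, in a minimal sketch. The quadratic bowl loss, the step size of 0.1, and the 100 iterations are illustrative choices, not the lecture's:

```python
import numpy as np

# Quadratic bowl: L(w) = sum((w - 3)^2), minimized at w = 3.
def grad(w):
    return 2 * (w - 3.0)       # analytic gradient of the bowl

w = np.zeros(5)                 # start somewhere on the landscape
step_size = 0.1                 # a.k.a. the learning rate
for _ in range(100):
    w -= step_size * grad(w)   # the minus sign: walk DOWN the hill

print(w)   # close to [3, 3, 3, 3, 3]
```

Note the effective step shrinks automatically as the bowl flattens, since the gradient itself gets smaller near the bottom, which is the behavior discussed on the next slide.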
[00:34:54] Instead of drawing a 3D loss landscape, people often visualize it like this, looking down at the landscape from above, where purple represents the highest points and red represents the bottom, the valley. We can imagine we have our original W; we calculate the loss, and we know the direction of the slope, the negative gradient direction. This arrow might represent the fixed step size we talked about before: we take a fixed-size step in that direction. [00:35:25] Yes. So it is a fixed step size, but as the gradient becomes smaller we're still multiplying it by that fixed step size, so the effective step actually does become smaller, because the gradient shrinks near the end, where the landscape is flatter.
[00:35:41] So this is what it looks like when we always head in the direction of steepest descent. The question is: when we're stepping down, how do we know when to stop? Well, in this formula you just keep looping forever, so you never stop, which is probably not the best. Normally you either run for a predetermined number of iterations, or you look at whether the loss is still changing significantly: you can set a tolerance for how much you expect the loss to keep decreasing by, and if it's only decreasing by 1e-5 or 1e-9, maybe you stop there because it's good enough.
So those are the two ways you can determine when to stop: a fixed number of iterations, or a stopping criterion for how much we're no longer really improving. [00:36:28] Okay. So now I'll talk about the most popular variant of gradient descent, which is called stochastic gradient descent. [00:36:37] When we talked about gradient descent before, we talked about calculating the loss of our weights by summing the loss Li for each i over our entire training set of n examples. But this is potentially a lot of computation if we have a very large data set. So what stochastic gradient descent is, is basically now, instead of looking at the entire data set, we're looking at a subset each time, which we call a minibatch, or a batch of data. And so here, if we look at the code, we're sampling 256 data points from our data set.
So the batch size is 256. We evaluate the gradient on this 256-example subset of our data set, and then we do the same thing as before. The reason why it's called stochastic gradient descent is because we're sampling a random subset of our data set at each step of the algorithm. So this is stochastic gradient descent: you're basically running it on a random subset each time. [00:37:37] And in practice, people won't just sample completely at random. They'll make sure to get through all the examples in their data set and then sort of loop around again. And that's called one epoch of training, where you loop through all your data samples once, in a random order. [00:37:54] Okay. There are some problems with gradient descent, or stochastic gradient descent.
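One epoch of minibatch SGD along these lines might look like the sketch below. This is a hedged toy version, not the slide's code: the data set, the `evaluate_gradient` helper, and the squared-error per-example loss are all assumptions made so the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data set: the per-example loss ||w - x_i||^2 is minimized at the
# data mean, standing in for the real Li from the slides.
data = rng.normal(size=(1000, 4))        # n = 1000 examples, 4 features
w = np.zeros(4)

def evaluate_gradient(batch, w):
    # gradient of the minibatch-averaged loss (1/B) * sum ||w - x_i||^2
    return 2.0 * (w - batch.mean(axis=0))

learning_rate, batch_size = 0.1, 256

# One "epoch": shuffle once, then walk through every example exactly
# once in minibatches, rather than sampling independently each step.
order = rng.permutation(len(data))
for start in range(0, len(data), batch_size):
    batch = data[order[start:start + batch_size]]
    w -= learning_rate * evaluate_gradient(batch, w)   # SGD update
```

Each update only sees one minibatch, so the steps are noisy, but over the epoch every example contributes exactly once.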
So this visualization is sort of the same type as the colored one I showed before, where we're looking down at the loss landscape. But these curves are called level sets: each is a set of points where the loss is the same on all of them. So this is another, very popular, way of visualizing the loss, looking top-down, but without the colors. [00:38:20] And so you could imagine that you have this phenomenon where it's like a really narrow valley, where it's really steep along the sides, and you're trying to traverse the center of the valley, and gradient descent actually does run into issues here. Does anyone have any ideas for what could go wrong? [00:38:39] Yeah. So one of the things you could do is overshoot, where you're sort of moving up and down along this direction.
And if it's steep enough and your step size is large enough, you might actually oscillate out of the valley. So you can imagine, if your step size is very large and this is really steep, you're actually moving out and out each time, because you always have this fixed step size. So if it's steep enough, you could just bounce out of the valley. That actually does happen if your learning rate is too large. So that's one thing that can happen. [00:39:10] And then also, even if your learning rate, or your step size, is not too large, you can have this phenomenon where you're sort of jittering, because the gradient is much larger in the steep direction.
So you're sort of jittering, but you're not making very much meaningful progress towards the actual center, because you're spending all this time oscillating back and forth, up and down. So this is a pretty big issue with just default SGD. [00:39:36] And then, mathematically, just as an aside: the loss function here is considered to have a high condition number, which is the ratio of the largest to smallest singular value of the Hessian matrix, which is the second derivative. So you can imagine, the second derivative along this up-and-down direction is very high, but side to side it's very low, because it's very flat. So that's what causes this phenomenon. [00:40:00] All right. So one other issue we might have with SGD is: what happens if the loss function has a local minimum or a saddle point?
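As a toy numerical illustration of this aside (my own example, not from the slides): for a quadratic loss the Hessian is a constant matrix, so the condition number is easy to read off, and a condition number of 100 is already enough to make fixed-step gradient descent overshoot along the steep coordinate while barely moving along the flat one.

```python
import numpy as np

# Toy ill-conditioned quadratic: L(w) = 0.5 * (100 * w0**2 + 1 * w1**2),
# very steep in one direction and very flat in the other.
H = np.diag([100.0, 1.0])                # Hessian (matrix of second derivatives)
sv = np.linalg.svd(H, compute_uv=False)  # singular values of the Hessian
cond = sv.max() / sv.min()               # condition number: 100 / 1 = 100

# Gradient descent on this loss jitters: the steep coordinate w0
# overshoots back and forth while the flat coordinate w1 barely moves.
w = np.array([1.0, 1.0])
step_size = 0.018
for _ in range(5):
    w = w - step_size * (H @ w)          # gradient of the quadratic is H w
```

After five steps the steep coordinate has flipped sign (it keeps overshooting the minimum) while the flat coordinate has hardly left its starting point, which is exactly the jittering behavior described above.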
So, for example, here, for just the very end of this curve, it's completely flat. So if we were to imagine we're moving down the hill here, we would just get stuck, because it's flat, and we wouldn't be able to progress any further, because when we take the gradient here, it's zero. So this is actually a pretty big issue: it'll get stuck in a local minimum, because once we reach here, we don't really have any direction to go, the gradient is zero or very small, and we'll just sort of oscillate back and forth here. And here it could actually get stuck on this bottom example, because the gradient is zero here, even though, if it went a little bit further, it could go down significantly more. [00:40:55] Yeah. So the question is, maybe we can change the way we're doing the steps.
Maybe we could use the Hessian to determine the direction we go. We actually do have a brief slide talking about the sort of Hessian-style approach at the very end. That's not very commonly used in deep learning. But the short answer is yes, there are actually going to be several ways in which you can account for this, that we're going to go into in like five minutes. So it's a good question. Yeah, we'll get to that. [00:41:22] Okay. So I think one of the other things that you might not know is that, empirically, saddle points are actually much more common as you move to higher-dimensional models. So as your weight matrix gets larger and larger, you're more likely to find these saddle points. And there's this paper describing the frequency of them. If you don't know what a saddle point is, it's called a saddle point because it's shaped like a saddle, like on a horse.
And at the center of this saddle, the gradient is actually zero in all directions. So it's like the bottom of this curvature in one direction and the top of it in the other, and so in both the x and the y directions the gradient is zero. So you could get stuck here, despite being very close to going significantly down the loss landscape on either side. So this is also a pretty common issue with SGD, these saddle points, and as we move to higher-dimensional spaces, which is equivalent to models with more parameters, this is more and more common. This is a big issue. [00:42:16] And then a final issue with SGD is that we are sampling a subset of our data each time, right? So we're not looking at the whole thing: this represents the entire loss across all the data, but we're looking at just a subset each time.
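The classic saddle f(x, y) = x**2 - y**2 makes this concrete. This toy sketch (mine, not the lecture's) shows gradient descent parked exactly at the saddle point, and how a tiny perturbation off the ridge, the kind of noise SGD provides for free, lets the iterate escape.

```python
import numpy as np

# Classic saddle: f(x, y) = x**2 - y**2. At the origin the gradient is
# zero in every direction, yet moving along y would decrease f.
def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

# Start exactly on the saddle point: gradient descent never moves.
p = np.array([0.0, 0.0])
for _ in range(100):
    p = p - 0.1 * grad(p)
stuck = p.copy()                          # still (0, 0)

# A tiny nudge off the ridge and the iterate escapes: the y coordinate
# grows multiplicatively away from the saddle on every step.
p = np.array([0.0, 1e-6])
for _ in range(100):
    p = p - 0.1 * grad(p)
```

This also matches the discussion below about why, in practice, SGD's sampling noise makes getting permanently stuck on a saddle unlikely.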
So we'll actually have somewhat noisy update steps, because we're not looking at the entire data set. So we'll sort of be stepping towards this local minimum that we're trying to reach here, but each step doesn't go directly in that direction. So there's some noise in how we're progressing, because we're subsampling the data set. [00:42:56] Okay, cool. So to summarize, these are the main issues, and there's a pretty neat trick you can do, where you basically just add momentum. And you can really think of this the same way as a ball that's rolling down a hill, where it gains momentum. It's actually very similar to how it's modeled in terms of the physical properties. So it's a good way to gain intuition about it, at the very least.
So you can imagine that it could help with these local minima, because if you're rolling down with enough velocity, you'll be able to come out of them. If you have the saddle points, or just the flat point here, the model has been rolling down the entire hill, so it won't get stuck here anymore. It will continue. [00:43:38] Also, if you have this poor conditioning value, you will still have maybe some oscillation, but the nice thing is that it will sort of accumulate speed in this direction, to the right, because it will have multiple steps that keep going that way. So it'll go faster and faster towards the center here. So it also helps with this problem. [00:43:57] Finally, it can also help average out some of the noise in the gradients, because they all sort of have a direction in common, which is towards this minimum here.
And so, as you're computing the momentum, it sort of builds on itself, and it will converge faster, because the noise is accounted for by looking at the direction they all share in common, which is included in the momentum. So let me show you how to actually do it. But this is sort of the general intuition for how momentum works. [00:44:31] So we have SGD here. We have our minibatch x. We're computing the gradient, which is dx. We have the learning rate, or the step size, which we multiply by, and then we take the negative, because we need to go down the hill. This gives us our new x. [00:44:45] And this is SGD with momentum. We're now updating by this velocity term. So instead of updating by the gradient at the specific point, we're updating by the velocity. And the velocity at a given time step is given by the previous velocity plus the current slope.
So this is sort of how you calculate it. And you have this rho value, which is the momentum, that is, how much momentum you want to have. And if it's very high, then your new velocity is more dependent on the previous time step's velocity. And this is therefore sort of a running average of the last gradients, and the momentum term here gives you how much to weight the past versus the present. So now we're updating by this, and we still have this alpha, which is the step size. So it's actually a very simple change, right? You're just now computing the velocity, which is a function of the current velocity plus our gradient. [00:45:38] So I think I'll pause for questions here. This is the explanation of momentum, and maybe I could also recap briefly how it resolves all these issues we saw.
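The velocity update just described fits in a few lines of numpy. This is a sketch in the slide's spirit, not its exact code: the rho = 0.9 and learning-rate defaults and the toy quadratic gradient are my illustrative choices.

```python
import numpy as np

def sgd_momentum_step(x, dx, v, learning_rate=0.01, rho=0.9):
    """One SGD+momentum update as described above: the new velocity is
    rho times the old velocity plus the current gradient, and we step
    by the velocity instead of by the raw gradient."""
    v = rho * v + dx
    x = x - learning_rate * v
    return x, v

# Toy quadratic with minimum at 0, so the gradient at x is just 2x.
x, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    x, v = sgd_momentum_step(x, 2.0 * x, v)
```

With rho = 0 this reduces to plain SGD; larger rho weights the accumulated past more heavily, which is exactly the past-versus-present trade-off described above.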
So, you know, now that you're adding momentum over the past gradient steps, you could see how it would keep continuing along this direction, and depending on your rho, if your momentum is very high, it would keep going and be able to account for a very large hump here with the local minimum. Also, it's very good at these saddle points, because it will just continue along the direction in which it was going previously for a significant amount of time. And poor conditioning: if we're cumulatively going to the right with each step, the momentum will also be consistent there and build up. And then, if we're oscillating significantly here, it will move less in that direction, because the values will sort of cancel out, between the current direction and the velocity.
Um they'll be pointing the opposite direction so it will get [00:46:36] opposite direction so it will get minimized. The question is what happens [00:46:39] minimized. The question is what happens if you're rolling like right along the [00:46:41] if you're rolling like right along the saddle? I mean I think in practice it's [00:46:42] saddle? I mean I think in practice it's very unlikely but in that case yeah you [00:46:44] very unlikely but in that case yeah you would be you would just get stuck uh in [00:46:47] would be you would just get stuck uh in the saddle. Yeah I think that's like you [00:46:49] the saddle. Yeah I think that's like you know your initial conditions like [00:46:50] know your initial conditions like wherever you start is very unfortunate. [00:46:53] wherever you start is very unfortunate. So uh yeah sometimes I guess that could [00:46:55] So uh yeah sometimes I guess that could happen but it's very unlikely. Yeah. And [00:46:57] happen but it's very unlikely. Yeah. And it's also why in practice people won't [00:46:58] it's also why in practice people won't run like a single model um training run. [00:47:02] run like a single model um training run. Often they'll run multiple ones with [00:47:03] Often they'll run multiple ones with different random seeds just in case [00:47:05] different random seeds just in case something like that could happen. [00:47:06] something like that could happen. Another thing is if you're doing [00:47:08] Another thing is if you're doing stochastic uh gradient descent, you're [00:47:10] stochastic uh gradient descent, you're much more likely to have at least a [00:47:11] much more likely to have at least a little bit of noise to get you out of [00:47:12] little bit of noise to get you out of like directly in that saddle uh back and [00:47:15] like directly in that saddle uh back and forth. So I think it's basically it [00:47:17] forth. 
So I think it basically never would happen, because of the randomness, but hypothetically, I think that could occur. Yeah. [00:47:23] So the question is, why is the saddle just an issue with SGD and not optimization in general? It would also be an issue with the entire data set. It might even be more common with the entire data set. So it's an issue that SGD faces, but other optimization algorithms that just rely on gradient descent, with no sort of bells and whistles attached, would face the same thing. Yeah. [00:47:45] Yeah. So the question is, does adding the momentum make it more difficult to converge, because we'll overshoot and then, you know, have to come back? I think the short answer is, yeah, it might not help with converging, but on average it will help you find a better minimum point to converge to.
So it will converge maybe more slowly, but you won't get stuck in a local minimum, like you would just converge here if there was no momentum, versus overshooting. So I think a lot of this stuff is empirically shown, where it happens to be that, with this specific class of neural networks, momentum does help training, but this is the intuition for why we prefer it. [00:48:23] To be honest, people will use whatever works best, and there are cases where people have found that stochastic gradient descent without momentum would outperform for a particular model. So here's the intuition about why it could perform better, but in practice, people will just try a bunch of different ones and see what works best. And I'm going over the most common ones that people try now. Yeah. But yeah, you're right.
It could hurt convergence, potentially. [00:48:53] Okay. All right, I'll continue then. So, yeah, I think we went through this. And one other thing I wanted to point out is that there are different ways you can formulate this. So these equations are identical, but sometimes, depending on the implementation, you'll see it written in different ways. They're doing the same thing. Maybe in the interest of time I'll skip over why they're identical, but you could go over the slide and prove to yourself that these are essentially the same formulations. [00:49:23] Okay. I think the next thing I'll talk about is a different optimizer. So we talked about momentum, and now we'll talk about something called RMSProp.
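Before moving on to RMSProp: the claim that the different momentum formulations are identical can be checked numerically. The two writings assumed below (since the slide itself isn't reproduced here) are the common pair, v = rho*v + dx with x -= lr*v, versus v = rho*v - lr*dx with x += v; for a constant learning rate they trace the same trajectory.

```python
import numpy as np

def grad(x):
    return 2.0 * (x - 1.0)   # toy gradient, minimum at x = 1

lr, rho, steps = 0.05, 0.9, 50

# Formulation 1: accumulate raw gradients, scale by lr when stepping.
x1, v1 = 4.0, 0.0
xs1 = []
for _ in range(steps):
    v1 = rho * v1 + grad(x1)
    x1 = x1 - lr * v1
    xs1.append(x1)

# Formulation 2: fold lr into the velocity itself.
x2, v2 = 4.0, 0.0
xs2 = []
for _ in range(steps):
    v2 = rho * v2 - lr * grad(x2)
    x2 = x2 + v2
    xs2.append(x2)
```

The equivalence follows from the substitution u = -lr * v: with lr constant, the two velocity recurrences are the same recurrence up to that rescaling, so the iterates coincide.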
[00:49:35] So RMSProp is a bit of an older method now, from 2012, that came out of Geoffrey Hinton's group, and the basic idea is, instead of just having this running velocity that momentum captures, to add elementwise scaling of the gradient. How do we do this? We have this gradient-squared term, and the decay rate here is very much like the momentum term we explained before, but now it's applied to the squared gradient. So we have a running average where we take the previous accumulated term times the decay rate, and then add one minus the decay rate times, literally, the gradient squared. So this is a running average of our squared gradients.

[00:50:21] So bigger values will get much bigger and smaller values will get much smaller, and if there are consistently large gradients in certain coordinates, those will get very large as we continue this running average. And in the update step we actually divide by the square root of it. Someone asked earlier, what if we change the direction in which we're stepping? This is exactly the type of thing you can do, and that's what dividing by this squared-gradient term does. For the values of w in which the derivative is very large, we divide by a larger value, so we don't step as far in that direction. In the flatter regions we step farther, because we're dividing by a smaller term.

[00:51:08] So this is the basic intuition behind it, and it very much addresses the earlier question about whether we can change the way we're stepping; that's exactly what this is doing. You still have a learning rate, but you're dividing it by this square root of the accumulated squared gradients, which gives you larger steps in the flatter areas of your loss landscape and shorter steps in the very steep areas. Can anyone explain? I just gave a brief summary, but what happens in this specific line of the code? What happens with our gradient step direction? How does it change? We're dividing by this value, which depends on the current gradient and also the past gradients. Say one of these values is very large. These are vector operations: [00:51:54] we have a set of derivatives here, and we're dividing elementwise by another set of squared-gradient values. When the denominator is very large, the step effectively becomes smaller in that direction, because we're dividing by a large value. And when it's a very small value, the step becomes much larger, because the gradient-squared term is small; it's in the denominator, so we're increasing the effective step size. Oh yeah, so it's specifically for this type of example here, where you have maybe a very narrow valley and you want to be moving more in the flatter direction.

[00:52:33] Yeah, the question is, what does a small gradient mean in this context, and how does this help us move less along the steep directions and more along the flat directions? Yeah.
[00:52:47] So I think this is actually maybe a great visual, because it compares the three different approaches. We have momentum, which you can see sort of overshoots, as there was a question about earlier, but then it kind of comes back. You have SGD, which is slower because it's just always moving in a fixed direction. And then you have RMSProp, which we just mentioned. The way RMSProp works here is that because the gradient in the direction I'm moving my mouse is higher, the gradient-squared term is larger, so we move less in that direction. You can see it quickly starts turning here towards the center, where the landscape is flatter, but it's traversing more in that direction. [00:53:29] So we're actually changing the direction we're going, by going less in the steep direction and more in the flat direction.

So those are the three, and then there's one more we'll discuss, which is by far the most popular optimizer used in modern deep learning. It's basically a combination of SGD with momentum and RMSProp. So here is almost what the Adam optimizer is, which is the most popular optimizer in deep learning, and you have all the prerequisite knowledge now to understand it. This first term here in red is basically the momentum we described before: beta 1 is like the momentum term, we have the velocity here, and we're taking a running average. [00:54:15] The second moment here is like the gradient-squared term from RMSProp, and we're doing the same thing, where we multiply the learning rate by the velocity instead of by the raw gradient, but we still take the square root of the second moment. The names first moment and second moment are a relation to physics and mechanics. But it's basically just a combination of the two things we explained earlier: you're accelerating movement along the flat directions, dampening it along the steep ones, and you're also adding this notion of momentum and velocity, so you gradually build up speed if you're continuously moving in the same direction.

[00:54:56] Now, as it's written right now, this will actually run into issues at the very first time step. It might be a little unclear to you why, so I'll wait for someone to have a guess. One thing to note is that these betas, beta 1 and beta 2, are usually initialized very close to one, like 0.9 and 0.999, and that these two moment values are initialized to zero. So during your first time step, if you just use this formulation of Adam, you can get unwanted behavior. It has to do with the second moment calculation; that's the main issue here. When you calculate the second moment and then use it on the next line, you run into a problem. Yeah, the denominator is basically zero. That's the exact issue: it starts at zero, so this term is zero; you have a very large beta, so this value is very small. [00:55:50] And if your gradient is not very large on your first step, this whole term can be very close to zero. Now we're dividing by something very close to zero, and it creates a very large initial step even though our gradient was small. That's probably not something we want. So the final thing Adam adds is these bias-correction terms, which account specifically for this issue and depend on the time step of training. This is also something you'll go into in the homework. [00:56:18] I just want to give you the basic intuition behind Adam and why the naive implementation wouldn't work, which is this really large initial step. You'll implement this in the homework and see how the time step is used, but the basic idea is that the correction accounts for that very large initial step, and as your time step increases, these bias-correction terms are needed less and less.

[00:56:38] Okay, cool. These are some good defaults that people normally use. If you're training a model with Adam, you could go with these; maybe it'll work, maybe it won't, but it's a good starting point, and in the remaining slides we'll talk about how you know whether your learning rate and these other values are right. I'll speed up a little in the interest of time, but you can see all these different optimizers converging.
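Putting the pieces together, here is a sketch of the full Adam update with the bias correction just described. The names are mine; the defaults shown (`lr=1e-3`, `beta1=0.9`, `beta2=0.999`) are the commonly used starting values mentioned in lecture:

```python
import numpy as np

def adam_step(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t is the 1-based time step."""
    m = beta1 * m + (1 - beta1) * dw       # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dw * dw  # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)           # bias correction: compensates for
    v_hat = v / (1 - beta2 ** t)           # the zero initialization at small t
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Without the `m_hat`/`v_hat` correction, `v` starts near zero on step one, so the division would produce the huge initial step discussed above; with it, the first step stays on the order of the learning rate.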
[00:57:07] They all have different properties, and you can see how Adam is this combination of RMSProp and SGD with momentum, where it has characteristics of both, which is very neat to see visually; it aligns with our intuition.

[00:57:19] One final topic related to Adam is how regularization interacts with the optimizer. For example, if we have L2 regularization, how does this affect how the optimizer works? I think the answer is that it's actually not immediately obvious, and you can do it in different ways. In default Adam, the L2 term is included when computing the gradient. [00:57:44] So we looked at the gradient, and there was the data loss portion and then the regularization loss; Adam uses both of those when it computes the gradient. But AdamW basically looks only at the data loss when doing all of these moment calculations and steps, and just adds the regularization term at the end. So all I'm trying to describe is that there is flexibility in how you incorporate regularization into your optimizers. Weight decay generally refers to adding the L2 regularization at the end, without including it in the actual optimizer calculations for the velocities, momenta, and so on.

[00:58:21] So this is the main difference, and under a lot of settings AdamW works slightly better; I think the Llama series from Meta all use AdamW, I assume because it does slightly better for them. So, we have one function to optimize; why are you splitting it into two? Yeah, so if you mix it into one function, that's what Adam does, and AdamW is specifically separating it into two. Why you might want to do that: you may not want your velocities, your momenta, to be a function of the weights; you want them to be a function of the loss. If you're trying to traverse your loss landscape more independently of your actual weight values, that's why you might want to separate it. You still might want a regularization term, but you don't want it to interfere with the moment calculation.
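The separation being described can be made concrete with a sketch of an AdamW-style step (variable names are mine): the weight-decay term never touches the moment estimates, and is applied directly to the weights at the end.

```python
import numpy as np

def adamw_step(w, dw_data, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW-style update; dw_data is the gradient of the data loss only."""
    m = beta1 * m + (1 - beta1) * dw_data        # moments see only the data loss
    v = beta2 * v + (1 - beta2) * dw_data ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive step
    w = w - lr * wd * w                          # decoupled weight decay
    return w, m, v
```

Plain Adam with L2 regularization would instead fold the `wd * w` term into the gradient before the moment updates, which is exactly the coupling AdamW removes.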
[00:59:07] So this is the specific reason why they do it. Ultimately it's empirical: you try both and see which one works better, but this is why you would do it that way.

[00:59:15] Okay, cool. So we'll talk about learning rates. There are different ways learning rates can be chosen. Sometimes you'll pick a very high learning rate, and what will happen is your loss gets very large as you oscillate out of the loss landscape, as we described earlier. If you have a very low learning rate, your issue is that you just converge very slowly. If you have a high learning rate but you're not oscillating out, you might still not be able to converge, because you're bumping around the local minimum without actually getting any lower, since your learning rate is too high.
And ideally, a good learning rate would have this property where it causes your loss to decrease quickly over time, but you keep seeing continued improvements as you continue to train the model.

[00:59:59] In reality, depending on the situation, a lot of these could be good learning rates, and it also depends on the step in training, which is the final thing we'll discuss in lecture today. You can actually change your learning rate as you train your model; you don't need a fixed learning rate or step size, and pretty much all the best modern deep learning models vary the learning rate during training in some way. One really simple way to do it is, after a fixed number of iterations, you just take one-tenth of the learning rate and continue training. [01:00:38] This can resolve the issue where your learning rate is too high for you to converge any further: you reduce it, and you're able to get lower into the loss landscape. This is really commonly used when training ResNets, a very popular type of convolutional neural network which we'll discuss later in the course.

[01:00:55] Another thing you could do is cosine learning rate decay, which is also extremely popular. Here you have basically half of a cosine wave, where you start at your maximum learning rate and go down to zero at the end, following this half-cosine shape. Here's the formula for calculating it; I won't go into the details, but the basic idea is that there are a ton of different ways to do it. [01:01:23] When your training uses a cosine learning rate scheduler, you'll often see a loss curve shaped like this, where you get pretty good continued gains in the middle part of training. The basic point is that the actual shape of your loss during training will depend heavily on which scheduler you use; it looks very different, for example, from this one, where you can literally see where we take one-tenth of the learning rate during training.

[01:01:45] Another option is linear learning rate decay, which just follows a straight line; you could also do inverse square root, and so on. [01:01:52] There's basically an unlimited number of ways you could vary your learning rate during training, and depending on the type of model you're training, you just choose the one that works best; here are some you could try that could perform well in your setting. Also, a really popular strategy is to have a linear warm-up: instead of starting at your maximum learning rate, you spend a fixed number of iterations linearly warming up to your maximum value, and then you follow whatever schedule you had afterwards. So, for example, linear warm-up and then inverse square root, or linear warm-up and then cosine, is a very popular setup for training models.
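The warm-up-then-cosine combination just mentioned can be sketched as a small schedule function. This is a sketch only: the exact formula varies between implementations, and the warm-up length and names here are my choices:

```python
import math

def lr_at(step, total_steps, base_lr=1e-3, warmup_steps=100):
    """Learning rate at a given step: linear warm-up, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))  # half cosine
```

Step decay, by contrast, would just return something like `base_lr * 0.1 ** (step // decay_every)`, producing the staircase-shaped curve mentioned for ResNet training.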
[01:02:35] One final thing: there is an empirical rule of thumb called the linear scaling rule (sometimes the linear scaling law), which says that if you increase your batch size, the number of training examples per update, by a factor of n, you should also scale your learning rate by n. So as you increase your batch size, you should increase your learning rate directly proportionally. The math behind this is a bit involved, and it's more of an empirical rule of thumb; people have tried to give mathematical arguments for why it should hold, based on the variance of the gradients in your batch, the number of gradients you calculate per batch, and so on.
[01:03:24] But really, it has just been shown empirically to hold for a large number of problems, so it's a good rule of thumb: if you have a winning recipe but you want to increase the batch size, then also increase your learning rate by the same factor.
[01:03:35] Cool. The final thing I'll touch on very briefly is the idea of second-order optimization, which uses the Hessian that someone asked a question about earlier. We won't cover this in depth, and it's not something we spend much time on in the course, but you should know it exists. The basic idea is that right now we're using the gradient to form a linear approximation of where the downward direction is as we traverse the loss landscape; we just look at that direction and take a step along it.
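The rule of thumb just described can be written in one line; the base batch size and base learning rate below are illustrative assumptions, not values from the lecture:

```python
def scaled_lr(new_batch, base_batch=256, base_lr=1e-3):
    """Linear scaling rule: if the batch size grows by a factor n,
    multiply the learning rate by the same factor n.

    base_batch=256 and base_lr=1e-3 are illustrative defaults only.
    """
    return base_lr * (new_batch / base_batch)
```

So a recipe tuned at batch size 256 with learning rate 1e-3 would use 2e-3 at batch size 512.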
[01:04:12] We added fancy things on top, like momentum and the RMSProp term that decelerates along the steep directions, but that's the basic idea: we use the gradient at each time step.
[01:04:21] The idea with the Hessian is that instead of using only the gradient, you fit a quadratic, a second-degree polynomial, to your function, based on the derivatives and the Hessian at that point, and you then find the minimum of that quadratic. In certain optimization problems this works extremely well, but generally we don't use it in deep learning, because it requires two things.
[01:04:53] One, you have to do a Taylor series expansion: right now we're only doing the first-order part, taking the derivative, but you would need to be able to calculate the second mixed derivatives, which is already difficult. On top of that, the matrix of mixed derivatives of every parameter in your model with respect to every other parameter gets very large for these million- or billion-parameter neural networks. So in practice we don't use it, because the matrices become far too large and you run out of memory, specifically GPU memory, if you try to run it. But if you're training a smaller model, or you're okay with spending much more time to get better steps towards the minimum, then you may want to look into this.
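As a sketch of the second-order idea on a toy problem of my own (not code from the lecture): fit a quadratic using the Hessian H and jump to its minimum with the Newton step w ← w − H⁻¹∇L. For an actual quadratic loss, a single step lands exactly on the minimum.

```python
import numpy as np

# Toy ill-conditioned quadratic bowl with minimum at (3, -1); my own example.
def loss(w):
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2

def grad(w):
    return np.array([2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)])

def hessian(w):
    # Constant for a quadratic; for a p-parameter model this is p x p,
    # which is what makes the approach infeasible for large networks.
    return np.array([[2.0, 0.0], [0.0, 20.0]])

w = np.zeros(2)
w = w - np.linalg.solve(hessian(w), grad(w))  # one Newton step
```

Gradient descent on this bowl would zig-zag along the steep direction; the Newton step rescales each direction by its curvature and converges in one step here.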
[01:05:42] It depends on the problem, but for smaller models this actually works quite well. For the large neural networks we're training, we basically never do it, due to the memory restrictions and all the compute time you'd spend calculating the Hessian; you would rather just see more data during training.
[01:05:58] All right. Some concluding thoughts that may be useful. Adam, or AdamW, is a really good default choice for training your first model; if you're working on a new problem in a domain, I would recommend it, and it can work okay even with a constant learning rate. Usually people will try AdamW with a constant learning rate, or with a linear warm-up and then a cosine decay; those are a really popular combination. Also, SGD with momentum can sometimes outperform Adam.
[01:06:31] But the tricky thing is that you generally have to tune the values more: you have to try many more learning rates, because you don't have the RMSProp-style term to account for the steep directions, and you might also have to try different schedules. Adam, in practice, is sort of best by test: people have tried it in a bunch of different domains and it works very well; it's very adaptive to the loss landscape.
[01:06:57] If you're doing a full-batch update, where at each step you can fit basically your entire training set into your batch, you might want to look beyond first-order optimization into second order and beyond: in that case your dataset, or maybe your model, is not very large, and you could potentially benefit from these nonlinear update steps and from more sophisticated strategies for finding the minimum.
[01:07:25] So I think we're essentially done with the lecture; I'll give a few slides looking forward. How do we optimize more complex functions than the linear models we covered in this lecture? Next lecture we'll look specifically at neural networks, which is a very exciting topic.
[01:07:46] A two-layer neural network, the one we'll discuss in class, basically has two of these weight matrices, one for each layer, with something called a nonlinearity stuck between them. In this case, not the most common but the simplest one is the ReLU function, which you'll learn more about; the basic idea is that we now have two weight matrices, with this additional function applied between the two matrix multiplications. This is nice because, as I said, it's nonlinear. If we try to build a linear classifier on data like this, we run into the issue that the blue points and the red points are not linearly separable.
[01:08:24] But maybe there are some transformations we can do, possibly through many layers of a model, that eventually transform the data into a form in which it is separable by a line, which would then be the final layer of the model.

================================================================================ LECTURE 004 ================================================================================ Stanford CS231N | Spring 2025 | Lecture 4: Neural Networks and Backpropagation Source: https://www.youtube.com/watch?v=25zD5qJHYsk --- Transcript

[00:00:05] As you can see on this slide, today we're going to talk about neural networks and backpropagation, which is actually the process that, in my early years studying this, I often referred to as the magical process that lets neural networks learn from their own mistakes, pretty much like humans, but in a more organized fashion and using a little bit more math. So let's dive into the topic.
[00:00:47] I'm sure this is going to be exciting, and it lays a foundation for the rest of the quarter: every single algorithm we'll discuss in the future, without even mentioning it, uses a form of backpropagation, which is why understanding this lecture and its topics is very important.
[00:01:09] Okay, in keeping with tradition, let's cover what we've talked about so far. I'm sure you remember what we discussed last time: we saw how to form the objective functions, or loss functions as we call them here, and then we talked about regularization. To do that, we formulated everything through the (x, y) pairs and a scoring function, in this case a linear scoring function, as you can see, ultimately defining this loss
function. [00:02:06] The graph you see on the right is what we drew showing the entire process of learning. There have been some questions, in the last lecture and even before, about why we only use the softmax function. I want to reiterate that it's not the only loss function we have: it's one of the most widely used in deep learning, especially for the task of classification, but there are many other options for different tasks, and even for classification itself. If you've looked at the slides shared on the website, I included this hinge loss, or what used to be called the SVM loss, in the reading assignments for lecture two.
[00:03:11] In those slides we had examples and everything around the topic of hinge loss. It is also a widely used loss function, especially from the early years of neural networks. To give you a high-level understanding: this is a loss function that, unlike softmax, does not turn the scores into probabilities, so turning scores into probabilities is not the only option, right? We can use other formulations. This function encourages the score of the correct item, denoted s_yi, to be higher than the scores of all other items s_j; you can see the condition here producing a value of zero
[00:04:18] if the condition is true. Otherwise, as I said, it encourages the score of the correct item to exceed the scores of all other items by at least a margin: the number one you see there is that margin, and if the condition is violated, the loss increases proportionally with how far the margin is violated. This is the visualization of the function. So it promotes correct scores by penalizing cases where irrelevant items are scored too highly. Again, refer to the reading assignment in lecture two for examples and a better understanding.
[00:05:17] Next, we talked about general optimization: how to find the best parameters W for the neural network.
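The hinge loss just described can be sketched for a single example; the score values used in the comment are illustrative, not from the lecture:

```python
import numpy as np

def multiclass_hinge_loss(scores, y, margin=1.0):
    """Multiclass SVM (hinge) loss for one example.

    scores: 1-D array of class scores s_j; y: index of the correct class.
    Each incorrect class contributes max(0, s_j - s_y + margin), so the loss
    is zero only when the correct score beats every other score by the margin.
    """
    correct = scores[y]
    margins = np.maximum(0.0, scores - correct + margin)
    margins[y] = 0.0  # the correct class contributes nothing
    return margins.sum()

# Example: scores [3.2, 5.1, -1.7] with correct class 0 gives
# max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1) = 2.9 + 0 = 2.9
```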
[00:05:32] In doing so, we talked a little about the loss landscape being like a large valley, as shown in this image. Every point in that valley is a different set of weight parameters, and we want to find the set of parameters W that minimizes the loss over that landscape. We talked about how the key is being able to take the gradient of the loss function L with respect to W and use it for optimization in a step-by-step manner, which gave us the gradient descent algorithm. So the weights are updated, although it's very hard for me to see from this distance what I'm pointing to, but I can guess. So anyway,
[00:06:52] in order to walk down the loss landscape towards the minimum, a step size is defined, and we take one step, scaled by that step size, in the negative direction of the gradient. That was the gradient descent algorithm. To compute the gradient, we talked about two different approaches, numerical gradients and analytical gradients, each with pros and cons, and we discussed that in practice we derive analytical gradients; often, when the math and the implementation are hard, we check our implementations against numerical gradients.
[00:07:37] One of the other challenges we talked about was evaluating the loss function and its gradient on the entire dataset: if you have a large dataset, it's very expensive to run the loss and its derivative over all of it.
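The gradient check mentioned above, comparing an analytical gradient against a numerical one, can be sketched like this on a toy loss of my own (not code from the lecture):

```python
import numpy as np

def loss(w):
    return np.sum(w ** 2)          # toy loss; its analytic gradient is 2w

def analytic_grad(w):
    return 2.0 * w

def numerical_grad(f, w, eps=1e-5):
    """Centered finite differences: (f(w + eps) - f(w - eps)) / (2 eps),
    one coordinate at a time."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2.0 * eps)
    return g

w = np.array([1.0, -2.0, 0.5])
# The two gradients should agree to several decimal places.
```

This is also why the numerical approach is impractical as the main method: it needs two loss evaluations per parameter, but it makes a cheap sanity check for a hand-derived gradient.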
[00:08:01] That's why we talked about the idea of mini-batches: using a number of examples sampled from the dataset, often 32, 64, 128, or 256, and that subsampled data is used to estimate the gradients and then take steps towards the minimum. Beyond stochastic gradient descent, we talked about some refinements: SGD with momentum, RMSProp, and the Adam optimizer. There were a lot of details there, and I would refer you to the third lecture if you have any specific questions about those.
[00:09:06] Another thing we talked about was the importance of the learning rate and of scheduling the learning rate.
[00:09:19] In some optimizers we often start with a larger learning rate and then apply some type of decay, reducing its value by a factor. This is normally needed in many optimizers, but in some of the more recent ones, Adam and its variants, we often do not need to decrease it manually or explicitly, because that is in a sense encoded into the optimizer itself.
[00:09:53] With that, I want us to get to the topic of neural networks and see how we can actually build them and solve more exciting and harder problems. So far we've talked about this linear function, W multiplied by x, and that is the most basic neural network that could be defined: it's just one layer. We will be talking about layers.
[00:10:41] What I want you to pay attention to here are these dimensions D and C: D is the dimensionality of the input data X, the number of features, and C is the number of classes, basically the number of output nodes or neurons, however many outputs we need.
[00:11:05] To create a neural network with a second layer, we define a new set of weights, referred to as W2 here, and apply them to the output of the previous layer, W1 multiplied by X. Again, pay attention to the dimensionalities: we have C outputs and D input features, but we also define H, the number of hidden-layer nodes or neurons. That's one point. The second point is this max function, which we'll come back to and explain what it is and what it means.
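The two-layer form just described, f(x) = W2 · max(0, W1 · x), can be sketched like this; the sizes of D, H, and C are illustrative values I chose, and the bias terms are omitted as in the slide's simplified notation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 4, 8, 3          # illustrative: input features, hidden units, classes

W1 = 0.01 * rng.standard_normal((H, D))   # first layer weights: (H, D)
W2 = 0.01 * rng.standard_normal((C, H))   # second layer weights: (C, H)

x = rng.standard_normal(D)                # one input example
h = np.maximum(0.0, W1 @ x)               # max(0, .) is the ReLU nonlinearity
scores = W2 @ h                           # one score per class, shape (C,)
```

Without the `np.maximum` in between, the two matrix products would collapse into a single matrix W2 W1, i.e. back into a one-layer linear model, which is why the nonlinearity is essential.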
[00:12:02] What the max operation is doing here is creating a nonlinearity between the linear transformations done by W1 and W2, and this is actually a very, very important part of the process. I will talk a little bit about the nonlinearity, but also look at this last part before I forget: it's true that in practice we are only writing W and x here; as we talked about in the first and second lectures, we also incorporate a bias to have a complete framework. So in practice we also have a bias, but we don't write it here for the sake of simplicity.
[00:12:52] Anyway, the max operation is creating the nonlinearity, and it's actually very important, because when we talked about linear classifiers in the last few lectures, we mentioned that there are so many problems where we can't separate the
samples with just one single line, right? This was one of the examples: in order to be able to solve this problem with linear functions, we need some sort of nonlinear transformation from the original space to a new space, and in the new space you see that the samples are separable using a line. In this case it's a nonlinear transformation between the input space and the second space, mapping x and y to their polar coordinates r and theta. But again, this is just one example; there are so many others too.
[00:14:01] So with this example, let's go back, oops, let's go back to our definition of the two-layer neural network.
[00:14:14] As you've probably seen in the literature outside this class, these types of networks, which only depend on weights, inputs, layers, and so on.
There are no operations other than multiplication. Such networks are often referred to as fully connected networks, or multilayer perceptrons, MLPs. So that's one thing, and we can actually stack more and more layers to create larger networks. In this case, again, pay attention to the dimensionalities of the hidden layers that we have in the middle, and how the dimensionalities match one after the other.
[00:15:08] So, back to this visual representation of what the neural network is doing. We talked about this when we had the linear representations: often what happens is that the network, through its weights, is learning some sort of templates.
[00:15:32] If you remember, last week we were talking about these templates that are being learned. So again, I'm saying
templates, but they are really just representatives of the images, learned from the data, depending on what data the model was trained on. So the templates we discussed last week were generated by those 10 outputs, by applying the W's on top of the input neurons.
[00:16:09] So with that, now that we have multiple layers, we can actually create more templates. Now we have a layer in the middle that can create, let's say, 100 templates, as opposed to just 10 for a linear classifier, although we still have those 10 as well. And again, from a very high-level point of view, I'm telling you what this means: when we have these 100 neurons in the middle, we are giving the network the power to create templates not for entire objects, but maybe
for parts of the object. For example, the classes that you see here: we had bird, cat, deer, dog, frog, horse, and they all have eyes, right? So one of those 100 templates could be a part of an object that is shared between all of the classes. So that is the high-level picture: these can form templates, and when we come back to the topics of visualization and what neural networks learn, we'll uncover more details about what I'm talking about right now.
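Going back to the change-of-coordinates example from a few minutes ago: the polar mapping can be sketched concretely. A small illustrative check (the ring radii are made up, not from the slide):

```python
import numpy as np

def to_polar(x, y):
    """Map Cartesian coordinates (x, y) to polar (r, theta)."""
    return np.hypot(x, y), np.arctan2(y, x)

# Two concentric rings are not separable by a line in (x, y), but after the
# polar mapping the classes differ only in r, so a line r = const separates them.
angles = np.linspace(0, 2 * np.pi, 50, endpoint=False)
r_inner, _ = to_polar(1.0 * np.cos(angles), 1.0 * np.sin(angles))  # class 0, radius 1
r_outer, _ = to_polar(3.0 * np.cos(angles), 3.0 * np.sin(angles))  # class 1, radius 3
assert r_inner.max() < 2.0 < r_outer.min()
```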
[00:17:27] So, back to the max function. We talked about the max function and the nonlinearity it creates here, and in neural network terminology we call that an activation function. It's actually playing a very important, pivotal role in building a neural network. Let's answer this question that we have on the slide: what happens if we try to build a neural network without one of these activation functions, say the max function? This would be our function if I removed the max: it would be W2 times W1 times x. What would happen here? Yes, exactly. As you can guess, and as you correctly mentioned, the multiplication of W2 by W1 could easily be replaced with another matrix, W3, and then your function becomes just a linear function. Everything could be lumped together. So we need
some sort of nonlinearity in the middle to give us the power to solve nonlinear problems.
[00:18:46] The function that we just talked about is ReLU, the rectified linear unit. It's a very popular activation function used in neural networks, and there are many other variants that have been tested in many architectures, even in the more modern ones. One of the problems that ReLU has is that it sometimes creates dead neurons, because it makes everything equal to zero if it's not positive, right? So, in order to avoid dead neurons, Leaky ReLU, with this type of modeling, or ELU, the exponential linear unit, are other options.
[00:19:39] ELU is a little bit better because it is closer to zero-centered. And then there are some newer variations: GELU, the Gaussian Error Linear Unit (I've heard both pronunciations, "jell-u" and "gell-u"), could be used; it is used more often in newer architectures, in transformers. And we also have SiLU, or Swish, the sigmoid linear unit; that one is also used in some of the modern CNN architectures. Google was using it in some variations of their models, and also in EfficientNet.
[00:20:34] Other than these, there are functions like sigmoid and tanh that are also often used as activation functions, although they do have a few problems, because they squash values into a narrow range, and that sometimes results in vanishing gradients.
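For reference, the activations named here can be written out directly. These are scalar versions of the standard textbook definitions, not code from the lecture; GELU is given in its exact erf form.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):          # small slope for x < 0 avoids dead neurons
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):                  # smooth, closer to zero-centered for x < 0
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def gelu(x):                            # Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):                            # a.k.a. Swish: x * sigmoid(x)
    return x * sigmoid(x)
```

Note how ReLU zeroes out every negative input (the dead-neuron issue), while the leaky and exponential variants keep a small signal there.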
[00:21:01] So we often do not use sigmoid or tanh in the middle of neural networks. They are more often used in the later layers, where we want to, for example, binarize the outputs and things like that. So, as I said, ReLU is often a good default choice; it's used in many architectures, and there are many variations of the same function, as we talked about.
[00:21:33] I want to summarize what we've talked about and then answer some questions. We did talk about adding layers and so on, but I want to highlight that activation functions are functions operating within the layers, and you also have the W's, which define the weights mapping between the previous layer and the next layer. Again, these are fully connected neural networks with very simple implementations. All we need is to be able to define an activation function.
And in this example, if you look at it, we have the sigmoid function defined as the activation function. Very easily, using that activation, the first and second layers of hidden values, the hidden neurons, are calculated by applying W1 to x, adding the bias, and then applying the activation function; the same goes for h2. And the output is very simply the dot product between w3 and the last layer of hidden values, creating the output layer.
[00:23:03] I'll stop here to answer some questions, if there are any, and then I would love to continue. That is a great question, and the question is: for a new problem, how would we choose which of these activation functions to use? The short answer to your question is yes, it's empirical in most cases.
[00:23:27] But we often start with ReLU, or we go with the standard activation functions used for those specific architectures. As I mentioned, there are activation functions that are commonly used in CNNs or in transformers and other architectures, so we often go with the ones that have been tested before. But yes, it's mostly empirical. If you're designing a new network for a new problem, then that's one of the choices you have to make, very much like other hyperparameters.
[00:24:05] So the question here is: what is the attribute that is common to all of these activation functions, and what does it really do? I will give you some examples, and I'll go into some of the details of what these activation functions are doing.
Basically, the main and most important common characteristic here is creating nonlinearity; we're not using a linear function as the activation. So creating some sort of nonlinearity is what makes these functions so important. And why do we have so many variations? I told you a little bit about the problems with vanishing gradients, and a little bit about the differentiability of the functions: they should be differentiable, because we are using them in a neural network. And sometimes having a properly zero-centered, smooth function makes networks converge much faster. So there are many different factors, and these are the main ones I talked about, which play an important role in defining or designing these functions.
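The vanishing-gradient point can be made concrete with sigmoid's derivative, which peaks at 0.25 and dies off for large inputs. A small illustrative check, not from the lecture slides:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The maximal slope, at z = 0, is only 0.25; far from 0 the gradient nearly
# vanishes, so stacking many sigmoid layers shrinks gradients multiplicatively.
peak, tail = sigmoid_grad(0.0), sigmoid_grad(10.0)
```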
[00:25:30] I'll talk a little bit more about that when I go into the details of the functions, too. In all of the layers we often use the same activation function, but as I said, sometimes in the later layers, or the output layer, we use something like a sigmoid or a tanh function. But commonly, yes. And the question was whether we use the same function across the entire network, for all of the neurons.
[00:26:09] Okay. Continuing with what we were talking about, which is the implementation of these models, of a neural network.
[00:26:29] So, there is a very simple way: building a two-layer neural network in Python is less than 20 lines of code. Very simple. We define our network, and as I talked about the dimensionalities, N is the number of samples, D_in is the dimensionality of the input, D_out is the dimensionality of the output, and H is the number of neurons in the hidden layer. And this part is just creating X and Y and randomly initializing the W's.
Then we have the forward pass, which means applying the W's to the inputs, layer by layer, ultimately creating the output, the prediction y_pred, and finally calculating the loss function and outputting that loss value.
[00:27:33] After the forward pass, we need an optimization process: a way to calculate the analytical gradients, and then to use those gradients to run gradient descent to optimize W1 and W2, basically taking one step toward the optimal values of the network.
[00:28:00] But this part, calculating the analytical gradient, is the most important part here, and we haven't really gone into it yet. Almost the rest of this lecture is about making this work and scale in different settings.
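A training loop in that spirit, with the forward pass, manually derived analytical gradients, and a gradient-descent step, can be sketched in NumPy like this. The sizes, iteration count, and learning rate are illustrative; the slide's actual code may differ.

```python
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10    # samples, input dim, hidden neurons, output dim
rng = np.random.default_rng(0)
x, y = rng.standard_normal((N, D_in)), rng.standard_normal((N, D_out))
w1, w2 = rng.standard_normal((D_in, H)), rng.standard_normal((H, D_out))

lr, losses = 1e-6, []
for t in range(500):
    # Forward pass: apply the weights layer by layer to get the prediction.
    h = x @ w1
    h_relu = np.maximum(h, 0)
    y_pred = h_relu @ w2
    losses.append(float(np.square(y_pred - y).sum()))

    # Backward pass: analytical gradients of the squared-error loss.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T @ grad_y_pred
    grad_h = (grad_y_pred @ w2.T) * (h > 0)   # ReLU gate zeroes gradients where h < 0
    grad_w1 = x.T @ grad_h

    # One gradient-descent step toward better w1 and w2.
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2
```

Deriving grad_w1 and grad_w2 by hand like this is exactly what becomes unmanageable at scale, which is the problem backpropagation addresses in the rest of the lecture.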
[00:28:26] So, after training such a neural network, depending on how many nodes we use in the hidden layer, you see that we can get different patterns of separation between the two classes, and more neurons often means more capacity to learn more complex functions and better separation of the points.
[00:29:00] If you take a look at this, the pattern I'm showing here is very similar to the one I showed in the second lecture, where we were talking about k-nearest neighbors. When we had k equal to one, the one-nearest-neighbor framework, it was very much like using more neurons. So the same type of argument applies here: if we give the network a lot of capacity, then we will have some overfitting problems; we won't be
able to generalize to unseen data. But there are many different solutions for this as well, and as a rule of thumb, what I want to highlight for you here is: do not use the size of the neural network as a regularizer. We don't often use that as the hyperparameter to fine-tune, although we do experiment with different values of the network size and related hyperparameters. What we often do is go with a somewhat bigger network than we need, and then we use regularization, and specifically the regularization hyperparameter, to explore the different setups. So what we often tune is the regularization and the regularization hyperparameter, not necessarily the network size itself.
[00:30:45] Okay. This is the concept of neural networks in a nutshell. But we
But we have heard about neural networks and how [00:30:58] have heard about neural networks and how they could be related to uh the [00:31:02] they could be related to uh the biological there are some biological [00:31:04] biological there are some biological inspirations. Uh so I'll I'll talk a [00:31:05] inspirations. Uh so I'll I'll talk a little bit about it but there's a [00:31:06] little bit about it but there's a question basically your question is uh [00:31:10] question basically your question is uh why is the model [00:31:12] why is the model more underfeeding when we increase the [00:31:15] more underfeeding when we increase the value of lambda here? Yes. So um just to [00:31:20] value of lambda here? Yes. So um just to quickly answer that question, the value [00:31:22] quickly answer that question, the value of lambda is controlling how much [00:31:25] of lambda is controlling how much contribution the regularizer should have [00:31:28] contribution the regularizer should have in the overall loss, right? And the [00:31:31] in the overall loss, right? And the larger contribution that you have um on [00:31:34] larger contribution that you have um on the regularizer and remember that [00:31:36] the regularizer and remember that regularizer was defined on W's. So it's [00:31:40] regularizer was defined on W's. So it's constraining the W's. It's giving less [00:31:42] constraining the W's. It's giving less freedom to the values on W's, right? So [00:31:46] freedom to the values on W's, right? 
[00:31:50] So less freedom means somewhat more generic decision boundaries, not necessarily giving you those detailed parts of the boundaries, right? So if you constrain the model too much, even with a regularizer, you're also going to get decision boundaries like that. Yes, does the right regularizer always prevent overfitting? Again, you are creating a compromise, a balance, between the data loss, predicting the right output, and the regularizer. The first part of the loss is about predicting the right output. The second part only plays with the values of the weights; it doesn't care about the outputs anymore. If you overweight it, you're not going to get very good classifiers, right? So a balanced regularizer is always good, but nothing is good if you use too much of it. Right.
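The balance being described, a data term that cares about the outputs plus a lambda-weighted term that only constrains the weights, can be sketched in a few lines (a minimal illustration with a made-up linear model and toy data, not the course's assignment code):

```python
import numpy as np

def total_loss(W, X, y, lam):
    """Data loss (mean squared error of a linear model) plus
    an L2 regularizer weighted by lambda."""
    scores = X @ W
    data_loss = np.mean((scores - y) ** 2)   # cares about the outputs
    reg_loss = lam * np.sum(W ** 2)          # only constrains the W's
    return data_loss, reg_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
W = rng.normal(size=3)
y = X @ np.array([1.0, -2.0, 0.5])           # toy targets

for lam in [0.0, 0.1, 10.0]:
    d, r = total_loss(W, X, y, lam)
    print(f"lambda={lam:5.1f}  data_loss={d:.3f}  reg_loss={r:.3f}")
```

As lambda grows, the second term dominates the total; pushed too far, it drags every weight toward zero and the model underfits, which is exactly the compromise described above.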
[00:32:53] Could you go over again why we would want to change the regularization rather than the size? So, there are many different reasons. One of them is the size of the network: if you're going to build networks, sometimes you have to run them for a few days to get some results, right? So what we often do is start increasing the number of parameters in the network until we see some level of overfitting. That's when we know that the network is actually capturing the patterns in the data and is now able to memorize the data, and that's when we try to minimize the overfitting by regularizing the network. So regularization plays an important role there.
[00:33:54] So if we go too high on the number of parameters, on the complexity of the network, then that's going to cause a problem, and we don't often do that. For a new problem, we often start with smaller networks, increase the size, and then correct it with the regularizer for the given problem. How do we know how many neurons we need to solve the problem? That's based on empirical research work and on looking at other, similar setups; there is no one prescription for all. You have to look at other counterparts, other types of networks that were trained on similar data, start from that range, and then often run a number of experiments yourself to increase or decrease the complexity of the network. So it's pretty much always bound to exploration.
[00:35:00] So your question is: is there any theoretical, foundational work on which activation functions to use and how many layers to use? There are many research papers analyzing these, and also some methods for optimizing all of these meta- or hyperparameters of the networks. We're not going to get into them in detail, because a big part of it is very much dependent on the dataset and the problem you're solving. So the best answer to your question is: yes, there are some works out there, but each of them makes assumptions that may not necessarily hold for your application or problem. So, moving on: there are some biological inspirations. Again, these inspirations are very loose.
[00:36:06] If there is a neuroscientist sitting here or watching online: do not take all of the examples that I'm giving you as the ground truth. But generally, here is what happens in neurons; this is a visualization of a neuron. It has a cell body that aggregates the impulses carried through the dendrites to the cell body itself, and then through the axon those impulses are carried away to other neurons. This is very similar to what we are doing in our neural networks. We often have a function that captures the signals, all of the impulses, the activations, from the previous layers; in the cell body that function operates on the inputs, outputs the activations, and passes them on to the next layer, the next neuron.
[00:37:28] And that's basically why we need some sort of activation function here: to create the impulses, to increase or decrease the values. With that said, there are many differences between biological neurons, which can be way more complex than the neural networks we build, but generally there are common concepts. The neural networks that we build are usually organized into regular patterns, and those patterns exist because we want better computational efficiency when we implement the neural networks. There has been research on building more complex, irregular neural networks and trying to optimize them, but in terms of results they are almost comparable with the regular neural networks that we usually build and will be talking about in this class.
[00:38:36] I can't warn you enough about being careful with your brain analogies and how they can be interpreted. There are so many differences, and I'll just stop here; I'd be happy to discuss it if anybody is interested in the neuroscience aspect of things as well. So, plugging everything in: we had a scoring function. This scoring function turns the inputs, through some W's, some weight vectors or weight matrices, into scores. What we often use as the loss function for the network takes those scores through either the hinge loss or the softmax, or other variations. And in addition to that, we defined regularizers, which ultimately give us the total loss: the data loss plus the regularizer.
[00:39:51] And we talked about the fact that, in order to be able to optimize W1 and W2, what we need is to be able to take the partial derivatives of L with respect to W1 and W2: ∂L/∂W1 and ∂L/∂W2. There are many details that we have to be aware of. First, building these functions and then taking the derivatives and writing them down is often tedious. There are lots of matrix calculations, and you need a lot of work on paper before you can actually implement a neural network. The other problem is: what if you want to change the loss slightly after we have done all of the calculations on paper? In that case we have to redo the entire thing. And finally, this becomes intractable, and sometimes infeasible, if the loss function is complex.
[00:41:10] With complex functions this is going to be even harder. But there is a better idea, something that is often used in our implementations, and I'm going to go through a few examples today just to make sure everybody is on the same page and understands these topics. And that is computational graphs and the idea of backpropagation. A computational graph puts together all of the operations in the neural network step by step: we start from the inputs and all of the parameters that are needed, and get the loss as the final output, the final layer. So in this case we had a loss function, which could be a softmax function or a hinge loss function, whatever it is, and it is added to the regularizer, the function R(W), where R has W as its input.
[00:42:21] These two added together create the loss, and before the loss can be calculated we also need to combine X and W to create the scores; this is a multiplication node. This is actually very useful, because most of the neural networks that you build also have graphical representations, and all of these complex functions can be shown within the same framework. We can then use this to build their computational graph, starting from the input image or input data; there are a bunch of weights throughout the network, and finally there is the loss function. And again, this is useful because there are some complex neural networks, like the Neural Turing Machine, which is used for temporal and sequential data, so there's a lot of unrolling of this machine.
[00:43:30] And if we had to do all of that work manually, by hand, it would be intractable and not feasible. So that's why, once we build this computational graph, the solution is backpropagation. I want to start with a very simple example. We start with a function f(x, y, z) = (x + y) * z. If I draw the computational graph for this function, you see we have an addition operation between x and y, and then a multiplication between that sum and z. So, given an input setup of x = -2, y = 5, and z = -4, we can now make all of the calculations and do the forward pass, stepping forward through the network. The first step is adding x and y, which gives us 3.
[00:44:59] In order to understand the steps one by one, I'm giving this intermediate result a name: q = x + y. If I want to calculate the partial derivatives of q with respect to both x and y, that's very simple, because we have the formula relating q to x and y: ∂q/∂x = 1 and ∂q/∂y = 1 as well. So this is a simple setup; we know it exists, so let's keep it in the back of our minds. Then the second operation is f = q * z. Again, since we have this function, it's very easy to write down the partial derivatives: ∂f/∂q = z and ∂f/∂z = q. So it's a kind of swap between z and q. I'm hoping that everybody knows all of this from calculus.
[00:46:15] If you don't, you should definitely check it out and remind yourself, because these facts are very important, in general and for the rest of the quarter. What we want, what we need in this setup to complete this example of backpropagation, are the partial derivatives of f with respect to x, y, and z. The way backpropagation implements this is to start at the end of the network and go backward, back-propagating all of the gradients; this is basically a recursive process. So, what is the derivative of f with respect to f? It's the thing with respect to itself, right? The derivative of the loss function with respect to itself is always 1. If I want to backprop, the first, most immediate variable is z.
[00:47:38] You can see here that we have z, and if I calculate the derivative of f with respect to z, we already have it, right? ∂f/∂z = q. So whatever the value of q is becomes this gradient as well. Next we have q, the next variable that is directly connected to f. This is also easy to compute, because we have the derivative of f with respect to q; we already calculated that it's equal to z, and whatever z is, that's the value of the derivative here: -4. Next we have y, which sits directly before q. Although we need the derivative of f with respect to y, y and f are not directly connected, and that's where we use the chain rule, where we split the calculation of the derivative through the variable in the middle.
[00:49:03] So ∂f/∂y = (∂f/∂q)(∂q/∂y), right? This is how the chain rule is written in this case. And now I want to introduce two important new terms: the local gradient and the upstream gradient. The upstream gradient is the gradient that comes from the end of the network to the current node that we are in, and the local gradient is the gradient of the node's output with respect to its input; it is local to the node. Computing these is actually not too hard, because we already have the value of ∂f/∂q, and we also already have the value of ∂q/∂y. So it's 1 multiplied by z, and the value becomes -4.
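The whole toy example, the forward pass plus the backward chain-rule pass just worked through, fits in a few lines (a minimal sketch; the variable names are mine):

```python
# f(x, y, z) = (x + y) * z with the lecture's inputs.
x, y, z = -2.0, 5.0, -4.0

# Forward pass.
q = x + y                # addition node:        q = 3
f = q * z                # multiplication node:  f = -12

# Backward pass: multiply each local gradient by the upstream gradient.
df_df = 1.0              # derivative of the output with respect to itself
df_dz = df_df * q        # f = q*z  ->  local df/dz = q  ->  3.0
df_dq = df_df * z        # f = q*z  ->  local df/dq = z  -> -4.0
df_dy = df_dq * 1.0      # chain rule: upstream df/dq times local dq/dy = 1
df_dx = df_dq * 1.0      # chain rule: upstream df/dq times local dq/dx = 1

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```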
[00:50:10] Same story for the other variable, x: the local-times-upstream product can again be written down with the chain rule, and it also results in -4, because in both cases the local gradient with respect to x or y was already 1, so both of them get the same value. With this computational setup and the computational graph, it becomes very easy to modularize what we want to do for every single node in the neural network: the node has x and y as inputs, or whatever else, and z as the output. What we need first are the local gradients, which we can always compute, since we have the node's function f as a function of x and y.
[00:51:17] The gradient of the output with respect to each of the inputs is easy to calculate for every single node, and what we need in order to back-propagate is the upstream gradient, right? The backpropagation process gives us the power to obtain this upstream gradient step by step. So when we are at this node, we also have the upstream gradient already calculated from the later nodes, and that's what we need. What we do next is simply multiply the upstream gradient by the local gradient, creating what we now call the downstream gradients. The downstream gradients become the upstream gradients for the previous layers, right? So that's how we calculate it for x; same story when it comes to y.
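That modular recipe, cache the inputs on the way forward, then return local gradient times upstream gradient on the way back, can be sketched as a single gate (a hypothetical `MultiplyGate` class for illustration, not code from the lecture):

```python
class MultiplyGate:
    """One node z = x * y in a computational graph."""
    def forward(self, x, y):
        self.x, self.y = x, y      # cache inputs for the backward pass
        return x * y

    def backward(self, dz):
        # downstream gradient = local gradient * upstream gradient (dz)
        dx = self.y * dz           # local dz/dx = y
        dy = self.x * dz           # local dz/dy = x
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)      # the q * z node from the example
dq, dz = gate.backward(1.0)        # upstream gradient from the loss is 1
print(out, dq, dz)                 # -12.0 -4.0 3.0
```

The returned values would in turn be handed to the previous nodes as their upstream gradients.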
[00:52:19] So this whole process gives us the power to do all of these calculations completely locally, going backwards step by step and handing the results to the previous nodes so they can continue the process. Again, this is one of the most fundamental operations in all of neural networks and in many optimization processes involving multiple layers of information. If I understand the question correctly, you're asking how we can understand intuitively what the gradients are doing, right? So let's take one step back and see why we are here to begin with. What we needed was to calculate the gradient of the loss function with respect to W1 and W2, and the W's in general, so that we can take a step in the negative direction, the opposite direction of the gradients, to find the optimal value, right, the optimal loss.
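The descent step those gradients feed into is a single update (a sketch; the learning rate and the toy gradient values are assumptions):

```python
import numpy as np

# One step of gradient descent: move W against the gradient of the loss.
lr = 0.1                              # assumed learning rate
W = np.array([0.5, -1.0, 2.0])        # toy weights
dL_dW = np.array([0.2, -0.4, 1.0])    # gradient produced by backpropagation

W = W - lr * dL_dW                    # step in the negative gradient direction
print(W)
```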
[00:53:23] So in order to do that, we need the gradient of the loss with respect to everything. What we are doing is moving the gradient of L with respect to all variables back to every single value of the network, without sitting down and writing out the function for the entire network. If the network has 100 layers, we're not going to write out the function for all 100 layers separately. This is how we back propagate step by step to get the values we need to optimize every single weight in the network. [00:54:05] Okay, another example. This is a slightly more complex function of the weights and x: f = 1 / (1 + e^-(w0*x0 + w1*x1 + w2)), a sigmoid of a linear combination of x and w.
[00:54:26] So there are a bunch of multiplications, additions, a negation, the exp function, and ultimately a 1/x applied to whatever we calculated. With all of those, let's look at this example where we have specific values for w0, x0, w1, x1, and w2. With these given values we can do the forward pass and calculate every single value in the process. Just to remind you of the details we know from calculus: for the exp function e^x, the derivative with respect to x is e^x; for multiplication by a constant, the derivative is the constant itself; 1/x has a derivative of -1/x^2; and for addition of a constant, the derivative is always equal to one.
[00:55:40] So, as I said at the very beginning, at the end of the network the derivative of L with respect to L is always equal to one. That's where we start, using the rule for the derivative of the function 1/x. The upstream gradient, as I said, is always one at the end. The local gradient is -1/x^2, where x is whatever the input is. This calculation results in -0.53. So -0.53 is the downstream gradient, which becomes the upstream gradient for the next node. At the next node the function is just a constant addition, where we know the local gradient equals one, so one multiplied by the upstream gradient, and the same value goes back. [00:56:39] The next step is the exp function.
[00:56:45] For that, again, we already have the upstream value, and the local gradient is e^x, where x, the input to this step, is -1. Calculating this gives us -0.20, and that goes back to the next step. [00:57:08] Here we again have a multiplication with a constant number, which makes the local gradient equal to that constant value, defining the new downstream gradient. Going back, we now have an addition function that is getting two inputs with different values. Again, if we want the upstream gradient, it's 0.20; we already have it. The local gradients will be equal to one, because it's just an addition between two values.
[00:58:02] The derivative of x + y with respect to both x and y is always one, so both inputs receive the same gradient. Then we have multiplication operations. For multiplication, again, we have the upstream gradient values, and for the local gradient: if we have, say, a multiplied by x, the derivative with respect to x is always a, the other variable. So here, for the first one it's -1, which is the value of x, and for the second one it's 2, which is the value of w: always the other variable, whatever value it has. With that we can calculate everything, and then also calculate the gradients with respect to w1 and x1. Again, we made all of these calculations so we can identify how much W should change in order to step towards the optimal point in the network. [00:59:11] So this was another example.
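The whole worked circuit can be written out in a few lines. The lecture only states x0 = -1 and w0 = 2 explicitly; the remaining values below (w1 = -3, x1 = -2, w2 = -3) are assumptions, chosen so the intermediate gradients match the -0.53 and 0.20 quoted above.

```python
import math

# inputs: only x0 = -1 and w0 = 2 are stated in the lecture; the rest are assumed
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# forward pass: f = 1 / (1 + e^-(w0*x0 + w1*x1 + w2))
s = w0 * x0 + w1 * x1 + w2             # 1.0
e = math.exp(-s)
f = 1.0 / (1.0 + e)                    # ~0.73

# backward pass, node by node
d_f = 1.0                              # dL/dL is always 1
d_inv = (-1.0 / (1.0 + e) ** 2) * d_f  # 1/x node: local -1/x^2, gives ~ -0.53
d_add = 1.0 * d_inv                    # +1 node: local gradient 1
d_exp = e * d_add                      # exp node: local e^x, gives ~ -0.20
d_neg = -1.0 * d_exp                   # *(-1) node: gives ~ 0.20

# the add gate distributes, the multiply gates swap
dw0 = x0 * d_neg                       # ~ -0.20
dx0 = w0 * d_neg                       # ~  0.39
dw1 = x1 * d_neg
dx1 = w1 * d_neg
dw2 = 1.0 * d_neg                      # ~  0.20
```

Note that d_neg equals f * (1 - f), which is exactly the sigmoid shortcut discussed next.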
[00:59:14] There are many different ways to draw a computational graph; the one I explained was not the only option. We can actually lump all of these functions together and define a sigmoid, because this is basically the sigmoid of a linear function: the linear function could be one node, and then all of the remaining operations could be defined as a sigmoid. And sigmoid is interesting and very useful, because its local gradient depends on the sigmoid itself: the derivative of sigmoid with respect to the variable x, if we do the calculations and simplify, is (1 - sigmoid(x)) multiplied by sigmoid(x). So it's actually a very useful function.
[01:00:08] To calculate the downstream gradient here: the upstream gradient was 1.00, and if I calculate the local gradient, this sigmoid expression with x replaced by the input (which was 1.00), and multiply it by one, I get 0.20, which is exactly the same value we got before by doing it step by step. [01:00:42] I want to summarize and say that there are a few gradient patterns for common nodes that we can actually kind of memorize.
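The sigmoid shortcut mentioned above is easy to check numerically; a quick sketch comparing the analytic local gradient sigmoid(x) * (1 - sigmoid(x)) against a centered finite difference:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 1.0                  # the input value from the example above
s = sigmoid(x)           # ~0.73
local = s * (1.0 - s)    # analytic local gradient: ~0.20

# centered finite-difference approximation of the derivative
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)
```

The two values agree to well within the finite-difference error.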
[01:01:01] There is the add gate: the add gate is always a gradient distributor, because of the properties of addition that I explained; the gradient going to each input remains the same as the upstream gradient. For the multiplication gate, it's a swap: again, I told you the gradient of xy with respect to x is y, and with respect to y it is x, so the values are swapped. Then there is the copy gate: in the copy gate, the backward operation is just an addition of the gradients coming back to the node. And ultimately there's the max gate, which is something we use quite often, very similar to the ReLU function. Because the max gate takes the max between its inputs, you just relay, or route, the gradient towards whichever input had the max value.
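These four patterns can be written down directly. A small sketch (the variable names and values are mine):

```python
upstream = 2.0

# add gate: distributor; each input receives the upstream gradient unchanged
d_add_x, d_add_y = upstream, upstream

# multiply gate: swap; for z = x * y, each input gets the *other* input's value
x, y = 3.0, -4.0
d_mul_x, d_mul_y = y * upstream, x * upstream   # -8.0 and 6.0

# copy gate: backward is the sum of the gradients from each branch
d_branch1, d_branch2 = 0.5, 1.5
d_copy = d_branch1 + d_branch2                  # 2.0

# max gate: router; the winning input takes the whole gradient
a, b = 5.0, 2.0
d_max_a = upstream if a >= b else 0.0           # 2.0
d_max_b = upstream if b > a else 0.0            # 0.0
```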
[01:02:16] So with that, it's very simple to implement a neural network: in the forward pass we compute all of the steps, and then in the backward pass we start computing the gradients step by step. As I explained, the gradient of the loss function with respect to itself is always one, and then we start from the end of the network and go up. You can see here that we are going up: this is the sigmoid function, calculating the gradients; then going up, that was the add gate; we had another add gate; and then we had two multiply gates, which gives us a very simple implementation. [01:03:04] With this type of formulation, what we can do is modularize every function in the neural network and create forward and backward APIs for every single function we need.
[01:03:29] So in this case, this is a multiplication gate. Because for multiplication we need to access the inputs in the backward pass, we often save them (memorize them), then calculate the forward pass values, and then in the backward pass calculate the gradients. This means we can write our functions with the forward and backward passes all in one place. And this is how PyTorch operators look right now. If you look at the sigmoid layer, for example, there is the forward pass, although it's not implemented in this very function: it's actually implemented somewhere else, in the C++ code. And then the backward pass of sigmoid also calculates the same function that we just talked about.
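The modular forward/backward pattern can be sketched as a small Python class. This mirrors the idea described in the lecture, not PyTorch's actual internals; the class name is mine.

```python
class MultiplyGate:
    """A gate exposing a forward/backward API, caching inputs for the backward pass."""

    def forward(self, x, y):
        # save the inputs: the backward pass needs them for the local gradients
        self.x, self.y = x, y
        return x * y

    def backward(self, upstream):
        # downstream = local gradient * upstream gradient (the swap rule)
        dx = self.y * upstream
        dy = self.x * upstream
        return dx, dy

gate = MultiplyGate()
z = gate.forward(2.0, -1.0)    # forward pass: -2.0
dx, dy = gate.backward(0.2)    # backward pass: (-0.2, 0.4)
```

Chaining many such gates, each one only ever sees its own inputs and an upstream gradient, which is exactly the local property backpropagation relies on.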
[01:04:22] So far, everything we've said (and I've actually covered most of the examples I wanted to cover) used scalar values; all of the examples were just scalars. But we know that all of these operations can actually be implemented in vector or matrix form. Expanding on that piece: we've talked about the scalar-to-scalar setting, where for inputs x and y both scalars, the derivative is also a scalar, which tells us how much the value of y changes if we change x by a small amount. If it's now vectorized, if x is a vector of n elements and y is a scalar (a vector-to-scalar derivative), then the derivative will also be a vector,
and every single element in that vector tells us: if we change that element of x by a small amount, how much does y change? The entire y, because y is just one single value. [01:05:55] Then there are also vector-to-vector settings, where x and y are both vectors, of arbitrary sizes n and m. In those cases the derivatives form a matrix, or what we call a Jacobian. For each element of x, if it changes by a small amount, this derivative tells us how much each element of y will change. Again, look at the subscripts here: they can be different for every single element in this Jacobian, and each has a clear meaning.
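As a concrete sketch of the vector-to-vector case, here is a small hypothetical function from R^3 to R^2, its 2-by-3 Jacobian J[i, j] = dy_i / dx_j, and a finite-difference check. The function itself is my example, not one from the slides.

```python
import numpy as np

def f(x):
    # hypothetical example: f maps R^3 -> R^2
    return np.array([x[0] + 2.0 * x[1], x[1] * x[2]])

x = np.array([1.0, 2.0, 3.0])

# analytic Jacobian, shape (output dim, input dim) = (2, 3)
J = np.array([[1.0, 2.0, 0.0],
              [0.0, x[2], x[1]]])

# numerical check: perturb each input element in turn
eps = 1e-6
J_num = np.zeros((2, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    J_num[:, j] = (f(x + e) - f(x - e)) / (2.0 * eps)
```

Column j of the Jacobian is exactly "how all outputs move when input j is nudged", which is what the loop computes numerically.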
[01:06:48] And here is how we can see and visualize it: if you want to backprop with vectors, say x, y, and z are vectors of sizes Dx, Dy, and Dz. Again, the loss L is always a scalar, because that's the one value we want to minimize. But calculating the upstream gradient results in a vector, dL/dz, of the same size as its variable z, and the same story happens with the downstream gradients. Actually, before going to the downstream gradients, let me tell you a little bit about the local gradients, where we have the gradient of z with respect to x and y. This is the part where, as I said, there will be Jacobians, because now the gradients turn into matrices.
[01:08:04] So we have two Jacobian matrices here, whose sizes are the size of the input multiplied by the size of the output. This results in downstream gradients that are the multiplication of the upstream gradient and the local gradient, and there we get the same size as the input x itself. So we will have a vector again here, because the input was a vector; the gradient has the same size. I just mentioned that the gradient of a variable with respect to the loss always has the same dimensionality as the original variable itself, as also shown in this slide. [01:09:02] Backprop with vectors: that was just one example. Let's say we have a function which is the max of zero and x; that's the ReLU function. This is an elementwise function that takes a max between zero and the input: if the input is non-negative it passes through; otherwise it is replaced with a zero.
[01:09:28] Assume you get some upstream gradients; now we need to build a Jacobian matrix here. In this case, because this is an elementwise operation, it doesn't have any dependence on any of the other inputs; the only dependence is on the value itself. So this is a very sparse matrix: it only has values on the main diagonal, and those values are either zero or one, depending on whether the value was passed through or a zero was put in its place. Multiplying it by the upstream gradient gives us the downstream gradient, and this is how the calculations are done.
[01:10:26] As I said, the Jacobian here is sparse, because in this case the operation is elementwise. So in the backward pass, instead of calculating that huge sparse Jacobian matrix, what we do is just use a rule-based calculation of the gradient for this max function. We don't actually store or compute that matrix, because we know how the function operates. [01:11:01] This can also be extended to matrices, and even tensors, if the inputs are not vectors but higher-dimensional data.
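That shortcut is easy to verify: for ReLU, the explicit diagonal Jacobian and the simple elementwise mask give identical downstream gradients. A small numpy sketch with values of my choosing:

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -0.5])       # input to ReLU
upstream = np.array([4.0, 4.0, 4.0, 4.0])  # upstream gradient dL/dy

# explicit (sparse) Jacobian: 0/1 entries on the main diagonal only
J = np.diag((x > 0).astype(float))
dx_jacobian = J @ upstream

# rule-based shortcut: never build the Jacobian at all
dx_fast = np.where(x > 0, upstream, 0.0)
```

Both give [4, 0, 4, 0]: the gradient passes through where the input was positive and is zeroed elsewhere, with no large matrix ever materialized.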
[01:11:16] In those cases, again, the gradients with respect to the variables are of the same size as the specific variable, and calculating the upstream and downstream gradients is done the same way as we discussed and showed earlier for vectors. When it comes to the local gradients, however, it's going to be a huge matrix, a huge Jacobian, because we have a matrix as the output and a matrix as the input, and the local gradient will have the size of the input multiplied by the size of the output. So it's going to be a huge matrix by itself. Let me give you an example: here x and w are the inputs to a node, a gate performing matrix multiplication, and it generates this y as the output.
[01:12:30] Calculating the derivative of L with respect to y gives us these Jacobian matrices. Say we have a mini-batch size of 64 and the dimensionality of those matrices is 4096; then that huge Jacobian matrix would be over 256 GB for just one single matrix multiplication. [01:13:00] So in order to simplify this, what we often do is look at the values and how they impact each other: for example, which parts of y will be affected if one element of x changes. The element x[n,d] affects just one row of the output. This basically helps us see that, for calculating each of these nodes, we don't need to create the huge Jacobian.
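The 256 GB figure follows directly from the Jacobian's shape. With batch size N = 64 and feature dimensions D = M = 4096 (as quoted above), dY/dX has one row per output element and one column per input element:

```python
# Jacobian of Y = X @ W, with X of shape (N, D), W of shape (D, M)
N, D, M = 64, 4096, 4096

entries = (N * M) * (N * D)   # (rows = output elements) x (cols = input elements)
bytes_fp32 = entries * 4      # 4 bytes per float32
gib = bytes_fp32 / 2**30      # 256 GiB for a single matrix multiply
```

Hence nobody ever materializes this matrix; the backward pass is written directly in terms of smaller matrix products instead.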
[01:13:46] We can actually write the backward pass functions specifically for matrix multiplication in a more efficient way. I'm almost done, so let's answer this question: how much does x[n,d] affect the value of y[n,m]? This is y[n,m], which is getting impacted by x[n,d]. How much does it get impacted? In other words, what should I place as its gradient with respect to the specific value x[n,d]? Just to remind you, this is a multiplication operation. [01:14:37] In multiply gates it should be a swap, right?
[01:14:45] So the answer to this question is some value in W. Remember that we had this multiplication gate, which was a swap multiplier, so there is a swap happening here. How much an element of X affects one of the elements in Y is given by an element of W: the one in row d (determined by the X index) and column m (determined by the Y index). So it's swapping the values, the same swap as before, but now we have to look at the giant matrices and find which specific element it should be. And based on that, we can actually replace the entire thing with matrix operations. The gradient of L with respect to X will be defined by this simple matrix operation, and the gradient of L with respect to W will be defined by this very simple multiplication. Again there is a swap: for X we include the entire W.
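The two matrix-form gradients just described can be sketched in a few lines of numpy, with a finite-difference check on one element (the sizes here are tiny and illustrative):

```python
import numpy as np

# Backward pass for Y = X @ W without forming the giant Jacobian:
#   dL/dX = dL/dY @ W.T   (for X we include the entire W: the "swap")
#   dL/dW = X.T @ dL/dY   (for W we include the entire X)
rng = np.random.default_rng(0)
N, D, M = 4, 5, 3
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, M))
dY = rng.standard_normal((N, M))   # upstream gradient dL/dY

dX = dY @ W.T                      # shape (N, D), same as X
dW = X.T @ dY                      # shape (D, M), same as W

# Numerical check of one element, using L = sum(Y * dY) so that the
# upstream gradient dL/dY is exactly the dY chosen above.
eps = 1e-6
Xp = X.copy()
Xp[0, 0] += eps
num = (np.sum((Xp @ W) * dY) - np.sum((X @ W) * dY)) / eps
assert np.isclose(num, dX[0, 0], atol=1e-4)
```

Note that each gradient has the same shape as the tensor it corresponds to, which is a useful sanity check when deriving these expressions.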
[01:16:04] For W we include the entire X and do the multiplications. These formulas make it easy to implement larger and harder operations in the backward passes. All right, we're done. Just to summarize, we talked today about fully connected neural networks. We went through all the steps needed for backpropagation, the forward passes and backward passes, and next session we will be getting into the topic of convolutional neural networks. Thank you.

================================================================================ LECTURE 005 ================================================================================ Stanford CS231N | Spring 2025 | Lecture 5: Image Classification with CNNs Source: https://www.youtube.com/watch?v=f3g1zGdxptI --- Transcript

[00:00:05] Today we're going to be talking about image classification with CNNs. So you might be wondering who am I? I'm a new face; you haven't seen me before in this class. I'm Justin. I'm the fourth mystery instructor in this class.
[00:00:17] I think my picture's been on the website, but it's my first time here today. A little about me: I did my PhD here at Stanford from 2012 to 2018, working with Fei-Fei on deep learning and computer vision, pretty much all tasks in computer vision around that time. During my time here at Stanford, I was lucky enough to initiate CS231N with Andrej and Fei-Fei and others, and to teach it quite a few times: 2015, '16, '17, '18, and '19 at Stanford. After that I spent time at Facebook AI Research doing all kinds of deep learning and computer vision work there. Then I was a faculty member at the University of Michigan, where I taught basically the same class a couple more times. So I've taught this class a few times, but it's been a while since I've been here. Most recently I've been doing a startup called World Labs with Fei-Fei.
[00:01:07] And that's just a little bit about me. Now, about where we are in this class. We're at an interesting point right now, where the class is divided up into a couple of different segments, and we've basically finished the first segment, which is about deep learning basics. This is really cool, because the material you've seen in just four lectures covers all the fundamentals of deep learning: you now know the whole pipeline of basic pieces that go into building a deep learning system. So I thought it would be useful, at this inflection point, to step back and recap some of the major themes we've seen in the first bit of the course. The first is the idea of image classification with linear classifiers.
[00:01:50] This was meant as a toy problem to give you a sense of the kind of problem you might solve with deep learning. Usually the first step in solving a deep learning problem is to define your problem so that you take some grid of numbers, some tensors, as input, produce some tensors as output, and formalize the task as this input-output mapping between tensors. We did that in the image classification setting by saying that we want to classify images into a bunch of human-understandable categories. The inputs are grids of pixel values arranged in three-dimensional tensors. The outputs are scores, one per category, giving the affinity or the degree to which the image belongs to each category.
[00:02:27] You define a set of categories in advance, and the network is supposed to predict high scores for categories the image is likely to be, and low scores for the categories it is likely not to be. Then we can set up a problem where we use a weight matrix, multiply it against the image pixels, and predict these scores. We saw that there are a couple of different viewpoints, a couple of different ways we can interpret these linear classifiers. This basically sets up a functional form saying that we can predict scores for images, if only we have a weight matrix W. So then the question is: how do we select a good weight matrix W? And for that we go to loss functions. Right?
[00:03:03] Loss functions are these things that tell us, given a particular value of the weight matrix and a particular dataset, how well this weight matrix solves the problem on that dataset. In particular, we saw some examples of loss functions that are commonly used for classification problems, including the softmax loss and the SVM loss as well. Okay, so now we've gotten a little further along: we've set up the problem of image classification, we have a model for solving that problem using linear classifiers, and we have a way to tell whether our solutions are good using a loss function. But now we actually need to search for a good solution in that space, and that's where optimization comes in. Right?
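The softmax loss mentioned in the recap can be sketched in a few lines of numpy (a minimal, numerically stabilized version; the function name and toy scores are illustrative):

```python
import numpy as np

def softmax_loss(scores, labels):
    """Mean cross-entropy loss. scores: (N, C), labels: (N,) int class ids."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

scores = np.array([[2.0, 1.0, 0.1]])
# Loss is small when the correct class has the highest score,
# and large when the label points at a low-scoring class.
print(softmax_loss(scores, np.array([0])))
print(softmax_loss(scores, np.array([2])))
```

A weight matrix that makes this number small across the whole dataset is exactly what the optimization step below searches for.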
[00:03:43] So now you think of defining this optimization landscape, where the horizontal plane holds all the different possible settings of your weight matrix, and the loss function is the height of that surface. A high loss is bad, because losing things is bad, so you want low loss. The purpose of optimization is to somehow traverse this space, slide down this manifold, and find a point at the bottom with very low loss. Each point in this space corresponds to a weight matrix, so by sliding down that space we find a good weight matrix that solves our problem and gives us a good solution to our task. In particular, we saw a couple of different optimization algorithms that are commonly used in deep learning pipelines.
[00:04:24] Stochastic gradient descent, usually with momentum, RMSProp, and Adam. One interesting topical note: right now, one of the biggest deep learning research conferences is ICLR, the International Conference on Learning Representations, and just yesterday ICLR 2025 gave its Test of Time award to the Adam paper, because the paper that introduced the Adam optimizer was published at ICLR ten years ago, in 2015. A lot of academic conferences tend to give Test of Time awards to some of the most impactful papers from ten years prior. So just yesterday, the Adam optimizer that you saw in this class received this very prestigious Test of Time award at ICLR 2025.
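The Adam update just mentioned can be sketched as follows, using the commonly quoted default hyperparameters; the function name and the toy objective (minimizing w squared) are illustrative:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (RMSProp-like)
    m_hat = m / (1 - beta1**t)               # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2, whose gradient is 2w.
w, m, v = np.array([1.0]), 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.1)
print(w)  # moves toward the minimum at 0
```

The same loop structure works for SGD with momentum or RMSProp; Adam is essentially the combination of the two moment estimates above.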
[00:05:06] I thought that was pretty cool, and a nice way to connect what you've been learning to what's happening right now in the machine learning community. Okay. So at this point we've got our linear classifiers, we've got our loss functions, and we can optimize them. Now we're almost good to go, but we ran into a problem: the linear classifiers that we started with are actually not very powerful. We saw two different ways of looking at this deficiency in linear classifiers. One was the visual viewpoint, where you can interpret a linear classifier by thinking of the learned weight matrix as an image: you learn one image template for each of the categories you're trying to classify against.
[00:05:46] If you think about it that way, we realize that each row of that weight matrix is one template, so the linear classifier needs to summarize all of its knowledge about each category into just one template. That's just not a very powerful classifier. This shows up in the visualized templates of a learned linear classifier, where you can see that for categories like car, the template looks like a red blob. But cars don't have to be red, right? What if your car is blue or purple or green or something else? There's just no good way for a linear classifier to capture the notion that objects in a category might have many different appearances.
[00:06:27] Or, from the geometric viewpoint, if we imagine each point of our dataset as a point in high-dimensional space, then a linear classifier is basically carving up that space with hyperplanes. That's really good if all your categories actually do lie in linearly separable regions of the space, but there's no reason to expect that to be true in general. So these are two big deficiencies that we ran into when applying linear classifiers to image classification problems. That led us to define the notion of neural networks, where we generalize our linear classifiers to no longer have just one weight matrix, but instead stack two weight matrices on top of each other, with a nonlinearity in between them.
[00:07:06] And now this gives us a much more powerful mechanism for predicting scores from our inputs. The problem is still the same: we have our input pixels going through some computation and spitting out scores; we just selected a different functional form for the score function. The algebra is pretty simple: you go from f = Wx, add an extra W2, and put a little nonlinearity in between. So the algebra doesn't change very much, but in doing so your classifiers get much, much more powerful than they were before. But now things get a little bit complicated again, because how does this play into optimization, right? We know that if we have a loss function and a model, then we want to find values of those weight matrices that cause the loss to go down.
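The two-layer form just described, f(x) = W2 max(0, W1 x), can be sketched in numpy; the dimensions here are illustrative (a flattened 32x32x3 image, 100 hidden units, 10 classes):

```python
import numpy as np

# Two-layer network scores versus the linear classifier f(x) = W @ x.
rng = np.random.default_rng(0)
D, H, C = 3072, 100, 10                  # input dim, hidden dim, num classes
x = rng.standard_normal(D)               # a flattened image
W1 = rng.standard_normal((H, D)) * 0.01  # small random initialization
W2 = rng.standard_normal((C, H)) * 0.01

h = np.maximum(0, W1 @ x)                # hidden layer with ReLU nonlinearity
scores = W2 @ h                          # one score per class
print(scores.shape)                      # (10,)
```

Without the max(0, .) in between, the two matrices would collapse into a single matrix W2 @ W1 and the model would be no more powerful than the linear classifier; the nonlinearity is what adds the expressive power.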
[00:07:51] And to do that, we need to be able to compute gradients of the loss with respect to all the parameters of our model. That's the notion of a computational graph. Computational graphs are basically a data structure for organizing the computation of a neural network, where each node in the graph is a little functional primitive, like a matrix multiply or a ReLU or something else like that. Data flows forward in this graph from left to right, from our inputs and weights on the left, through all the intermediate nodes, to spit out the loss function on the right. And then once we compute the loss, we traverse the graph backwards, from right to left, to compute gradients of that loss with respect to all the nodes of the graph inside the network.
[00:08:33] And this is really cool, because it basically means that we can write down arbitrarily complicated neural networks, arbitrarily complicated expressions for computing our outputs from our inputs, and we have a nearly automated algorithm for computing whatever gradients we want through arbitrarily complex neural networks. The way we do that is the magic of backpropagation. And backpropagation is really cool.
[00:08:56] I think it's one of the algorithms that makes deep learning work, because it takes the global problem of computing gradients of the loss through the computational graph and converts it into a local problem. Each node doesn't need to know anything about the larger context: what graph am I living in, what problem am I trying to solve. We just need to define these little nodes inside our computational graph that, on the forward pass, know how to compute outputs from their inputs, and on the backward pass can receive gradients coming from upstream. A node doesn't have to care where those gradients come from or what caused them; it just needs to compute downstream gradients with respect to its inputs, given its upstream gradients.
[00:09:40] Again, this is so powerful because it gives us a mechanism where we can define a bunch of different types of nodes that all follow this local API of computing outputs and computing local gradients. As long as we follow that API for all of the nodes, we can stitch them together into big, complicated computational graphs that can do basically arbitrary computation, and the gradients just come for free when we turn the crank on the backpropagation algorithm. The slide that you saw last time is basically backpropagation on scalar values, but we can generalize this to work on vector-valued or matrix- or tensor-valued quantities as well. The basic thing to remember is that your inputs are some tensors and your outputs are some tensors.
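The local forward/backward API just described can be sketched with the multiply gate from the previous lecture; the class name is illustrative:

```python
# Each node only knows two things: how to compute its output from its
# inputs (forward), and how to turn an upstream gradient into downstream
# gradients using locally cached values (backward).
class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y        # cache inputs for the backward pass
        return x * y

    def backward(self, dout):
        # local gradients: d(xy)/dx = y, d(xy)/dy = x  (the "swap")
        return dout * self.y, dout * self.x

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)        # forward: -12.0
dx, dy = gate.backward(2.0)          # upstream gradient 2.0
print(dx, dy)                        # -8.0 6.0
```

Any graph built from nodes obeying this interface can be differentiated end to end: backpropagation just calls backward on each node in reverse topological order.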
[00:10:27] Your upstream gradient is the gradient of the loss with respect to your outputs, and it always has the same shape as your outputs, right? Because the loss is a scalar. The gradient of a loss with respect to a tensor says, for each element in the tensor, if I were to wiggle that element a little bit, how much does the loss wiggle? And because the loss is a scalar, we just need to imagine wiggling each element of the tensor independently; that is the definition of the gradient. So it's very easy to remember: your upstream gradients always have exactly the same shape as your outputs, and your downstream gradients, the gradients with respect to your inputs, have the same shape as your inputs. Right?
[00:11:06] So then the backpropagation algorithm is basically just the chain rule, where I need to somehow compute my downstream gradients as a function of my upstream gradients and whatever function I was trying to compute. And you'll get some practice on later assignments writing down the gradient expressions for different kinds of operators in your neural networks. So basically this gives us our recipe for solving pretty much any problem in deep learning, right? This was intended to be quite a bit more general than just image classification, or just linear classifiers, or just fully connected networks. Now, if you have any kind of problem that you want to solve, you just need to encode it as tensors, write down some computational graph that computes your output tensors from your input tensors, and collect a data set of input-output tensors.
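As a taste of the kind of gradient expressions those assignments ask for, here is a hedged sketch (the shapes and names are made up for illustration) of the backward pass for a fully connected layer y = x @ w, checked against a numeric gradient on a scalar loss:

```python
import numpy as np

# Forward and backward for a fully connected layer y = x @ w.
# Shapes: x is (N, D), w is (D, M), y is (N, M); the upstream gradient
# dy has the shape of y, downstream gradients match x and w.
def fc_forward(x, w):
    return x @ w

def fc_backward(dy, x, w):
    dx = dy @ w.T    # (N, M) @ (M, D) -> (N, D), same shape as x
    dw = x.T @ dy    # (D, N) @ (N, M) -> (D, M), same shape as w
    return dx, dw

# Numeric gradient check on the scalar loss L = sum(y):
x = np.random.randn(2, 3)
w = np.random.randn(3, 4)
dy = np.ones((2, 4))                  # dL/dy for L = y.sum()
dx, dw = fc_backward(dy, x, w)

eps = 1e-6
i, j = 1, 2                            # perturb one element of x
xp = x.copy()
xp[i, j] += eps
numeric = (fc_forward(xp, w).sum() - fc_forward(x, w).sum()) / eps
assert abs(numeric - dx[i, j]) < 1e-4  # analytic matches numeric
```

The same wiggle-one-element numeric check works for any operator, which makes it a handy way to debug a hand-derived backward pass.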
[00:11:47] Write down a loss function for the kind of problem you want to solve, and then optimize that loss function using gradient descent, using backpropagation. And that's a really powerful recipe that basically powers all deep learning applications. Whether it's image classification, image generation, or large language models, pretty much anything involving a neural network is trained using this formula or some slight variant on top of it. So that leads us to the second part of the class, which is perceiving and understanding the visual world. Here is where we want to get a little bit more specialized and start talking not about the general framework of deep learning, but about how it applies to the problems that we want to solve in computer vision: processing images and doing interesting things with images.
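The five-step recipe above can be sketched end to end on a toy problem. Everything here (the data, the sizes, the learning rate) is made up for illustration; the point is just the shape of the loop: tensors in, computational graph, loss, gradient descent.

```python
import numpy as np

# 1) encode the problem as tensors, 3) collect input/output pairs
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # inputs
true_w = np.array([[1.0], [-2.0], [0.5]])
Y = X @ true_w                                # targets
w = np.zeros((3, 1))                          # learnable weights

for step in range(500):
    pred = X @ w                              # 2) the (tiny) graph
    loss = ((pred - Y) ** 2).mean()           # 4) mean squared error loss
    dw = 2 * X.T @ (pred - Y) / len(X)        # backprop through the graph
    w -= 0.1 * dw                             # 5) gradient descent step

assert loss < 1e-6                            # the loop converged
```

Swapping the one-matrix graph for a deep network, and the squared error for a classification loss, gives the same loop that trains essentially every model mentioned in the lecture.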
[00:12:30] And today we'll take a step towards that by talking a bit more about convolutional networks. Convolutional networks are actually a pretty small lift on top of this framework that we've already defined, right? We have this general paradigm of computational graphs, and of little operators that can live inside of our computational graphs. So we have this beautiful framework, but we actually haven't filled in a lot of its specifics. We've only seen two or three different kinds of nodes that can live inside of our computational graphs. We've seen fully connected layers, which are basically a matrix multiply.
[00:13:06] We've seen activation functions like our ReLU, and we've seen our loss functions themselves. Now, to build up from what we've seen already into convolutional networks, basically all we need to do is add a couple of new types of nodes that can fit into our computational graphs. In particular, there are really only two operators that we need to talk about to build much more powerful networks: the convolution layer, which we'll spend most of today's lecture talking about, and the pooling layer, which is another thing that we often use when processing images. So that's the road map for today. I want to talk a little bit about convolutional networks in general, and then we'll talk about these two particular computational primitives that we can use to build convolutional networks in our computational graphs.
[00:13:51] Okay, so here we want to step back a little bit and think about this problem of image classification again. We've already talked about how image classification is this super core problem in computer vision, where we want to take an input image and predict what is in that image; basically, predict one of K category labels. This image obviously is a cat, so we want to predict the cat label. And we've basically solved this problem in some sense already, by building linear classifiers and by building fully connected multi-layer perceptron neural networks. But these networks are basically operating in pixel space. Remember, we said the first step to solving a
deep learning problem is to formulate it in terms of input-output tensors. [00:14:35] Well, in this case our input tensors were the raw pixel values of our images. So when we write f(x) = Wx, that x input is just the literal values of all of our pixels, and then we go from those raw pixel values to our class scores. But there's another way to do it, which was common back in the dark ages before neural networks came about and saved us from all this tedium, maybe from the early 2000s up until 2010 or 2011-ish: this idea of feature representations. So here the idea is that you can actually choose what is going to be the input to your neural network.
[00:15:12] So you could have said: rather than feeding the raw pixel values of the image into our neural network, we could instead define some other kind of function which is going to extract features, converting those pixel values of our image into some other, meaningful representation that we, as the intelligent human designers of this system, believe captures some of the important facets of our input image. So then, if you're doing image classification on top of a feature representation, your step one would be to define a feature representation that converts your raw image pixels into this higher-level representation. And now that feature representation will be the X that feeds into your linear classifier.
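In code, the only change is what feeds the classifier. A minimal sketch, where phi is a deliberately made-up, hand-designed feature extractor standing in for the kind people used to build:

```python
import numpy as np

# Hypothetical hand-designed feature extractor: three summary statistics
# of the image stand in for a real engineered representation.
def phi(img):
    return np.array([img.mean(), img.std(), img.max()])

W = np.random.randn(5, 3)          # 5 classes, 3 hand-designed features
img = np.random.rand(32, 32)       # toy grayscale "image"
scores = W @ phi(img)              # scores = W * phi(x) instead of W * x
assert scores.shape == (5,)
```

The linear classifier is unchanged; only phi(x), not the raw x, is what it ever sees, and only W is learned.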
[00:15:55] And there was a ton of work in computer vision, really from the 2000s to the early 2010s-ish, that used this idea of feature representations for all kinds of tasks. And I don't really think it's useful to go into super great detail on any of these particular feature representations because, spoiler alert, they got deprecated like 10 years ago. But it's useful to have a flavor for what they might have looked like. So one example of a kind of feature representation that people sometimes used is this notion of a color histogram. Here, maybe we think that somehow the distribution of colors in our image might be a useful thing for a classifier to look at or care about, right?
[00:16:36] Because maybe you're building a fruit detector, an apple detector, and you want to know if it's ripe or not, maybe telling a red apple from a green apple. Knowing how much red or green is in the image might be something that we as humans think is useful for the network to know when making its classifications. So here we could try to build a feature representation that captures that intuition. What we might do is take the space of all possible colors, discretize that space into some set of buckets, and then, for every pixel in our image, map that pixel to one of the discrete buckets in our color space. Then our feature representation basically becomes something like a count of how many pixels in the image fall into each color bucket.
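The bucket-and-count idea sketched here fits in a few lines. This is an illustrative implementation, assuming an RGB image stored as an (H, W, 3) uint8 array; the bucket count of 8 per channel is an arbitrary choice:

```python
import numpy as np

# Color histogram feature: discretize each channel into `bins` buckets,
# count pixels per joint color bucket, normalize. All spatial structure
# is thrown away; only the color distribution survives.
def color_histogram(img, bins=8):
    # map each 0..255 channel value to a bucket index 0..bins-1
    idx = (img.astype(np.int64) * bins) // 256            # (H, W, 3)
    # combine the three channel buckets into one joint color bucket
    joint = (idx[..., 0] * bins + idx[..., 1]) * bins + idx[..., 2]
    counts = np.bincount(joint.ravel(), minlength=bins ** 3)
    return counts / counts.sum()                          # normalized

img = np.zeros((4, 4, 3), dtype=np.uint8)   # all-black toy "image"
feat = color_histogram(img)
assert feat.shape == (512,) and feat[0] == 1.0  # every pixel in bucket 0
```

Shifting content around in the image leaves this feature unchanged, which is exactly the spatial-structure-destroying behavior described next.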
[00:17:12] And now this is kind of an interesting representation, because it destroys all the spatial structure of the image and talks only about the color distributions. So if you had red in one corner versus red on the other side, those two images would look the same to this color-histogram feature, but they would look very different from the raw-pixel perspective. So the color histogram is one basic kind of feature extractor, or feature representation, that you can build on images, and it looks only at color, not at spatial structure at all.
[00:17:43] Another category of feature representations that people used to look at is sort of the dual to that: these histograms of oriented gradients. I don't think it's useful to talk too much about exactly how these are computed, but the intuition is that they basically throw away the color information and look only at the structure information. They basically want to ask, for every point in the image, what is the local direction of the edges in the image around that local region? So here you can see that for the leaves around this frog, it kind of extracts diagonal types of features, because they correspond to these diagonal structures over here. Or, around the frog's eyes, you can see it sort of captured those circular structures.
[00:18:22] So again, it's not super useful to see exactly how this is computed, but it's useful to know that these are the kinds of features that people designed for images maybe a decade or a decade and a half ago. And people combined these in all kinds of complicated ways. You might wonder: oh, what's the best feature representation? The usual answer was just to stack them all together. So a pretty common approach would be to take a bunch of different feature representations, extract them all from your image, and then concatenate them all into one big feature vector. And that becomes the feature representation for your image. And now you could imagine, once we have this feature representation, we can basically stick whatever kind of classifier we want on top of it.
[00:19:02] And it's interesting to then take a step back and contrast those two viewpoints of the whole system. So system A is a feature extractor plus a learned linear classifier on top of your features, and system B is end-to-end neural networks. And they actually don't look that different, if you take a step back and think about it in the right way. Both of these systems are ultimately inputting the raw pixels of the image and outputting some scores or predictions about the image. The difference is which part of the system is designed by humans versus which part is learned via gradient descent. In the feature-extraction-plus-linear-classifier paradigm, the feature extraction portion of the system is designed; that could be a bunch of really hairy C code or hairy MATLAB code.
[00:19:49] And you don't want to think about the details of what's going on inside of that. And then the part that you're learning via gradient descent, the part that you're learning from your training data, is just that classifier on top of the feature extractor. Whereas the neural network approach is basically saying: gradient descent is probably a better programmer than you, and lots of data probably knows more about your problem than you do. So the intuition of these neural network classifiers is that there's still ultimately going to be a system that inputs the raw pixel values and spits out your classification scores at the end. But the difference is that all parts of that system, from the raw pixels all the way to the final classification scores, will be tuned via gradient descent and will be learned from your training data set.
[00:20:32] So the intuition is that in this feature extraction paradigm there might be some bottlenecks. You as a human might get something wrong. You might have wrong intuitions about what parts of the problem are important and what things are not, or it might be really hard for you to write down the perfect feature extractor that solves your problem. And this end-to-end learning approach of convnets, and really of deep learning more generally, is just saying that data and compute can likely solve that problem better than you as a human designer can. And this paradigm has basically won out over the past decade and a half, for lots and lots of problems, repeatedly. Okay.
[00:21:08] So that kind of gives an intuition. So then the question is: for the particular problem of images, how should we design these end-to-end systems? It's not going to be a fully connected network all the way; that would be a little bit silly. We do still need to put a little bit of design into the system, right? But the difference between designing a neural network and designing a feature extractor is that in designing a neural network, you're not designing one particular feature-extractor function. You're kind of defining a whole category of functions, where the category is defined by the structure of your computational graph, by the sequence of operators that get run. But there's some flexibility in that system, because you're leaving the weights of the system free to be learned from data.
[00:21:46] But the role of the human designer still matters. You still need to decide: what is the architecture of your network? What is the sequence of operators that get stitched into a computational graph? What are the sizes of all the matrices involved at every stage of processing? So there still is a lot of room for the human to design parts of the problem in this deep learning era, but what you're designing is a little bit different. So this is basically where we start to see the deficiencies in the tools that we have so far for solving this problem, right? Because we've seen linear layers, we've seen fully connected networks, and the only kind of neural network architecture that we've seen is to flatten the pixels of our image into a big vector.
Do a matrix multiply, do a ReLU, do more matrix multiply, do more ReLU, and that's about it; that's all we know how to do at this point. [00:22:34] And one big problem with that is that it destroys the spatial structure of the images. There's this big problem, right? Images are actually not one-dimensional objects. Images are two-dimensional; they have two-dimensional structure, and that two-dimensional structure matters for the content of those images. And when you build a linear classifier on raw pixels by stretching them out into a big vector, you're basically ignoring that important aspect of your input data in the design of your neural network architecture. So when we think about designing neural network architectures for images in particular, we want to think: what are other designs for our network?
[00:23:09] What are other computational primitives we can slot into our computational graphs that better respect the two-dimensional structure of images? And that leads us to convolutional networks, right? Convolutional networks are basically a category of neural network architectures built from linear layers, nonlinearities, convolution layers, pooling layers, and sometimes a couple of others, stitched together into architectures that input raw pixel values and output some kind of prediction or scores for our images. The general structure is that they'll usually have some prefix, the body of the network, which is an interleaved sequence of convolution layers, pooling layers, and nonlinearities that can be thought of as extracting a useful feature representation for the image.
[00:23:54] And then on top of that there will usually be some kind of fully connected layers, sometimes as few as one but sometimes more than one, which you can think of as a multi-layer perceptron classifier that lives on top of and ingests the features from the convolutional portion of the network. But crucially, this whole system is tuned end to end via gradient descent by minimizing the loss on your training data set. And these networks actually have quite a long history.
[00:24:25] This particular convnet architecture that we've drawn on the screen actually comes from a paper back in 1998 by Yann LeCun, Léon Bottou, and others, who were building these convolutional neural networks all the way back in 1998 to perform the task of digit classification. And it actually worked pretty well, but it was really expensive: they didn't have GPUs, they didn't have TPUs, they didn't have the kind of compute resources we have today. But the underlying algorithm, the underlying network architecture, looks pretty similar in 1998 to the kinds of architectures that people were using well into the 2010s. And then zooming forward from 1998 up until 2012, that's when the AlexNet architecture came out, and this was kind of a big boom, a giant explosion of deep learning
[00:25:12] especially in computer vision. I think we talked about this in some earlier lectures, but the AlexNet architecture again doesn't look that different from that LeCun LeNet architecture from 1998: it's a bunch of convolutional layers and fully connected layers. It's bigger, there are more layers, and the layers have more units in them, but it's still trained end to end with backpropagation to minimize some fairly simple loss functions. But AlexNet was when things really started to take off. At this time they were able to train on GPUs, GPUs were available, and there was more data available because of the internet, because of ImageNet. So the era from about 2012 to around 2020 was an era where convolutional networks were basically dominating almost every problem in computer vision.
[00:25:57] Basically, for any kind of problem you wanted to solve with an image in that era, it was almost certainly going to be a convnet that had the best performance on that problem. This included tasks like detection, on the left, which is the task of not just classifying an image but drawing a box around all the objects in the image and putting a category label on each box. Segmentation is the task of assigning labels not at the box level or the image level, but at the pixel level: now we want to assign a category label to every pixel in our image. We'll talk more about architectures for these problems in future lectures, but these can be solved very effectively using convolutional networks.
[00:26:38] People used convnets for other kinds of problems involving language as well. For the task of image captioning, where we want to predict a natural language caption from an image, some of the first widely successful approaches were also built on convolutional networks. And the same goes even for some more recent tasks in generative modeling. Captioning is basically the problem of image-to-text, where we input an image and want to output a natural language sentence describing it. We can also think about the inverse problem of text-to-image generation, where we input a natural language description of something we're imagining in our head and have the system generate a new image from scratch that hopefully matches our input description.
[00:27:23] Some of the first really widely successful versions of this were also built on convolutional networks. This particular figure is from the Stable Diffusion paper that came out back in 2021. This technology has gotten a lot better in the last couple of years, and we'll talk more about that in some later lectures, but it's useful to point out that the first versions of this that started to work really well were also built on convolutional networks. So basically, convolutional networks were so important to the history of computer vision that the initial version of this class, which we started way back in 2015, was actually called Convolutional Neural Networks for Visual Recognition, because at the time convolutional networks were basically synonymous with computer vision.
[00:28:06] And computer vision was basically the biggest field benefiting from deep learning at that time. So in setting out to teach a class about deep learning, it actually made a lot of sense to focus entirely on convolutional networks for image problems. That's basically the inception of this class 10 years ago. But the field has evolved a lot since then, right? Convolutional networks have actually gotten replaced, and beyond visual recognition there are a lot of other interesting problems we can solve now. So you'll notice that the name of the class changed at some point along the way, to no longer focus so specifically on convolutional networks. And the reason for that is, you know, I said this was the era from 2012 to 2020.
[00:28:46] You might be wondering what happened in 2020, other than COVID, that could have displaced convolutional networks. It wasn't COVID; it was transformers. Transformers are this alternate neural network architecture that we'll talk about in a couple more lectures, but basically they started off in natural language processing, for processing documents, for processing text strings. The transformer architecture was first published in 2017, and for a couple of years after that it mainly stayed in the regime of processing text. But there was a really important paper in 2021 that took nearly the exact same transformer architecture that had been getting used to process text strings and instead used it to process images in nearly the exact same way.
[00:29:26] And since then, people have found that for a lot of the problems we just talked about that were previously solved using convolutional networks, you can replace the CNN with a transformer, keep everything else the same, and tend to get better performance on these problems. They scale up to more data, they scale up to more compute. So these are much more commonly used for more and more computer vision problems these days. We'll talk much more about transformers in lecture 8, but I thought it would be weird to be pitching convnets super hard when they actually don't get used quite as much nowadays as they did maybe five years ago. But I still think it's really useful to talk about convolutional networks.
[00:30:06] One, because there's a lot of historical significance. Two, these algorithms still do get used quite a lot in practice. Three, it helps you build intuitions about what's important for images. And four, they're actually not completely dead, right? A lot of times we're actually building hybrid systems: sometimes we use convolutions, sometimes we use transformers, sometimes we mix them together in various ways. So it's actually still super useful to know about this stuff. So then, basically, the rest of today we're going to talk more about convolutional networks. We said that a convolutional network is just a computational graph for processing images that's built from a couple of different primitives. We've already met the fully connected layer and the activation function.
[00:30:44] So we basically need to walk through two more layers: the convolution layer and the pooling layer. Quick recap of the fully connected layer, which we've already talked about in the context of linear classifiers. With our fully connected layer, basically what we do is take the pixels of our image. Our image is this three-dimensional tensor, 32x32x3: 32x32 are the two spatial dimensions, and 3 is the channel dimension for your RGB colors. So we take that 32x32x3 tensor and stretch it out into a long vector of 3072, because that's the number you get if you multiply those out in your head. Then you have this vector of 3072 numbers, and we have a weight matrix that's 3072 by 10 in this case, because 10 is the number of output classes we want.
[00:31:29] You do a matrix-vector multiply between those two, and you end up with a vector of 10 numbers giving us our class scores. And in trying to generalize this from fully connected layers to convolutional layers, it's useful to think a little bit more about the structure of what this fully connected layer is doing. The output vector contains 10 elements. Each of those elements is a single number, and each of those numbers is predicted by computing an inner product between one of the rows of your weight matrix and the entire input vector. So each entry you should basically think of as a dot product, and a dot product you should basically think of as a template match.
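The recap above can be sketched in a few lines of NumPy, with random values standing in for a learned weight matrix and bias. One layout note: the transcript describes the weight matrix as 3072-by-10; here it is stored as 10x3072 so that a plain matrix-vector multiply produces one score per row (per class template), which is the same computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# A CIFAR-style input: 32x32 pixels, 3 RGB channels.
image = rng.random((32, 32, 3))

# Fully connected layer: flatten the image, then one matrix-vector multiply.
x = image.reshape(-1)                # 32 * 32 * 3 = 3072 numbers
W = rng.standard_normal((10, 3072))  # one row (one "template") per class
b = rng.standard_normal(10)

scores = W @ x + b                   # each score: dot(row template, input)
print(scores.shape)                  # (10,)
```

Each of the 10 scores is literally `np.dot(W[k], x) + b[k]`, which is the template-matching view developed next.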
[00:32:07] Because the dot product between two vectors is high when the two vectors point the same way, and it's zero when the two vectors are orthogonal. So anything built on dot products is basically a kind of template matching. The way you should think about these fully connected layers is that we have a set of templates, each template has the same size as our input, and the output is the template-matching score between each one of our templates and the entire input. So once we think about it that way, there's actually a nice way we can generalize this from fully connected layers into convolutional layers. And that's by saying: we're still going to have this notion of template matching.
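The aligned-versus-orthogonal claim is easy to check with two hypothetical 2-D vectors:

```python
import numpy as np

template = np.array([1.0, 0.0])
aligned = np.array([2.0, 0.0])     # points the same way as the template
orthogonal = np.array([0.0, 3.0])  # at 90 degrees to the template

print(np.dot(template, aligned))     # 2.0 -> high score, strong match
print(np.dot(template, orthogonal))  # 0.0 -> no match at all
```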
[00:32:42] We're still going to have this notion of learning a bank of filters, but what we're going to change is that those templates are no longer going to have the same shape as the input. Instead, our filters will only look at a small subset of the input. So, more concretely, rather than stretching out our image into a big vector of 3072 numbers, we're going to maintain the 3D spatial structure of our image. It's going to be a three-dimensional tensor: three channels (sometimes called the depth or channel dimension), width 32, height 32. And now one of our filters is going to be a tiny little sub-image, a tiny low-resolution image, in this case a 5x5 pixel image. And importantly, that small filter needs to have three channels.
[00:33:30] The channels of the filter are always going to span the same number of channels as the input, but the spatial size will be smaller. And now what we're going to do is compute dot products. We think about that small filter as a little image-chunk template, and we're going to slide it everywhere across the image and say, for every point in the image, how much does that sub-part of the image match this template that we're learning in our convolutional filter? So we'll plop that convolutional filter down at some chunk of the image. That 5x5x3 filter will line up with some 5x5x3 chunk of the image at that spatial location. We'll compute an inner product between those two, and that will give us one single scalar number telling us how much that chunk of the image aligns with our template.
[00:34:13] And now we'll repeat that process and slide that template everywhere in our image. Every place we plop down that template, we'll again compute this template-matching score that says how much that piece of the image aligns with that one template. And as we slide that filter everywhere on the input image, we're going to collect all of those template-matching scores into a plane, right? That plane will be two-dimensional: every point in the plane corresponds to how much the corresponding piece of the input image aligned with our filter. But of course, this is deep learning. We want a lot of compute, and how do we get more compute? We have more filters.
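Before adding more filters, the single-filter sliding described above can be sketched as an explicit (deliberately naive) NumPy loop, with random data standing in for a real image and a learned filter:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))        # height x width x channels
filt = rng.standard_normal((5, 5, 3))  # one 5x5x3 filter (the template)

# Slide the filter over every valid position; each stop produces one
# scalar template-matching score (an inner product).
out_h = out_w = 32 - 5 + 1             # 28 valid positions per dimension
response = np.empty((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        chunk = image[i:i + 5, j:j + 5, :]     # 5x5x3 chunk under the filter
        response[i, j] = np.sum(chunk * filt)  # inner product -> one scalar

print(response.shape)                  # (28, 28): one plane of scores
```

Real implementations vectorize this, but the loop makes the "plop down, dot product, slide" picture literal.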
[00:34:55] So now we'll add a second filter and repeat the whole process again with another filter. We had a 5x5x3 filter that we colored in blue; now let's imagine a second filter, colored in green. Our second filter will still be 5x5x3, and we'll repeat the exact same procedure of sliding that green filter everywhere on the image, computing template-matching scores between the green filter and little sub-pieces of the image, and then collecting all of those scores in a second plane telling us, for every point in the image, how much it responded to the green filter. And now we can basically iterate this and add as many filters as we want. So in this case we are drawing six filters, each of them 3x5x5.
[00:35:43] So then we can collect all of those filters into a single four-dimensional tensor. That four-dimensional tensor has six as its leading dimension, because we have six filters, and then 3x5x5 is the shape of each image template, the chunk that we're learning. And now the convolution layer takes as input our three-dimensional image and our four-dimensional bank of filters, slides all the filters everywhere in the image, and gives us these response planes. Once we collect all those response planes and stack them up along a third dimension, our output has size 6x28x28, where 28x28 should be interpreted as spatial dimensions and the six is a channel dimension.
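The sliding and stacking just described can be sketched directly as an unoptimized loop; this is an illustrative sketch, not the lecture's code, with random values standing in for learned filters:

```python
import numpy as np

image = np.random.randn(3, 32, 32)      # a 3x32x32 RGB input image
filters = np.random.randn(6, 3, 5, 5)   # bank of six 3x5x5 filters

# Slide every filter over every spatial position (stride 1, no padding).
out = np.zeros((6, 28, 28))
for f in range(6):
    for i in range(28):
        for j in range(28):
            # Template-matching score: inner product of the filter with
            # the 3x5x5 chunk of the image underneath it.
            out[f, i, j] = np.sum(filters[f] * image[:, i:i+5, j:j+5])

print(out.shape)  # (6, 28, 28): six response planes stacked on the channel axis
```

Real frameworks compute the same thing with fast vectorized kernels; the triple loop is only there to make the sliding explicit.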
[00:36:27] And of course, just as we do with linear layers, we'll often add a learnable bias vector to our convolutional layers. In a linear layer, the bias is one scalar per row of the weight matrix; correspondingly, in a convolutional layer we'll typically have one scalar bias value for every one of our convolutional filters. That means we'll have a six-dimensional bias vector in this setting. Yeah, the question was clarifying that the three is the RGB channels; that's correct. The question is how do you get the filters: back to the miracle of gradient descent and backpropagation. The idea is that we're defining this operator.
[00:37:08] This operator is going to take an input image and a set of filters, but no human is going to define what those filters are. Instead, we're going to initialize those filters randomly, and then they'll be learned via gradient descent on whatever problem you're trying to solve. That's actually a really important thing to keep in mind, and it's what gives these layers their power: we're defining this fairly computationally expensive layer, but we're expecting that it'll be filled in with the data and compute from our training. The question is how do you set the five? That's a hyperparameter. We talked about hyperparameters and cross-validation a couple of lectures ago; these would be architectural hyperparameters that you would typically set via cross-validation in some way.
[00:37:47] Yeah, good question: does it make sense to have different sizes of filters? As we'll see in the CNN architectures lecture next lecture, I think you're going to talk about Inception, and sometimes you actually do have that. But there's kind of a nice API design problem here: what is a primitive in your computational graph, versus what is an emergent structure built out of primitives? We usually define a single convolutional layer as having a fixed filter size, because that makes it easier to compute and to write efficient GPU kernels.
[00:38:18] But you can effectively have multiple filter sizes by stitching together a computational graph that combines convolution layers with different filter sizes in a larger network structure. So yes and no is the answer to your question. The question is: what are we learning? It's very important to distinguish between a parameter and a hyperparameter. A hyperparameter is something that we set before we start training the network. In this case, the hyperparameters would be the number of filters and the size of those filters, because those set the shapes of our tensors. A parameter is a value that we're going to set and optimize over the course of gradient descent. So the number of filters, the number of output channels, and the size of those filters will be hyperparameters.
[00:39:01] We set those once before we start training. At the beginning of training, we'll randomly initialize the filters; that gives us a fixed-shape tensor, and then the values inside that tensor will float around and change over the course of optimization. Those are parameters, because they get set via gradient descent. Yes, the question is: what gradient are we computing? Whenever you do backpropagation, you're always computing the gradient of the loss with respect to things inside the network. In this case, we'll be computing the gradient of the loss with respect to our convolutional filter weights.
[00:39:35] So the gradient is saying: for every individual scalar inside every one of our filters, if we wiggle that scalar a little bit, how much is the loss going to change? We're always computing the gradient of the loss with respect to our convolutional filters. The question is basically: what do we do with the bias? The bias gets added to each of our inner products. We always compute the inner product of one of our filters against a chunk of the image, and then add the corresponding scalar from the bias. The bias is a vector, but the number of entries in the vector is equal to the number of filters, so each entry in the bias gets broadcast across the entire spatial dimension of the output.
[00:40:17] But each bias only gets used for one filter. Conceptually: you slide one filter everywhere, and that gives us a two-dimensional plane of activations; if you have a second filter, you get a second plane of activations. Those are independent operators: step one, slide the first filter everywhere; step two, slide the second filter everywhere. Every filter gives rise to a plane that we call an activation map, and then we stack all of those up, and that's the operation of the convolution layer. The question is, yeah, basically every time we do gradient descent, it's going to change the filters, right?
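The "wiggle that scalar a little bit" picture can be checked numerically with a finite difference; the filter, patch, and loss below are made up for illustration and are not the lecture's network:

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal((3, 5, 5))   # one 3x5x5 chunk of an image
w = rng.standard_normal((3, 5, 5))       # one convolutional filter
target = 1.0                             # made-up regression target

def loss(w):
    # One template-matching score, squared error against the target.
    score = float(np.sum(w * patch))
    return (score - target) ** 2

# Analytic gradient of the loss w.r.t. every scalar in the filter.
grad = 2.0 * (float(np.sum(w * patch)) - target) * patch

# Wiggle a single scalar by epsilon and see how much the loss changes.
eps = 1e-6
w_plus = w.copy()
w_plus[0, 0, 0] += eps
numeric = (loss(w_plus) - loss(w)) / eps

print(abs(numeric - grad[0, 0, 0]) < 1e-3)  # True: the two slopes agree
```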
[00:40:55] So whenever you imagine training a neural network, it's always this loop: while true, get a batch of data, send your data through the network in a forward pass, compute the loss, do a backward pass to compute the gradient of the loss, and then make a gradient step using your optimizer. It's always data, forward, loss, backward, step, and every time you do a step, it makes a change to the filters.
[00:41:16] All right, so I swung the other way: I said more questions, and I got too many questions. But that's good; we'll equalize in here. Okay, so we talked about the convolution layer. It's actually pretty common to run the convolution layer in batched mode. Rather than working on one input image, we'll actually work on a batch of input images. This is kind of nice, because it makes everything four-dimensional.
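That data / forward / loss / backward / step loop can be sketched on a one-parameter toy model (purely illustrative; a real network would use an autograd framework rather than a hand-written gradient):

```python
# Toy training loop: fit w so that w * x approximates y = 3 * x.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]  # made-up dataset
w = 0.0          # randomly-initialized "filter" (here just one scalar)
lr = 0.02        # learning rate for the gradient step

for epoch in range(200):
    for x, y in data:              # get a batch of data
        pred = w * x               # forward pass
        loss = (pred - y) ** 2     # compute loss
        grad = 2 * (pred - y) * x  # backward pass: dloss/dw
        w -= lr * grad             # gradient step: changes the "filter"

print(round(w, 3))  # 3.0: the loop has driven w to the right value
```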
[00:41:37] Now we have a four-dimensional tensor of inputs, which is a set of input images. We have a four-dimensional tensor of filters, which is a set of filters, each of which is a three-dimensional chunk of an image. And the output is a four-dimensional tensor which is a set of outputs, one output per image; each image's output is a three-dimensional tensor giving a stack of feature planes. You have to start to think in lots of dimensions when you build neural networks, and that's actually kind of fun. So here's the general formulation of a convolution layer. In general, you take as input a four-dimensional tensor of shape N x C_in x H x W, which is a set of N images, each of which has C_in channels.
[00:42:19] For an RGB image that'll be three, but in general we might have more than three channels; it can be arbitrary. H and W are the spatial size of our input images. Our convolutional filters will be a four-dimensional tensor of shape C_out x C_in x KW x KH. C_out is the number of filters, the number of output channels, and the rest is a set of three-dimensional filters: each three-dimensional filter has shape C_in x KW x KH, where KW and KH are the kernel width and kernel height. We have C_out such filters, collected into a four-dimensional tensor. As output, we produce a four-dimensional tensor again, whose shape is N for the number of images, one output per image, by C_out: each output consists of C_out feature planes, one per filter.
[00:43:11] Each of those planes is going to be H' x W'. And this is the general formulation of a convolution layer.
[00:43:19] A convolutional network is then just a computational graph that includes a bunch of convolution layers. In practice, we'll tend to stack up a bunch of convolutional operators one after another, and that stack will be a convolutional network. So this was a simple convnet: we start with an image that's 3x32x32, then we have a convolution layer with six filters, each of which is 5x5x3. The first convolution gives us a new three-dimensional set of activations for that one image: six channels, matching the six filters, by 28x28, because the spatial size changed a little bit through the convolution.
[00:43:58] Then we have another convolution with 10 filters, each of which is 5x5x6. The 10 gives us the number of output channels in the next layer, and the six is the number of input channels, which needs to match the channel dimension of the convolution's input. So you can see that you can stack up a bunch of these convolution layers and perform a lot of computation. But there's actually a problem in exactly this network architecture design, and can anybody spot it? Sizing? That's a problem, but not the one I had in mind. They're local? That's another good problem, but not the one I had in mind. Actually, those two we'll be able to fix pretty easily in a couple of slides, but I had a different problem in mind. A lot of memory?
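The shapes in the example above follow from the stride-1, no-padding output-size rule H' = H - K + 1; the lecture uses these numbers but doesn't state the formula, so it's spelled out here as an assumption:

```python
# Shape bookkeeping for the convolution layer described above, assuming
# stride 1 and no padding.
def conv_shapes(n, c_in, h, w, c_out, kh, kw):
    # Input:   (n, c_in, h, w)        -- a batch of n images
    # Filters: (c_out, c_in, kh, kw)  -- c_out three-dimensional templates
    # Output:  (n, c_out, h', w')     -- one stack of c_out planes per image
    return (n, c_out, h - kh + 1, w - kw + 1)

# The simple convnet above: a 3x32x32 image through six 5x5x3 filters,
# then ten 5x5x6 filters.
print(conv_shapes(1, 3, 32, 32, 6, 5, 5))   # (1, 6, 28, 28)
print(conv_shapes(1, 6, 28, 28, 10, 5, 5))  # (1, 10, 24, 24)
```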
[00:42:40] That is a problem, but not one we can fix; you've just got to buy a bigger GPU. The number of filters increases? I don't think that's necessarily a problem; that's okay. Ah, everything's linear. Yes, that is a problem. We said that convolution was dot products; the dot product is a linear operator, and the composition of two linear operators is still a linear operator. That means that if we have two convolution layers stacked directly on top of each other, they actually have the same representational power as a single convolution layer, because of the linearity of the operator. There's actually a very simple fix: add an activation function. Exactly. It's the same bug that we saw in multi-layer neural networks, and the same fix: we need to add a nonlinear activation function in between our convolutional layers.
[00:45:23] This introduces nonlinearity into the network architecture and increases the representational power of the network that we're learning. So in general, convnets are going to be some stack of convolution layers, nonlinearities, and other kinds of layers in our computational graph. There was a question earlier about what the convolutional filters learn. We can view this by analogy with what we already did with linear classifiers. In linear classifiers, we had the intuition that each row of the learned weight matrix could be visualized, could be thought of, as a template that has the same shape as the whole input image.
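The linearity problem just described can be demonstrated with plain matrices, since convolution is itself a linear operator and the same algebra applies; a small illustrative numpy sketch, with matrices standing in for conv layers:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 4))   # stand-in for the first conv layer
W2 = rng.standard_normal((4, 4))   # stand-in for the second conv layer
x = rng.standard_normal(4)

# Two stacked linear layers collapse into the single linear layer W2 @ W1,
# so stacking them adds no representational power.
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True

# With a ReLU in between, the composition is no longer a single matrix:
relu = lambda v: np.maximum(v, 0.0)
nonlinear = W2 @ relu(W1 @ x)  # differs from one_layer whenever W1 @ x has negatives
```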
[00:45:58] Now with a convolutional filter, you can think of it the same way, but each filter, rather than extending over the entire spatial size of the input image, is just a small sub-chunk of an image. So we can actually visualize the first-layer convolution filters of a trained neural network. These are the first-layer convolution filters learned by an AlexNet architecture that was trained for image classification on ImageNet. Each of these is basically a little chunk of an RGB image; these are the little templates that get slid around the input image in the first layer of the AlexNet architecture.
[00:46:34] And, you know, the fact that this was AlexNet, the fact that this was trained on ImageNet, the fact that this was classification: it turns out just about all convolutional networks end up learning filters that look something like this, on almost all problems, datasets, and tasks, as long as they're reasonable tasks. The thing we see is that we often learn two kinds of filters here. One kind tends to be looking for colors, especially opposing colors: you'll see this one is looking for a contrast between green and red, and we also see colored blobs, like pink and green blobs. The other category of filter we tend to see is looking for the spatial structure of the images: this one is looking for a vertical edge, this one a horizontal edge.
[00:47:16] Some of these are looking for diagonal edges. So they tend to look for colors and edges in these little local neighborhoods of our input images. We can play this trick on the first layer of the convolutional network and just visualize the filters directly as images. It gets a little trickier to visualize the higher layers in the network, and I'm just going to present this figure without too much explanation. But higher layers of the network tend to learn larger spatial structures of our input image. Here the visualization is: each row represents a filter in a learned network, and each column represents some piece of an input image that that filter was responding strongly to. So the visualization here is a bit different from the previous slide.
[00:48:01] So these are all basically chunks of input images that a filter was responding to. And here you can see that in this sixth-layer convolution, one of these filters feels like it's responding maybe to eyes. This one looks like maybe it's responding to pieces of text. This one looks like maybe it's responding to wheels, or circles, or top halves of circles, something like that. And again, this all gets driven by training on your large-scale data sets via gradient descent. Nobody is sitting down and designing these filters by hand. And like I said, visualizing these higher-layer filters is a bit tricky and more involved. The question was: if you look at all the responses to the filters, can you reconstruct the original image? Actually, it turns out you can do that.
[00:48:43] And the way that you do that is also gradient descent. Gradient descent is really powerful, and that's something we'll talk about in a couple more lectures, some mechanisms that do that. Oh, that's a good question: how do the filters get differentiated? That actually comes down to the random initialization, right? So it's really important that the way you initialize your filters is random, and crucially that you have a different initialization for each filter when you start training your network, because that's going to break the symmetry between the filters. If all the filters are exactly the same and the loss is the same, then the gradient is going to broadcast back and be the same on all the filters. So if you initialize them the same, they will stay the same.
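The symmetry argument above can be made concrete with a tiny toy model (an editor's sketch, not code from the lecture): two "filters" whose responses are summed receive identical gradients, so identical initialization leaves them identical forever, while random initialization keeps them distinct.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)      # one fixed toy input
target = 1.0

def step(w1, w2, lr=0.01):
    # Toy model: output is the sum of two filter responses, squared-error loss.
    y = w1 @ x + w2 @ x
    grad_y = 2.0 * (y - target)
    # Both filters see the same upstream gradient and the same input,
    # so their gradient updates are identical.
    return w1 - lr * grad_y * x, w2 - lr * grad_y * x

# Identical initialization: the filters stay identical after training.
a1 = a2 = np.ones(5)
for _ in range(20):
    a1, a2 = step(a1, a2)
print(np.allclose(a1, a2))      # True: symmetry never breaks

# Random (different) initialization: the filters remain distinct.
b1, b2 = rng.standard_normal(5), rng.standard_normal(5)
for _ in range(20):
    b1, b2 = step(b1, b2)
print(np.allclose(b1, b2))      # False
```

Note that the updates themselves are always identical here; only the starting points differ, which is exactly why the initialization carries the symmetry breaking.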
[00:49:19] But if you initialize them to be different, then you'll break the symmetry and they can learn different features. Yeah, basically the human designer of the network needs to write down the sequence of operators and the sequence of channels, and that's the question of neural network architecture design that we'll talk a little bit more about in the next lecture. Good question: why do the deeper layers visualize larger structures? That actually has a bit to do with receptive fields, which we have a slide on in a little bit, so maybe we'll get there and some of these questions will get answered. So one thing that already came up is how we look at the spatial dimensions of these convolutions.
[00:49:55] So I wanted to take a closer look at exactly how we compute the spatial dimensions of our convolutions. In this case, we've taken this picture of a convolution, rotated it 90 degrees, and dropped the channel dimension. So now the channel dimension is going into the board, and we have our 7x7 spatial dimensions. So here we're looking at an input that's 7x7 in spatial size, and we have a 3x3 conv kernel. Then the question is: how big is our output going to be? Well, 1, 2, 3, 4, 5, right? So our output is going to be 5x5, because we can slide that filter and plop it down in five different places. And then we can generalize: if our input has length W and our conv filter has length K, then our output is going to be W - K + 1.
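As a quick sanity check of that formula (an editor's sketch, not code from the lecture), we can count the valid filter positions directly:

```python
# Output spatial size of a convolution with no padding and stride 1:
# a length-K filter fits at W - K + 1 positions along a length-W input.
def conv_out_size(w: int, k: int) -> int:
    return w - k + 1

# Brute-force count of valid placements agrees with the formula.
def count_positions(w: int, k: int) -> int:
    return sum(1 for start in range(w) if start + k <= w)

print(conv_out_size(7, 3))      # 5, the 7x7 input / 3x3 kernel example
print(count_positions(7, 3))    # 5
```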
[00:50:38] And you can sit down and convince yourself that that's the right formula. But there's kind of a problem that a couple people already pointed out: your feature maps are going to shrink in spatial size as you go through this convolution. That's kind of annoying. You could actually work with that, and there are some neural network architectures that deal with it. But sometimes we're lazy and we just want to keep the same size for everything, because that's basically simpler for human designers to think about. And one trick we use there is something called padding. So here it's common to add additional virtual data around your true input data, basically extra zeros, before you compute the convolution operator.
[00:51:21] And this basically lets us solve the shrinking-feature-map problem, because now if we add padding of P, in this case padding P = 1, so we're adding one pixel of zeros all around everywhere, then we basically add 2P to our output size. So in particular, if you have a 3x3 conv and you add padding of one, then your feature map is going to stay the same size, and that's convenient.
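In terms of sizes, the padding trick looks like this (a sketch assuming P zeros of padding on every side):

```python
# With P zeros added on every side, the effective input length is W + 2P,
# so the output size is (W + 2P) - K + 1. Choosing P = (K - 1) // 2 for an
# odd kernel size K gives "same" output size.
def conv_out_size(w: int, k: int, p: int = 0) -> int:
    return w + 2 * p - k + 1

print(conv_out_size(7, 3, p=1))    # 7: a 3x3 conv with padding 1 keeps the size
print(conv_out_size(32, 5, p=2))   # 32: same idea with a 5x5 kernel
```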
[00:51:47] Now, if you've taken signal processing, there actually are some problems here, right? This can lead to weirdness from a signal processing perspective, but we'll ignore that and just look at the sizes and shapes of the tensors, because that's a little bit easier to comprehend. But be aware: why are we putting in zeros, and is that going to cause problems? Yes, it is going to cause problems on the borders, but it seems to be okay in a lot of cases. Okay, so then, like I said, a pretty common setting is to have K be an odd number and then have P be (K - 1) / 2, because that's going to mean that your spatial size after the convolution is the same as the spatial size before the convolution. Okay. Then the next interesting thing to think about is this notion of receptive fields.
[00:52:30] Someone was asking over here why the deeper layers learn larger structures. That's actually sort of inherent in the way that convolutions are built. So thinking about a single convolution, each output is looking at a local region of the input. So by design, the output of one convolution at the first layer can only be looking at a piece of the image which is the same size as the convolutional kernel you're learning. But if we build a convnet that stacks multiple convolutions on top of each other, then these receptive fields get magnified through the network. So in this case we're looking at a network with three convolution layers, and we see that in the final layer of activations, each entry depends on a local region in the layer before it.
[00:53:16] But each one of those entries depends in turn on a local region in the layer before it, which depends in turn on a local region in the layer before that. So even though each individual convolution is looking at a local neighborhood in the layer before it, as you stack up convolutions in a bunch of layers, the effective size of the original input that each of those convolutions is looking at grows over the course of the network. And in particular, we call this the effective receptive field. So the effective receptive field of a convolution is basically how many pixels in the original image had the opportunity to influence one activation of the network later on downstream.
[00:53:56] And you'll notice that this effective receptive field basically grows linearly with the number of convolution layers. And there's a potential problem here, because ultimately, when we make classification decisions at the end of our network, we would like those decisions to aggregate global information across the entire image, but you need a lot of conv layers to do that. So a trick there is basically to add some way to increase effective receptive fields more quickly. One way we can do this in convolution is by introducing something called a stride. So here what we're saying is that rather than placing the filter everywhere in the image, we're going to skip some positions. Instead of moving the receptive field by one, we're going to stride it by two instead.
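To see why striding helps, here's a rough sketch (an editor's illustration assuming 3x3 kernels, not from the slides) comparing receptive-field growth with and without stride-2 downsampling at every layer:

```python
# Receptive field (RF) of one output pixel, in input pixels.
# Stride 1: each 3x3 layer adds K - 1 = 2 pixels, so growth is linear in depth.
def rf_stride1(layers: int, k: int = 3) -> int:
    return 1 + layers * (k - 1)

# Stride 2 at every layer: the step between neighboring outputs doubles
# each layer, so the RF grows roughly exponentially in depth.
def rf_stride2(layers: int, k: int = 3) -> int:
    rf, jump = 1, 1                 # jump = input pixels between adjacent outputs
    for _ in range(layers):
        rf += (k - 1) * jump
        jump *= 2                   # stride 2 doubles the step each layer
    return rf

print([rf_stride1(n) for n in (1, 2, 4, 8)])   # [3, 5, 9, 17]
print([rf_stride2(n) for n in (1, 2, 4, 8)])   # [3, 7, 31, 511]
```

With only eight stride-2 layers, the receptive field already covers hundreds of input pixels, which is the point the lecture makes next.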
[00:54:39] So now in this case, we go back to our 7x7 input, 3x3 conv, and do a stride of two. Now what's the output size? 1, 2, 3: 3x3. And then in general, if we have our input W, filter size K, padding of P, and stride S, then we get this kind of ugly formula for the size of the output: (W - K + 2P) / S + 1. Bigger kernels shrink the input, padding adds back some of the missing size, the stride divides the input shape, and then plus one because of some fence-post math. Okay. So strided convolutions are interesting, because if you go back to this picture, when we do a strided convolution it's effectively downsampling the image inside the neural network.
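That "ugly formula" is easy to write down and check (a sketch; floor division handles the fence-post arithmetic):

```python
# General output size with kernel K, padding P, stride S:
# floor((W - K + 2P) / S) + 1.
def conv_out_size(w: int, k: int, p: int = 0, s: int = 1) -> int:
    return (w - k + 2 * p) // s + 1

print(conv_out_size(7, 3, p=0, s=2))   # 3: the 7x7 input, 3x3 conv, stride-2 case
print(conv_out_size(7, 3, p=0, s=1))   # 5: stride 1 recovers W - K + 1
```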
[00:55:27] So when we have a strided convolution, each conv layer is effectively dividing the shape of the feature map, usually by two, and when we stack these, that means you can get exponential growth in the effective receptive field. So if you stack a bunch of conv layers and each of those layers is downsampling by a factor of two, then if you run through a similar exercise, you'll see that the effective receptive field is now growing exponentially in the depth of the network. So that means that with relatively few layers, we can build up a very large effective receptive field that looks at the entire input image. Okay, so here let's work through just one example to make sure that we're all on the same page about convolution. So let's think about an input volume of 3x32x32, and a convolution layer with 10 filters.
[00:56:06] Each of those filters is 5x5, with stride one and pad two. What's the size of the output? I color-coded it because there are a lot of numbers here to keep track of. So here it's 10x32x32. This 32 is actually a different 32 than that 32; that's why they're different colors of blue. This 10 is the number of output channels, and the number of output channels has to match the number of filters. And the spatial size is computed using the formula we just saw. So the input spatial size comes down here; the padding adds to the spatial size; the kernel size of five shrinks the spatial size; the stride is one, so that's trivial; and then add one. And this just so happens to come out to 32.
[00:56:51] So in this case, this follows the same pattern we talked about a couple slides ago, where it's an odd-sized convolutional kernel, in this case five, and the padding is 2. So if the kernel size is 2k + 1, then padding of k means we maintain the same spatial size. Number of learnable parameters: maybe I'll just go through these, because we have a couple more slides to get through. So in this case, the number of learnable parameters is 760, because each filter is basically 3x5x5, plus one for the bias, so we have 76 learnable parameters per filter; we have 10 filters, so it's 760 learnable parameters here. We can also compute the number of multiply-add operations: how much compute does this convolution operator take? So here it's a lot. Well, is it a lot?
[00:57:40] I don't know. You may or may not have a lot of intuition for what is a lot of computation. But in this case, the way I think about computing how many flops, how much compute, a convolution operator takes: we think about the output volume size, which is 10x32x32. And we know that each entry in that output volume was computed via a dot product, in particular between one of our filters and a chunk of our input. So in this case we know the total flops, because the number of outputs is 10x32x32, which is about 10,000, and each of those outputs is computed via a dot product of a 3x5x5 filter with a 3x5x5 chunk of the image, which is 75 elements. So multiplying those together means it takes about 768,000 floating-point multiply-add operations. Okay.
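The whole worked example can be checked in a few lines (a sketch of the arithmetic, not lecture code):

```python
# Worked example: 3x32x32 input, 10 filters of size 3x5x5, stride 1, padding 2.
c_in, w_in = 3, 32
n_filters, k, stride, pad = 10, 5, 1, 2

# Output spatial size: (W - K + 2P) / S + 1.
w_out = (w_in - k + 2 * pad) // stride + 1
print(w_out)                                   # 32: "same" padding preserves size

# Parameters: each filter has c_in * k * k weights plus one bias.
params = n_filters * (c_in * k * k + 1)
print(params)                                  # 760

# Multiply-adds: one dot product of length c_in * k * k per output element.
macs = (n_filters * w_out * w_out) * (c_in * k * k)
print(macs)                                    # 768000
```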
[00:58:27] So then here's the one-slide summary of convolution. I'm not going to walk through this; it's more for you to look at later, but it summarizes all the hyperparameters and formulas associated with convolution layers. If you look in PyTorch, the deep learning framework that a lot of people use, you'll see this convolution layer has all these hyperparameters that we talked about. There are a couple of other interesting hyperparameters that we didn't talk about, called groups and dilation. Dilation isn't really used so much anymore; groups still get used sometimes. Maybe we'll talk about those in a later lecture. You can have other kinds of convolutions too. So we talked about 2D convolution; we can also do 1D convolution,
[00:59:09] where rather than having a two-dimensional signal that we slide a filter over, we now have a one-dimensional signal that we slide a filter over with one degree of freedom; or a three-dimensional convolution, where we have a three-dimensional signal and a three-dimensional filter, and you can slide that filter everywhere in 3D space to convolve with the input signal. So this idea of a convolution really extends beyond just two-dimensional images. Okay, that's basically all about convolution. The last one is pooling. Thankfully, pooling is pretty simple. So pooling layers are basically another way to downsample inside of your neural network. We saw that strided convolution is one way that we can downsample inside of a neural network.
[00:59:48] And downsampling is useful because it lets us build up receptive fields more quickly as we go through the depth of the network. But convolution actually still costs quite a lot of computation; convolution is where most of the flops, most of the compute, happens in a convolutional network. And pooling layers are basically a way to downsample that's very cheap, that doesn't cost a lot of compute. The idea in a pooling layer is: given our three-dimensional tensor, in this case 64x112x112, you should think about that as a three-dimensional volume of features where the spatial size is 112x112 and we have 64 planes, 64 channels of activation. And each one of those planes is a 112x112 image.
[01:00:30] What we're going to do is take each one of those individual feature planes, pull it out from our input tensor, downsample them independently, and then restack them to compute the output. So for this input of 64×224×224, we're going to pull out each of those 224×224 planes independently, downsample it, and then restack them to give the same number of channels but a changed spatial size. What is the method we use for downsampling? Great question. That's actually a hyperparameter; there are a couple of different mechanisms of downsampling that we use. One of the most common ones is called max pooling. In max pooling, what we're going to do is take our single depth slice and divide it up into non-overlapping regions.
[01:01:13] In this case, these are 2×2, and we use the same terminology to talk about these as we do with convolution. So we could say this is a kernel size of 2×2 with a stride of two, because that divides our input into these non-overlapping 2×2 tiles. Then within each of those non-overlapping 2×2 tiles, we take the max entry; in this case that's 6, 8, 3, 4. So you take the max entry inside each of those, and that gives us our spatial compression. And you can imagine a whole set of hyperparameters here: you can change the kernel size, you can change the stride, and you can also change the function that we use for downsampling. Max pooling is pretty common; you'll also see average pooling, and you'll also see anti-aliased downsampling sometimes.
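The 2×2, stride-2 max pooling just described can be sketched in a few lines of plain Python (the input values below are the worked example from the slide, whose maxima are 6, 8, 3, 4):

```python
def max_pool2d(x, k=2, stride=2):
    """Max-pool a single 2D feature plane (a list of lists) with a
    k x k window and the given stride, no padding."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h - k + 1, stride):
        row = []
        for j in range(0, w - k + 1, stride):
            # Max over one non-overlapping k x k tile.
            row.append(max(x[r][c] for r in range(i, i + k)
                                   for c in range(j, j + k)))
        out.append(row)
    return out

plane = [[1, 1, 2, 4],
         [5, 6, 7, 8],
         [3, 2, 1, 0],
         [1, 2, 3, 4]]
print(max_pool2d(plane))  # [[6, 8], [3, 4]]
```

A real implementation would apply this to each of the 64 channel planes independently and restack them, exactly as described above.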
[01:01:57] These are all just ways that you can downsample these feature maps one at a time. Good question: do we make use of padding? Typically you do not use padding inside of pooling layers. There's nothing mathematically preventing you from doing so, but in the case of max pooling it would be kind of silly; it's basically equivalent to a ReLU, so whenever you're using max pooling, if you're also using a ReLU, that would be redundant. So typically we don't use padding in pooling layers. I'm actually not sure if PyTorch has a flag for padding in pooling layers. Yeah, so the stride would be another one of these architectural hyperparameters, but usually you don't tune these things too much.
[01:02:34] Usually the intuition behind a pooling layer, honestly the most common one, is "I want to downsample everything by a factor of two"; that is by far the most common operation. So the most common thing to do would be 2×2 with stride 2. Sometimes you'll do 4×4 with stride two, but basically the most common setting by far is downsampling everything by a factor of exactly two. Oh, that's a very good question: do images all have to be the same input size? In all the language that we're talking about so far, yes. You're going to run into big problems if your input images are not the same size. So the things that you'll typically do to fix that are: one, you resize all your images to the exact same size before you batch them to feed to the network.
[01:03:17] Sometimes you'll also pad your images out with zeros or some other value to make them all the same size, but now padded rather than warped. Or you basically need to run these layers independently for images of different aspect ratios. Another thing that you'll see sometimes in more sophisticated training setups is what's known as aspect ratio bucketing: from your training data, you bucket the images into different aspect ratios, and then each forward-backward pass of the network is on a batch of images of the same resolution and aspect ratio, but each iteration you might grab images with different resolutions or aspect ratios. That's something you'll see in some of the larger production systems. Yeah, so the question is: where do you put these pooling layers?
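Stepping back to the aspect-ratio bucketing just mentioned: a minimal sketch of the grouping step. The bucket values and image shapes here are made up for illustration, not from the lecture:

```python
from collections import defaultdict

def bucket_by_aspect(shapes, buckets=(0.5, 1.0, 2.0)):
    """Toy aspect-ratio bucketing: assign each image's (height, width)
    to the nearest predefined width/height aspect-ratio bucket, so a
    batch can later be drawn entirely from one bucket."""
    groups = defaultdict(list)
    for h, w in shapes:
        aspect = w / h
        nearest = min(buckets, key=lambda b: abs(b - aspect))
        groups[nearest].append((h, w))
    return dict(groups)

shapes = [(256, 256), (240, 480), (512, 260), (300, 300)]
groups = bucket_by_aspect(shapes)
print(groups)
```

A real training loop would also resize every image within a bucket to a common resolution before batching, as described above.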
[01:03:54] These are usually interspersed with the convolution layers. A pretty common pattern for convnets is to intersperse convolution and pooling. For example, you'll see something like conv, pool, conv, conv, pool, conv, conv, pool, fully connected, fully connected; that's kind of a prototypical convolutional network. Yes, that's an excellent question: does this introduce nonlinearity? It depends on the type of pooling operation that you're using. If you're doing max pooling, that's a nonlinearity, so in some networks, if you have max pooling, you may not use a ReLU around that convolution, because max pooling is a nonlinearity itself. If it's average pooling, that's a linear operator, so if you do average pooling, you probably still would want a ReLU there. Okay, so here's my quick one-slide summary of pooling.
[01:04:41] It's basically the same hyperparameters as convolution, except you've got this extra pooling function, which is the mechanism you're using to do the downsampling. Then the last thing I wanted to mention is this notion of translation equivariance. What the hell is that? I said at the beginning of the lecture that we wanted operators that respect the spatial structure of our images, right? And we have this notion that flattening our images into big vectors is somehow not respecting that spatial structure. There's a really interesting property shared by both convolution and pooling, which is one way to formalize this notion of them respecting the 2D spatial structure of images, and that's this notion of translation equivariance.
[01:05:25] It sounds pretty crazy, but the idea is that we can imagine two different branches. Along one branch, we take our image, do a convolution or pooling operator to get an updated image, and then translate the result by shifting that feature map to the side. Then you could imagine changing the order of these two things instead: first translate the image, and then do our convolution or pooling operator on top of the translated image. And it just so happens that in this case the order doesn't matter. If you translate and then convolve, you get the same result as if you had done convolution and then translation, subject to some boundary conditions, blah blah blah.
[01:06:07] Like, in the limit of infinitely large images, ignoring some of these technical conditions. It's really interesting that you can actually swap the order of translation in space versus performing these downsampling or convolution operators. And that bakes in an important intuition about images, which is that when we're processing images, the features that we extract from an image should only depend on the content of the image, and should not depend on the absolute location in the image that content came from. So that means that, you know, if I'm looking this way, it looks like people and benches; if I'm looking this way, it looks like people and benches.
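The swap-the-order claim is easy to check numerically. Here is a toy 1D check (not from the lecture); zero-filled shifting stands in for the "infinitely large image" boundary condition, since the nonzero signal stays away from the edges:

```python
def conv1d(x, w):
    """'Valid' 1D cross-correlation of signal x with filter w."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def shift(x, s):
    """Translate a signal right by s samples, zero-filling on the left."""
    return [0] * s + x[:len(x) - s]

x = [0, 0, 1, 2, 3, 0, 0, 0, 0]
w = [1, -1]

a = shift(conv1d(x, w), 2)   # convolve, then translate
b = conv1d(shift(x, 2), w)   # translate, then convolve
print(a)
print(b)  # identical: the two branches of the diagram commute
```

If the signal ran right up to the boundary, the two branches would differ at the edges; that is exactly the technical condition being waved away above.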
[01:06:46] And whether it's over here on my right or over here on my left, I want to process that data in the exact same way. That's an important intuition, an important structure of images and of the kind of 2D data that we're processing. This notion of translation equivariance is basically a way to mathematically describe how that structure is baked into these operators. So this is kind of interesting: it's a way that we can build our intuition about how images ought to be processed into the design of our operators, not into the design of our feature extraction methods, as we saw at the beginning. The question is: why do you do a translation? You don't; this is not something you're actually going to do. This is basically a mathematical curiosity, right?
[01:07:27] To be clear, you should not generally do this inside of your neural networks. It's interesting to note that this happens to be true, but you would not do this inside of your neural networks. And if you were a mathematician, you would call this a commutative diagram, and mathematicians love those things. Okay, so that's basically the summary of today. We talked about convolutional networks, we talked about why they're interesting, and we talked about these two new operators of convolution and pooling. Next lecture we'll see how to stitch those together into CNN architectures. See you next time for that.

================================================================================ LECTURE 006 ================================================================================

Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 6: CNN Architectures

Source: https://www.youtube.com/watch?v=aVJy4O5TOk8

--- Transcript

[00:00:05] Hi everyone. My name is Zane.
[00:00:09] I realized I actually didn't introduce myself in the first lecture I gave, which was lecture three, but I'm one of the co-instructors for the course. My name is Zane Durante. I'm co-advised by Ehsan and Fei-Fei, and I'm a fourth-year PhD student at Stanford. In this lecture today, lecture six, we'll be talking about training convolutional neural networks and also CNN architectures. I would say this lecture is really broken up into two different components. The first one is telling you how to piece together all of the different building blocks that we've learned, like convolutional layers and linear (fully connected) layers, to create a CNN architecture. We'll go through some examples, and then we'll talk about how you actually train these and all the steps involved there.
[00:00:56] So as I mentioned before, we'll have basically two different topics. The first one is how to build CNNs, and by this I mean how you actually define your CNN architecture to set it up to be trained; the second set of topics today is how you train CNNs. Starting with the first set of topics, we'll go through the layers in convolutional neural networks. If you recall from last lecture, we learned about the key layer in these models, which is the convolution layer. The way these layers work is that they have filters; you have a predefined number of filters per convolution layer, in this case six. They match the depth of your input data. So in this case we have a 32×32 RGB image, so we have three depth channels. Each of these filters slides across the image and calculates a score at each point.
[00:01:43] At that location in the image, you take the dot product of the values in the filter with the values in the image: you multiply these values together, sum them up, and then add a bias term. That's how you calculate each value in your output activation map on the right. So you have these sliding windows that go across the image, they calculate a score at each position, and that's how you get these activation maps, with one per filter. Normally we'll apply a ReLU or some other nonlinearity activation function at the end. This is from last lecture, so I won't spend too much time on it. The question is: for images the depth is equal to the number of channels, RGB, but here the depth of the output is six. So if we had a second convolution layer afterwards, its filters would need to go across all six of these activation maps.
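The dot-product-plus-bias step can be written out directly. A minimal sketch in plain Python; the patch, filter, and bias values are made up just to show the shapes:

```python
def conv_score(patch, filt, bias):
    """One output value of a convolution layer: the dot product of a
    filter with the image patch under it, plus a bias term. Both patch
    and filt are [depth][height][width] nested lists of equal shape."""
    s = bias
    for pc, fc in zip(patch, filt):       # over depth channels
        for pr, fr in zip(pc, fc):        # over rows
            for p, f in zip(pr, fr):      # over columns
                s += p * f
    return s

# Hypothetical 2-channel, 2x2 patch and matching filter.
patch = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
filt  = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
print(conv_score(patch, filt, bias=0.5))  # (1 + 4) + (0 + 1 + 1 + 0) + 0.5 = 7.5
```

Sliding this computation to every spatial position yields one activation map per filter, as described above.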
[00:02:32] So the next layer would have a depth of six. Okay. And then the second layer we talked about, which is much simpler than the convolution layer, is this idea of a pooling layer. Here it's still a sort of filter that we slide across the image, a 2×2 filter with stride two, so we're skipping over; we're not doing every single location. And here it's a max pooling, so we're just taking the max of each of these areas, and that's the value we get. Or you could do an average pooling. These are both commonly used, I would say, and if you're creating a new architecture, you would probably just try both of them and see what performs better. But the basic idea is to consolidate along the height and width dimensions of your image. Okay.
[00:03:15] So at this point in the course, we've basically gone over the whole top row here: convolution layers, pooling layers, and also the fully connected layers. Those were the first layers that we talked about, in the neural networks lecture, where it's basically one matrix multiply followed by an activation function. For the rest of this lecture, I'll talk about the remaining layers that you see in CNNs, at least the commonly used ones. These include normalization layers, which I'll go into; then dropout, which is a regularization technique that's used in the model architecture itself; and then finally we'll revisit the activation functions, and I'll tell you about the most commonly used ones, both historically and in the modern era of deep learning.
[00:03:56] So starting out with normalization layers, the basic idea here is that we're going to calculate statistics like the mean and standard deviation for our input data, use those to normalize the data, and then learn what the optimal distribution is for the model at that point. Very concretely, we learn parameters that will scale and shift our input data by a learned mean and a learned standard deviation. All of these normalization layers work in two steps. The first is to normalize the data coming in to be a unit Gaussian: mean zero, standard deviation one. Then we scale and shift it: multiply by some value to increase or decrease the standard deviation, and then shift it to change where the mean is. All normalization layers do this.
[00:04:47] But the way that they differ is in how they calculate the statistics: how are you calculating the mean and standard deviation, and which values are you applying these calculated statistics to? All normalization layers are doing this same high-level process. So I'll talk about layer norm, which is, I would say, the most commonly used normalization layer in deep learning today, and it's really commonly used in transformers specifically. So you can imagine you have some data coming in, X, which is a batch of size N. So we have N samples coming into our model, and each of these is a vector of dimension D. What layer norm does is calculate a mean and standard deviation for each of our samples separately. So we're calculating the mean along the depth, or the dimension D here, and likewise the standard deviation.
[00:05:39] Then we learn parameters, and these are learnable parameters, learned via gradient descent in our model, to then apply to each sample. So after we calculate our statistics in this way, treating each sample separately to calculate the mean and standard deviation, we apply these learned scale and shift parameters. So we subtract the mean and divide by the standard deviation within our input data to normalize it, and then we apply the scale with multiplication, and then the shift. So this is the idea behind layer norm, and at a high level, all of these different normalization layers are computing very similar things; the main difference is how they compute the mean and standard deviation.
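A minimal NumPy sketch of the two steps just described, for the (N, D) vector case; the function and parameter names, and the small `eps` added for numerical stability, are illustrative rather than from the lecture:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer norm over inputs of shape (N, D): each of the N samples is
    normalized with its own mean and standard deviation along D, then
    scaled by gamma and shifted by beta (learnable, shape (D,))."""
    mu = x.mean(axis=-1, keepdims=True)    # per-sample mean, shape (N, 1)
    var = x.var(axis=-1, keepdims=True)    # per-sample variance, shape (N, 1)
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 1: unit Gaussian per sample
    return gamma * x_hat + beta            # step 2: learned scale and shift

x = np.random.randn(4, 8) * 3.0 + 2.0                # N=4 samples, D=8 features
out = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```

With gamma = 1 and beta = 0 each output sample is (approximately) zero-mean and unit-variance; during training, gradient descent then moves gamma and beta away from these initial values.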
[00:06:30] So this is a really nice visualization from a paper called group normalization that introduces a new way to normalize. It's not so commonly used these days, I would say, but this is actually a really great way to gain intuition about how these different normalization layers differ. So for layer norm, I described the really simple case where we just have vectors that we're normalizing, but in the case of convolutional neural networks, we have a channel dimension, or the depth, and we have the height and the width, or the spatial dimensions of the image. So what layer norm does is, for each sample, we're still processing it separately, and we're calculating the mean across all of the channels, all of the heights, and all of the widths.
[00:07:11] So if we look back at this diagram here, you would basically be calculating one mean and one standard deviation over all of these values. So for each of our input data points, we're calculating one mean and one standard deviation across all of the channel, height, and width dimensions. This is what layer norm is doing. But you could feasibly imagine calculating these statistics differently. With batch norm, you're taking each channel, so each channel gets one mean and one standard deviation, and you're applying it just to that channel; you're averaging across all the data in your batch. Instance norm is even more granular, and then there's group norm.
[00:07:53] So I just want to point out that all these layers are trying to do the same thing, where you're basically normalizing your data and then having these learnable scaling and shifting parameters, but the way they do it differs because they're calculating the statistics using different subsets of your input data. Yeah. So the question is: for layer norm, are we calculating one mean and one standard deviation for each image or input data point? Yes, they're all calculated separately. But for batch norm, that would not be the case in this example here. For batch norm, it's actually within the mini-batch: when you're doing gradient descent, you have a small batch of data you're looking at, you feed it into your model, and you're calculating the per-channel mean and standard deviation based on all of the data in your batch.
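One way to make these "different subsets" concrete is by which axes of an (N, C, H, W) activation tensor each layer averages over; a sketch of the statistics step only (group norm omitted, and the variable names are mine):

```python
import numpy as np

x = np.random.randn(2, 3, 4, 4)  # (N, C, H, W): batch, channels, height, width

# Layer norm: one mean per sample, computed over (C, H, W)
ln_mean = x.mean(axis=(1, 2, 3), keepdims=True)  # shape (N, 1, 1, 1)

# Batch norm: one mean per channel, computed over (N, H, W), i.e. the whole batch
bn_mean = x.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, C, 1, 1)

# Instance norm: one mean per (sample, channel) pair, computed over (H, W)
in_mean = x.mean(axis=(2, 3), keepdims=True)     # shape (N, C, 1, 1)
```

The standard deviations are computed over the same axes, and the resulting statistics broadcast back over exactly the values they were computed from.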
[00:08:41] Yeah, I think if you can understand this diagram, you understand what all of the different normalization layers are doing. So it might be worthwhile after lecture, if you still don't fully understand it, just to go through and make sure you understand it: shaded in blue are the values we're both calculating our statistics over and then applying the mean and standard deviation to. Yeah, one final question and then we'll go on: is channel the same as the layers? So channel here is the depth, so the number of values you have at each spatial location. Okay, cool. So we've talked about normalization layers.
[00:09:27] The key idea is that you're calculating these statistics, applying them to your input data, and then learning a scale and shift parameter that you then apply. So the next type of layer we'll talk about is called dropout, and this is a regularization layer in CNNs. This is the final layer that you'll need to learn before we can start going through all of the different CNN architectures that people have created over the years. So with dropout, the basic idea is to add randomization during the training process that we then take away at test time, and the goal is to make it harder for the model to learn the training data so that it will generalize better. So this is a form of regularization. The way we do it concretely is that in each forward pass of our layer, we'll randomly zero out some of the outputs, or activations, from that layer.
[00:10:22] And the main parameter you have for this dropout layer, which is just a fixed hyperparameter, is the probability of dropping out the values; 0.5 is probably the most common, and 0.25 is also commonly used. So you're just dropping out a fixed percentage of the values. And so, going forward to the next layer, these would be zero, and you don't really need to calculate those values. Basically, all of those outputs are zero at this point, so there are some tricks you can do with masking so that you don't even need to compute them, because zero times any value will be zero. So you might ask, why does this work? And I would say this is more of an empirical thing than something well understood from a theoretical standpoint.
[00:11:13] But there are actually some ways you can view what dropout is doing to gain intuition for why it might be useful. It basically forces your network, you can imagine, to have redundant representations. So suppose we have a list of features that we're learning at a given layer, say the layer right before the output of our model, and we have a CNN that is extracting each of these features, so it can detect if there are ears in the image, or if there's a tail, if it's furry, if it has claws, and you want your model to output the probability of a cat score. One of the things that's useful about this is that, because some of these values might randomly be dropped out during training, your model can't over-rely on certain features being present
[00:11:55] in some of the classes, and it actually needs to learn a broader set of correspondences between your features and your output classes. So the model can't just hard-focus on: okay, well, if it has an ear and is furry, and it just so happens that these are always cats, or if it has claws and it has an ear, it'll almost always be a cat in your dataset. So it'll actually help you generalize better to new features, despite the fact that in your dataset there might be really strong correlations between the co-occurrence of certain features and your output class. By having dropout, you're essentially making it so the model can't rely on these during the training phase, because it won't always see the pairs of features together.
[00:12:41] So this is an example for cat, and the question is: if we had something like tree instead, how would you determine which features to drop out? So the dropping-out part is actually completely random; we're not making any choices about this. In this case, 50% of your features at any given step will be dropped out and set to zero. So yeah, you don't have to make choices about it, which is kind of nice, but it is completely random. How would the model know if you're only seeing a subset of the features, like tail and claw here? The point is that you will actually do worse on the training data, because you're only seeing a subset of the features. So it does make the model worse by not having all the information, but then it does better at test time; that's the idea.
[00:13:24] So, worse at training time and better at test time, because at test time you no longer have this dropout. So the final component here, which maybe I should have explained first before fielding questions, is that at test time you're no longer dropping out any of the values. This is randomness that we're adding during the training phase only. At test time, we never mask any of the output activations, and we remove this dropout idea altogether. Now, one thing we need to note is that because we were dropping out 50% of the activations during training, at test time you basically have twice as many values being input to each of your layers. And this can cause issues if you don't scale it.
[00:14:07] So what you need to do is multiply by the probability of keeping a value, so that the magnitude of the values coming into each layer is preserved between training and test time. Otherwise, if you're dropping 50% of the values during training and then at test time you just include all of them, you'll get really weird behavior, because you'll be seeing a much larger magnitude of inputs than before. Yeah. So what about backward prop? For backprop, when you have these zeroed values, you don't need to traverse that path of your directed graph anymore. It's very similar to ReLU: if you have a zeroed value at that point, the gradient becomes zero, so anything further back in your computational graph gets no gradient calculated through that path.
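A minimal sketch of the train/test recipe described here, with `p_keep` as the probability of keeping a value (the names are mine; note that many libraries instead implement "inverted dropout", dividing by `p_keep` at training time so that test time needs no scaling):

```python
import numpy as np

def dropout_train(x, p_keep, rng):
    """Training time: randomly zero each activation with probability 1 - p_keep."""
    mask = rng.random(x.shape) < p_keep  # True where the activation survives
    return x * mask                      # zeroed paths also get zero gradient in backprop

def dropout_test(x, p_keep):
    """Test time: keep every activation but scale by p_keep, so the expected
    magnitude of the inputs to the next layer matches training time."""
    return x * p_keep

rng = np.random.default_rng(0)
x = np.ones(1000)
train_out = dropout_train(x, p_keep=0.5, rng=rng)  # about half the values are zero
test_out = dropout_test(x, p_keep=0.5)             # everything kept, scaled by 0.5
```

On average E[train_out] = p_keep · x, which is exactly what test_out computes, so the two regimes agree in expectation.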
[00:15:00] If you're dropping out certain values or activations, the weights associated with those specific activations will not be updated during gradient descent. Yeah. So the question is, and maybe I'll reframe it: what are we doing at test time? At test time, we are using all of the output activations; we're not dropping them out anymore, but we need to scale by the keep probability. So we multiply each of our output activations by this p value, because now we're using all of them. Otherwise, you can imagine that each node sees a significantly higher number of inputs at test time than it did during training.
[00:15:37] So you need to multiply by this p value to maintain the same magnitude of your inputs coming in, and the variance stays the same, and all these different properties work out very nicely if you do it like this. Yeah. So the question is: can you just add noise to the image instead? The answer is yes, and we'll go over how to do that in future slides. Yes, that's a great idea, to add noise to your image. Okay, some specific code here. I won't go over this because we already mentioned it, but you're dropping a p percentage of your activations here, and then you multiply here at test time. Okay. The next topic I'll talk about is activation functions. So you've all basically learned all of the key layers in CNNs now, and next we're going to talk about these activation functions.
[00:16:21] If you remember, the whole point of these activation functions is to introduce nonlinearities into our model. Right now, with these convolution operators, the kernel sliding across the image, and the fully connected layers without activations, they're all just linear operations, because they're multiplications and additions. And the whole point of the activation function is to add nonlinearity. So, historically, sigmoid was a really commonly used activation function, but there's actually a key problem with sigmoid that is the reason why it's no longer used today. Sigmoid, if you graph it, looks like this; you can see the equation in the top right of the slide here.
[00:17:04] And the main issue is that, empirically, what happened was that after many layers of sigmoids, you would get smaller and smaller gradients as you compute backprop. So starting from the end, the gradients are fairly large in magnitude, and as you undergo multiple layers of backpropagation toward the initial, early layers of your model, you would get smaller and smaller gradients. So I'll actually open this question up to the class: this is a phenomenon we see occur with sigmoid, so in what regions of our graph does sigmoid have a really small gradient? Yeah, very negative and very positive values is correct, and this is actually a huge issue.
[00:17:44] I mean, you can visually see here in the graph that the gradient is very flat: if you take the derivative here, it's very small. So basically, for almost all of our input space, from negative infinity to positive infinity, you have very small gradients, and it's only this narrow range in the middle where you have something that's nonzero. It approaches zero very quickly on both extremes, and this means that if the values coming into sigmoid are very large or very small, then your gradient will be very small. This is one of the main reasons why ReLU became super popular, because now in the positive region we don't have any of this behavior; it's just a derivative of one there. But in practice, you still have this flat portion on the left, where your gradient is zero. So now we basically have half of our input domain here:
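To see the vanishing gradient concretely, here is a small sketch of sigmoid and its derivative; the identity σ'(x) = σ(x)(1 − σ(x)) is standard, and its maximum is only 0.25, at x = 0:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))    # 0.25, the largest the gradient ever gets
print(sigmoid_grad(10.0))   # ~4.5e-5: a large input gives an almost-zero gradient
print(sigmoid_grad(-10.0))  # same for a very negative input, by symmetry
```

Since each sigmoid layer multiplies the upstream gradient by at most 0.25, stacking many of them shrinks gradients geometrically on the way back to the early layers.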
[00:18:37] we get a gradient of one, and the other half is zero, which is better than almost all of it being zero or very close to zero except for a small region in the middle. So in practice these work better, and it's also much cheaper to compute a max operation between zero and your input value than to compute the sigmoid function. For those two reasons, ReLUs became super popular. But you still have the issue that for any negative input, you get a zero gradient. So more recently there have been popular activation functions that avoid this by having a non-flat section of the activation function in the neighborhood near zero. This is GELU, and there's also SiLU, which I'll show on a slide but won't go over the formula. They look very similar.
[00:19:25] The basic idea is to smooth out the non-smooth jump in ReLU's derivative, from 0 to 1 at x = 0. ReLU is a very sharp, non-smooth function there, but the nice part about GELU is that we actually have nonzero gradients here, and in the limit as x approaches infinity or negative infinity it converges to ReLU as well, while giving smoother behavior in the middle. Specifically, what GELU calculates is x · Φ(x), the Gaussian error linear unit, where Φ is the cumulative distribution function of a standard Gaussian. So if you imagine the area under the curve of a Gaussian, that's what Φ(x) is at any point x.
[00:20:11] So if you have a really negative value, Φ(x) is close to zero, which is why GELU converges to ReLU's zero there; and at a very positive value Φ(x) gets very close to one, the full area under the curve, so GELU converges to x. So this is GELU: it has these nice properties, it converges to ReLU at the extremes, and it is the main activation function used in transformers today. [00:20:38] If you look at all of these and squint, a lot of them look the same. The basic idea is something relatively flat that, in the limit, approaches f(x) = x and becomes a linear line. SiLU is actually x · sigmoid(x), which also has the property that for a very negative value the sigmoid factor is close to zero and for a very positive value it's close to one. So it behaves similarly to the cumulative distribution function Φ of the unit Gaussian, which is why the shapes look really similar too. [00:21:15] Okay. So you might ask where these activations are used in CNNs, and the general answer is that they're placed after linear operators. Almost any time we have a feed-forward layer, a linear layer, or a fully connected layer (these are all words for the same layer: a matrix multiply followed by an activation function), or a convolutional layer, that's where we place the activation function: after the convolutional layer or after these linear layers.
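The three activations discussed above fit in a few lines each; a sketch (editor-added, using the exact error-function form of GELU rather than the tanh approximation some libraries use):

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # GELU(x) = x * Phi(x); Phi is the standard-normal CDF, written via erf
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

def silu(x):
    # SiLU (a.k.a. Swish): x * sigmoid(x); sigmoid plays the role of Phi here
    return x / (1.0 + math.exp(-x))

for x in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"x = {x:5.1f}   relu = {relu(x):8.4f}   gelu = {gelu(x):8.4f}   silu = {silu(x):8.4f}")
```

Both GELU and SiLU match ReLU at the extremes but stay smooth, with nonzero gradient, near zero.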
[00:21:45] Okay, so you've now learned about all the components of CNNs, and I'll go through some examples of how we put them together and how people have created state-of-the-art convolutional neural network architectures. [00:21:58] I think this is a really neat slide because it plots two different values. On one hand we have the error rate, the blue bars, over time, for different models people have trained on ImageNet; and the orange triangles represent the number of layers those models have. You can see that at the same point where we get a significant drop in error, where we actually surpass human performance for the first time, we see a huge increase in the number of layers. We'll go over in class today how they were able to achieve this and what the design challenges and goals were for how they did this.
[00:22:36] Historically, AlexNet was the first CNN-based paper that worked really well on ImageNet, and they were able to train it by using GPUs. We talked about this earlier in lecture, so I won't spend too much time on AlexNet from a historical lens, but I do want to compare it to another architecture called VGG, which was a really standard and commonly used architecture in the 2010s. I can plot the two CNN architectures side by side here. [00:23:06] In general in AI, we like to draw our model architectures as block diagrams, where each block represents a different layer or a group of layers stacked together. It also helps you gain intuition about the general differences at an initial glance. The orange blocks, which are the common ones here, are 3x3 convolution layers.
[00:23:29] So these are convolution layers with 3x3 filters sliding across the input. Their stride is one, so they visit every location in the image, not skipping anything, and they add padding of one around the outside so that we're not shrinking the activation as we apply these convolution layers. They also add max pooling layers throughout. [00:23:53] And you'll notice that, for all of these, after the last pooling layer they do two fully connected layers of dimension 4096, followed by one of dimension one thousand. The reason we have a thousand at the end is that ImageNet has a thousand different image categories, so we need scores for each of those categories. The final layer is always equal to the number of classes in your image classification problem.
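The claim that 3x3/stride-1/padding-1 convolutions preserve spatial size follows from standard convolution arithmetic. A small sketch (editor-added; the 2x2/stride-2 pooling size is the usual VGG choice, assumed here since the transcript doesn't state it):

```python
def conv_out_size(in_size, kernel, stride=1, pad=0):
    # standard convolution arithmetic: floor((in + 2*pad - kernel) / stride) + 1
    return (in_size + 2 * pad - kernel) // stride + 1

# 3x3 conv, stride 1, padding 1: spatial size is preserved
print(conv_out_size(224, kernel=3, stride=1, pad=1))   # 224
# 2x2 max pool with stride 2 halves it (assumed VGG-style pooling)
print(conv_out_size(224, kernel=2, stride=2, pad=0))   # 112
```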
[00:24:22] And so you can see VGG actually looks extremely similar; it's sort of a scaled-up version of AlexNet with more layers, and they're now doing three groups of convolutions at a time followed by pooling, rather than two convolution layers per pooling stage, or even one. It's actually pretty remarkable that there are basically only three different types of layers in these models, yet they performed extremely well compared to anything people had tried before that point. These are, I would say, the simplest models we're going to discuss today. [00:24:54] But you might ask why they're doing 3x3 convolutions: how did they pick that value? There is actually some intuition behind how they chose 3x3, and specifically why they have groups of three or even four of these. So I'll ask you all a question.
[00:25:12] What is the effective receptive field? We looked at receptive fields last time, but it's basically the parts of your input image that a particular value in your activation map has seen: which input values have been used to compute that value after many layers of your model. So we have three of these layers that are all 3x3 convolutions with a sliding filter of stride one. What is the effective receptive field of each value in activation map A3 here, after the third layer? [00:25:46] I'm showing one of the layers here. You can see that each value in A3 is computed by looking at a 3x3 grid of values in A2; then for each value in A2, a 3x3 grid in A1; and for each of those, a 3x3 grid in the input. I'll let you all think about this for a little bit.
[00:26:05] Maybe it will help to see the next layer here; it is actually really helpful to visualize this. At A1, each of the corner values is calculated from its own new 3x3 grid. So, from our input, how large is the overall square? 7x7, yes, exactly. This first one is 3x3, the next is 5x5, and the next is 7x7, and we can visualize that here pretty easily. [00:26:34] The nice thing about the 3x3 convolution with stride one is that you're always adding two to your receptive field at each layer, because each point looks one position to the left, to the right, above, and below. So after many blocks of those, you're just adding two each time.
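The +2-per-layer rule can be sketched directly (editor-added; stride-1 stacks only, as in the lecture's example):

```python
def receptive_field(num_layers, kernel=3):
    # for stacked stride-1 convolutions, each layer adds (kernel - 1)
    # to the receptive field: 3x3 layers add two per layer
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf

print([receptive_field(n) for n in (1, 2, 3)])   # [3, 5, 7]
```

Three stacked 3x3 layers see the same 7x7 input window as a single 7x7 convolution.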
[00:26:56] Okay. So we basically just showed that a stack of three of these 3x3, stride-one convolution layers has the same effective receptive field as one 7x7 layer. [00:27:08] Yeah, so the question is how much of this is justification after the fact versus intuition that they then used to design the experiments. I think it probably depends on the architecture. For some of them it's more intuition-focused; the one we're going to cover next really was a whole research direction spawned by an empirical finding and a thought experiment about it. That's ResNets, and there I think there's actually pretty good intuition that led the whole investigation into what would work well.
[00:27:35] But for this one, I can't speak for the authors on whether it's justification after the fact, based on empirical findings, or whether it was involved in the design choices, because I haven't seen them speak publicly about that. For ResNets, though, I do know it was the hypothesis that led to the creation. [00:28:00] But this is actually a really nice property: three of these 3x3s have the same effective receptive field as one 7x7 layer, and they actually have fewer parameters too. Imagine the channel dimension staying the same at C. Each 3x3 filter spans your input channels, so it has 3 · 3 · C values; and if we have C of these filters, that's 3 · 3 · C · C, or 9C², per layer, and we have three layers total, so 27C².
[00:28:30] So if we look at it through this lens, it's actually fewer parameters than one 7x7 layer's 7 · 7 · C · C = 49C², and we're building a more complex, more nonlinear model, since each 3x3 layer gets its own activation function. So: fewer parameters, and it can model more complex relationships among your input data. This is maybe why stacking these 3x3 layers together can be better than sliding one larger filter across. [00:29:03] Okay. I'll now talk about ResNets, which very much brings up the thought experiment someone just asked a question about.
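The parameter comparison above can be sketched as follows (editor-added; the channel width C = 64 is a hypothetical example, and biases are ignored):

```python
def conv_params(kernel, in_ch, out_ch):
    # each filter has kernel*kernel*in_ch weights; one filter per output channel
    return kernel * kernel * in_ch * out_ch

C = 64  # hypothetical channel width; the slide keeps C fixed across layers
stacked = 3 * conv_params(3, C, C)   # three 3x3 layers: 27 * C^2
single = conv_params(7, C, C)        # one 7x7 layer:    49 * C^2
print(stacked, single)               # the stack has fewer parameters
```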
[00:29:12] So there was actually an empirical finding that spawned a lot of the conversation and thought around designing ResNets, and the idea was this: if you keep stacking deeper layers on a plain CNN, something like this, just adding layers and building it larger and larger, what happens? What they found is that the 20-layer model will actually have lower test error than a 56-layer model. You might think this is because of overfitting, but it's actually not, because if we look at the training error, the training error of the 20-layer model is also lower. Lower training and lower test error basically means the smaller model is doing better on all accounts. So why is this 56-layer model performing worse than a 20-layer model?
[00:30:01] It might be confusing, and we know, as I mentioned right before, it's not caused by overfitting. These deeper models have more representational power, and theoretically they should be able to represent any model that a shallower network can. The set of possible mappings between your inputs and outputs for the larger network is a superset of the set for the smaller network, because theoretically you could imagine setting some of the layers to be the identity function, layers doing nothing; if you set half your layers to do nothing, you have exactly the same representational power as a model half the size.
[00:30:49] So the idea is not that these models are worse in terms of representational power; it's that they're actually harder to optimize, because the set of possible models for the deeper network is larger and contains all of the possible models the shallower network could learn. [00:31:11] I sort of hinted at it before, but how specifically could the deeper model learn to be at least as good as a shallow model? If we have a one-layer model on the left and a two-layer model on the right, and we set one of the layers to essentially be an identity matrix, just an identity function, then the two-layer model should be at least as good as the shallow model. So how do we actually build this intuition into our models?
[00:31:44] We want them to be able to be just as good as a shallower model, if they want to be, during optimization. The way we do this is by fitting what's called a residual mapping, instead of directly trying to fit the desired underlying mapping. What this looks like is: we take the value x and copy it over, past our convolution layers, so that the value at this point receives x, our original input, as well as the output of our stack of two convolutions. [00:32:19] So at this point, F(x), which is called the residual mapping here, could just learn zero values for all the conv filters; the output here would be zero, we would add x along the skip path, and we would get x.
[00:32:36] So it gives the model a very simple way to bypass these layers if it doesn't need to learn anything for them. What this means is you can now really easily learn the identity function we talked about earlier, just by learning zero filters, filters filled entirely with zero values. [00:32:55] Or, more practically, the layers just need to learn very small values, because instead of learning the entire mapping from x to H(x), they just need to learn the difference, F(x). So you're now just learning the difference between your desired output and the copied-over input. This is called a residual block, or residual connection: you copy values from an earlier layer into a later layer of your model and then add them to the values at that point.
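A toy sketch (editor-added) of the residual computation y = F(x) + x, with hypothetical stand-in layers; in the real block, F would be two 3x3 convolutions with activations:

```python
def residual_block(x, layer1, layer2):
    # y = F(x) + x: F is the stack of two layers; the skip adds x back unchanged
    fx = layer2(layer1(x))
    return [f + xi for f, xi in zip(fx, x)]

# Hypothetical stand-in for conv layers whose filters have all learned zero:
zero_layer = lambda v: [0.0] * len(v)

x = [1.5, -2.0, 3.0]
# With all-zero filters, F(x) = 0, so the block reduces to the identity:
print(residual_block(x, zero_layer, zero_layer))   # [1.5, -2.0, 3.0]
```

This is exactly the easy-identity argument from the lecture: the block passes x through untouched when its layers learn nothing.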
[00:33:32] So I talked a bit about the intuition, which was this observed phenomenon that these larger networks were achieving worse training and worse test error because they were harder to optimize. So the intuition was: we need to build a model that can really easily model the shallower networks, so it can be at least as good as a shallower model. The way they did this was by adding a residual connection, so that you can just copy over the values easily, building that into the architecture itself rather than trying to learn some identity mapping among the convolutional layers. And empirically this was shown to work extremely well too. Yeah. So what does the residual block carry? So we have our input x, we pass it through two different convolutional layers, and we get our output F(x). x is literally just copied over here.
So this is [00:34:17] exactly the same as x, and we add it to the output of these two blocks, which is F(x). Yeah, x is the output of one of the previous layers, or, if it's the very first layer of the model, it would be the image. [00:34:28] Yeah. So the question is: maybe you just don't have enough data, and if you added enough data, then maybe you could train a model without these blocks? I think these blocks actually do help you a lot with learning from more data. I think the issue was really an optimization problem. So transformers use residual blocks for exactly the same reason, because I think it actually helps you model these more complex models and it actually enables you to use more data.
So I think [00:34:56] it's very good: residual blocks help you use more data more efficiently, because you're able to more easily model a greater number of functions. [00:35:05] Yeah. So the question is that maybe if we just trained for longer, the performance would eventually converge to the value of the smaller network, and maybe it's just harder to optimize because it takes longer to train a larger model. And I think the answer is no: these were not converging to the performance of the smaller model, regardless of how long you trained it. And the reason is because it's getting stuck in, essentially, local minima, and when you add these residual connections you're avoiding these. The actual explanation for why this is the case is still, I would say, a more active area of research.
It's [00:35:51] really hard to understand exactly what causes these models to get stuck in local minima and not find an optimal solution, or what causes them to not train and find better solutions. And oftentimes this is really an empirical finding, but there's some intuition behind it. And in this case the intuition was that we want to enable our models to do at least as well as the shallower models, which we know were performing better at the time. So it's not that you could just train it for longer and it would do better. It was actually a limitation: it was just completely unable to do as well as the shallower models. [00:36:34] Okay. So here's the overall ResNet architecture. We have these stacks of residual blocks now. So that's what these two blocks here together mean: it's a residual block.
So we have a [00:36:48] 3x3 convolution with a ReLU, followed by another 3x3 convolution, and we're copying over this x value here, adding it to the outputs here, and then we have a ReLU afterwards. So each of these pairs of blocks is one of these residual blocks, and that's why you see this line skipping over here: because the value is getting added forward. [00:37:06] The cool thing about ResNets also is that they basically created a lot of these different depths. So they created a whole family of models, some smaller and some larger. And they showed that as they increased the number of layers, their performance kept increasing, albeit the difference in performance became smaller as you got to the larger and larger models.
So it was sort of reaching a [00:37:32] point where, given the dataset, they weren't able to scale any significant amount by adding more layers beyond that; but they saw significant improvements in performance among especially the earlier models, and then from 101 to 152 layers is where the performance wasn't really changing; it was marginally better, but the performance change was maybe only about 1% at that point. [00:37:50] Yeah. How did they get the number of 152? I actually don't know how they got the number of 152. I think they wanted to try different values here, and you can see that, I mean, they're not exactly doubling, but there's sort of a significant increase each time. I don't know how they picked 152. That's a good question. Maybe they showed it somehow worked better than others; I actually don't know though.
So generally, when you're trying multiple [00:38:15] different numbers of layers for your model, say these are the numbers of layers you want to try, what you'll do is first train the smallest model and see its performance, then add more layers and see if your performance increases, and so on. So that's probably why they stopped at 152: because performance wasn't increasing as much anymore. And also there are GPU memory limitations. So as you get larger and larger models, it becomes harder to train from a hardware perspective, because you need to fit more parameters into your GPU memory. So there is a limit, given your compute setup, on how large a model you can train. I think you need to train these models separately, though. So you have one model run for 18 layers and one for 34, etc.
So the question is [00:39:01] how to think of the intuition of CNN blocks given we're using these residual connections. You can still think of it as higher levels of abstraction, and this is shown to be true in the layers. So within the block itself, instead of just learning the higher-level features, you're learning the delta from the original image to get the higher-level features. That's what you're learning in the block. So you're learning the delta, but you're still achieving these higher-level representations at each step. So that part is the same, but the actual functional way of doing it is that you learn this F(x) that you add your previous input to. So it's like you're learning the delta. [00:39:42] The question is: if you do addition, does that require you to have the same tensor size? The answer is yes.
[00:39:51] And it's part of the reason why it's really a nice property that all of these have this 3x3 convolution with stride one and padding one, so that you maintain the same size at every layer going forward. So after, say, a pooling layer, I mean, you could maybe come up with a way to do it where you sort of unpool the values, but after a pooling layer, for example, you couldn't do a naive addition anymore, because the sizes of your tensors are different. So these are done before a pool, at least the regular ones. I mean, you could get around it by just having each value be spread out into multiple values, for example. [00:40:31] Okay. So these are basically the main takeaways for ResNets.
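The shape bookkeeping behind this answer follows from the standard convolution output-size formula; a small sketch (the specific sizes below are illustrative, not from the lecture):

```python
def conv_output_size(n, k, stride, pad):
    # standard conv/pool output-size formula: (n + 2*pad - k) // stride + 1
    return (n + 2 * pad - k) // stride + 1

# 3x3 conv, stride 1, padding 1: spatial size is preserved, so the skip
# connection's elementwise addition lines up with the branch output
same = conv_output_size(56, 3, 1, 1)   # stays 56

# 2x2 pooling with stride 2 halves the size, so a naive addition with
# the pre-pool tensor would no longer have matching shapes
halved = conv_output_size(56, 2, 2, 0) # becomes 28
```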
One other neat trick they do is [00:40:38] that periodically, after a certain number of these blocks, they'll double the number of filters and downsample the spatial dimension. So basically you can imagine that if you start with a really flat image, as the activations get pushed through the network, they become smaller spatially, but then the depth is larger. So this is how to think of it, and at the very end it just becomes a vector that you then use for classification. So that's how you should be visualizing what's happening to the values in the network itself, and the shape of them. [00:41:10] And then one other thing that's somewhat unique to ResNets, though other architectures do this too: before all these layers with the residual blocks, they have this relatively larger convolution layer, and it was just empirically shown that it did better if they added this here.
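The halve-spatially, double-the-filters progression can be sketched like this (the stage sizes below are illustrative examples, not the exact table from the paper):

```python
# Sketch of how a ResNet-style network trades spatial size for channel
# depth: each downsampling stage halves height/width and doubles filters.
shape = (56, 56, 64)          # (height, width, channels) after the stem
for _ in range(3):            # three downsampling stages
    h, w, c = shape
    shape = (h // 2, w // 2, 2 * c)

print(shape)                  # small spatially, deep in channels
# pooling/flattening then turns this into a vector for the classifier
```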
So [00:41:26] this one is a purely empirical finding. [00:41:29] Okay. Basically, to highlight: these larger models did extremely well. It was the first time they were able to successfully train models with 100-plus layers. So it was a really big deal, and ResNets were then used in a huge variety of computer vision tasks. Almost every task in computer vision was using a ResNet at the time, because they performed so well thanks to these residual connections. [00:41:54] Okay. So we talked about some CNN architectures, the main one being ResNet, and then also VGG historically. So we talked about why the smaller filter size is useful, and why having many layers of these is useful.
So the final thing I'll talk about, [00:42:09] in terms of how we actually construct the CNNs and prime them to be ready for training, is how you actually initialize the weight values of the individual layers. [00:42:19] So depending on what values you choose, you could either put in values that are too small or too large, which would cause significant issues for your model during training. So here it's basically a six-layer network with 4096-dimensional features, just six fully connected layers, and we initialize them here. We take unit-Gaussian random values and then multiply them by 0.01, to get very small values close to zero, and we have a ReLU at each layer too.
So if you [00:43:00] plot the forward pass of this model, you would actually see that at the beginning, because it's ReLU, all the means are going to be positive, and you'll have a mean and standard deviation that is relatively high; but as each layer progresses, because we had a really small weight initialization, the mean and standard deviation become smaller and smaller. And really, ideally, we would want basically all of these to be the same for each layer, because that makes our optimization problem much nicer to solve. [00:43:28] So if we instead use 0.05 rather than 0.01, can anyone imagine what the issue might be if we set it to too large a value? So when it's too small, it goes to zero, basically. What happens if it's too large? Yeah. Basically the activations get larger and larger at each layer.
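The shrinking-activations demo described here can be reproduced with a short numpy sketch (1024 dimensions instead of the lecture's 4096, purely to keep it fast; the effect is the same):

```python
import numpy as np

rng = np.random.default_rng(0)

# Deep fully connected ReLU net, weights = unit Gaussian scaled by 0.01,
# mirroring the lecture's demo (1024 dims here as a fast stand-in).
dim, layers = 1024, 6
x = rng.standard_normal(dim)
stds = []
for _ in range(layers):
    w = 0.01 * rng.standard_normal((dim, dim))
    x = np.maximum(0.0, w @ x)      # linear layer followed by ReLU
    stds.append(float(x.std()))

# with too-small weights, the activation statistics collapse toward
# zero layer by layer
print(stds)
```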
So if you plot [00:43:51] it here, you can see that by the end there's just a massive mean and standard deviation, and if you're training a 152-layer ResNet, you can imagine this becomes quite an issue very quickly. So how do you actually handle this? In this case, maybe the optimal value is 0.022 or something, but how would you actually know that, and how would you do this more generally, for any layer? [00:44:14] There are a few different ways you can initialize weights, and I'll go over the most commonly used one today in class, but know that there are other ones. And generally, what they're a function of is the dimension of your values here. So you'll have a different value for a 4096-dimensional fully connected layer versus a 2048-dimensional one.
And the specific formula [00:44:39] we'll go through is called Kaiming initialization; it's actually from the same person who created ResNets. So, Kaiming He: he was, and still is, a very famous computer vision researcher. I think he's one of the most widely cited computer scientists of the last 10 or 15 years, maybe the most. So he's extremely well known in the computer vision community, and he also came up with this idea of initializing the values with the square root of two over your input dimension size. I won't go over all the details of how they derived this and showed that, with a ReLU activation, this keeps the standard deviation and mean relatively constant throughout the layers. But if you do plot it, you see it does have this effect.
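The same toy network from the lecture's demo, but with Kaiming initialization, i.e. weights drawn with standard deviation sqrt(2 / fan_in) (again a sketch with 1024 dimensions rather than 4096):

```python
import numpy as np

rng = np.random.default_rng(0)

# Kaiming/He initialization for ReLU nets: w ~ N(0, 2 / fan_in).
dim, layers = 1024, 6
x = rng.standard_normal(dim)
stds = []
for _ in range(layers):
    w = rng.standard_normal((dim, dim)) * np.sqrt(2.0 / dim)
    x = np.maximum(0.0, w @ x)
    stds.append(float(x.std()))

# the activation std now stays roughly constant across layers, instead
# of collapsing to zero or blowing up
print(stds)
```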
So you can almost think [00:45:22] of this as a magic formula: if you plug it in, you get the desired properties. And if you want to know the derivation, we link the paper here, so feel free to look into that, but you can just take our word for it. I won't go through the details here, but it does have this desired effect where the mean and standard deviation are unchanging. And you can also imagine that for any given setup, you could also just try, through testing, to find what the value is here. [00:45:46] Okay. So we discussed how you initialize weights, how you combine these different layers together to form a CNN architecture, which activation functions people use, and then all the different layers in CNNs. So we've already covered quite a few topics.
So I think I'll pause very [00:46:06] briefly to see if there are any questions about these. The second part of the lecture is actually much less dense than the first part; we'll mainly be going over a lot of nice practical tips for when you're training these models. [00:46:17] So the question is: how do you do weight initialization for CNNs? You still use this same initialization, but the dimension in here is the size of your kernel. So if you have a 3x3 kernel with, say, six channels, it would be 3 times 3 times 6. It's the same idea; you just calculate your dimensions differently depending on the type of layer. Yeah. [00:46:39] You can think of it as roughly the number of values in each operation, but it does depend on the layer, and some layers use different weight initializations; this is specifically how Kaiming initialization applies to CNNs. Yeah.
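A tiny sketch of the fan-in computation just described, for a fully connected layer versus a conv layer (the 4096 and 3x3x6 numbers are the ones mentioned in the lecture):

```python
import math

def kaiming_std(fan_in):
    # He initialization scale for ReLU layers: sqrt(2 / fan_in)
    return math.sqrt(2.0 / fan_in)

# fully connected layer: fan_in is just the input dimension
fc_std = kaiming_std(4096)

# conv layer: fan_in = kernel_height * kernel_width * input_channels,
# e.g. a 3x3 kernel over 6 input channels as in the question
conv_fan_in = 3 * 3 * 6
conv_std = kaiming_std(conv_fan_in)

print(conv_fan_in, conv_std)
```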
So the question is: why do your [00:46:57] activations explode if you have too large an initialization value? So imagine that at each layer of your initialized network you have a set of randomly initialized values, and if they're very large, then the ReLU activation afterwards doesn't actually cap the outputs of your layer, right? You can go to infinity with ReLU. So if you have too large a set of values, you're essentially repeating the same operation, because you're initializing all the weights from the same random distribution. Then at each layer, you'll be multiplying one set of large values by a set that's been initialized too large, so it becomes larger at each iteration afterwards.
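The blow-up described in this answer is the mirror image of the earlier shrinking demo from the lecture; scaling the same toy network's weights too large makes the per-layer standard deviation grow (1024 dimensions and a 0.1 scale here, both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy ReLU network as the lecture's demo, but with weights scaled
# too large (0.1 is an illustrative value for this layer width).
dim, layers = 1024, 6
x = rng.standard_normal(dim)
stds = []
for _ in range(layers):
    w = 0.1 * rng.standard_normal((dim, dim))
    x = np.maximum(0.0, w @ x)
    stds.append(float(x.std()))

# now the standard deviation grows at every layer instead of shrinking
print(stds)
```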
[00:47:46] I mean, you could think of it as sort of like a recurrence relation, because they're all initialized randomly at the start, where something is being multiplied by a value at each step, and in a simple recurrence relation you'd want that value to be one, right? But because we have a vector of values being multiplied by our matrix, the average output depends on the dimension of the vector, and on what happens after the ReLU: you have basically a standard deviation for your activations, then you remove all the negative ones, and you're left with your outputs at that point. And if you have really large values, you have a really large standard deviation, so when you remove the bottom half of it, your output keeps moving more and more positive. Did that make sense?
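A rough numerical sketch of this recurrence argument (numpy; the function name, depth, and layer width are mine): push random inputs through a stack of linear + ReLU layers and watch the activation scale either blow up or stay stable depending on the weight scale.

```python
import numpy as np

def forward_stds(weight_std, depth=10, dim=512, seed=0):
    """Push random inputs through `depth` linear+ReLU layers whose
    weights have standard deviation `weight_std`, recording the
    activation std after each layer."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(100, dim))
    stds = []
    for _ in range(depth):
        w = rng.normal(0.0, weight_std, size=(dim, dim))
        x = np.maximum(x @ w, 0.0)   # ReLU never caps the positive side
        stds.append(x.std())
    return stds

too_big = forward_stds(0.1)                  # larger than the Kaiming scale
kaiming = forward_stds(np.sqrt(2.0 / 512))   # Kaiming scale for fan_in = 512
```

With the too-large scale, each layer multiplies the activation scale by a constant factor greater than one, exactly the recurrence described; with the Kaiming scale the std stays roughly flat across depth.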
[00:48:32] It's sort of, I didn't have slides to show it, but okay. Mostly, sorry. Yeah, you can see more details in the paper; it's actually not too bad to read, I think. So the conclusion of the discussion here is that normalization will solve this issue of the activations blowing up, but it still might be harder to optimize. Maybe we should do a follow-up post on Ed explaining this in more detail, but I think it's a really good question actually. Yeah. It would solve this particular issue, I think, but maybe it's still hard to do something like this, as in the discussion. Yeah, it's a good question. Okay, cool. So I'll talk about these steps now. How do you actually train your model? The nice thing about data preprocessing is that it's really easy for images.
[00:49:21] So if you have your giant image data set, the standard way to do it is you calculate the average red, the average green, and the average blue pixel values, along with the standard deviations, and you take your input image, subtract the mean, and divide by the standard deviation. And this is how you do data normalization for images; it's actually very straightforward. It does require you to precompute the means and standard deviations for each pixel channel. So sometimes what people will do is use means that have already been calculated; a very common one is to use the ImageNet means and standard deviations and apply those to your input images, even if you're training a model not on ImageNet.
[00:50:03] So it is very data-set dependent, that's the way to think of this, and different models will use different values here depending on their data set, but the most commonly used choice is just the mean and standard deviation from ImageNet. Yeah. So for any input image, you apply this operation before the model sees it. Okay. So yeah, that one was really quick. And then in terms of data augmentation: someone had a suggestion earlier in the class, why don't we just add noise to our image? And that's a great idea, and we'll talk about the different ways you can add noise to your image here. This helps with regularization and helps prevent your model from overfitting.
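As a concrete sketch of the normalization step (numpy; the constants are the widely used per-channel ImageNet statistics that common libraries ship for images scaled to [0, 1], and the function name is mine):

```python
import numpy as np

# Per-channel statistics of the ImageNet training set, as shipped with
# common libraries, for RGB images already scaled to [0, 1].
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(img, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """img: (H, W, 3) float array in [0, 1]. Subtract the per-channel
    mean and divide by the per-channel std (broadcast over H and W)."""
    return (img - mean) / std
```

Swapping in statistics computed from your own data set is just a matter of replacing the two constant arrays.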
[00:50:46] So we talked about it before, but this is sort of a common pattern with regularization, where during training time you add some kind of randomness, and then at testing time you average out the randomness. Sometimes this is approximate, but for example, for dropout we saw that during training time we'll randomly drop, say, 50% of the activations, and then at testing time we'll use all the activations but scale them down by the dropout probability p. This is a really common pattern, and it's also used for data augmentation. So you can imagine this cylinder here is your data set: you load an image and a label. So we have a cat label, and we have our original image from the data set, before we actually pass it into our model.
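A minimal sketch of the dropout half of this pattern (numpy; function names are mine). Note one detail: here the test-time scaling uses the keep probability 1 - p so the expectations match; at p = 0.5, as in the example, scaling by p or by 1 - p is the same number.

```python
import numpy as np

def dropout_train(x, p, rng):
    """Training: zero each activation independently with probability p."""
    mask = rng.random(x.shape) >= p
    return x * mask

def dropout_test(x, p):
    """Testing: keep every activation but scale by the keep probability
    (1 - p), so the expected value matches training time."""
    return x * (1.0 - p)
```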
[00:51:34] It's extremely common; basically, in modern deep learning, people will always use data augmentation for training computer vision models. The basic idea is to apply some transformations to the image to make it look different but still recognizable as the category class, and then pass that to your model, where you're computing the loss. One of the nice benefits of this is that you can effectively increase the size of your data set, because instead of seeing each image multiple times, the model will see different versions of the image with different transformations that all still have the same category label. So you can basically get more data, and therefore it increases your generalization capabilities, but your training loss will be higher, because you're not just seeing the same example over and over again. So it makes it harder for the model to just memorize.
[00:52:18] So how do we know the weight initialization is just right? We know it's right in this case because the means and the standard deviations are relatively constant throughout the layers of the network; in one case we saw the activations collapse to zero, and in the other case they were blowing up to infinity as we increased the number of layers. So the way you can ensure it always works is by using the formula; this will always initialize them well, and in practice that's how people do it. If you were creating a new layer that maybe does some different kind of operation that no one's done before, then yeah, you probably would need to try a bunch of different weight initialization schemes and see what works best.
[00:53:05] But generally, for these linear layers or the convolutional layers, you can use this formula here, which is called the Kaiming initialization. Yeah. Okay. So back to data augmentation. What are the different augmentations you can do specifically? One of them is horizontal flipping. This depends on the problem: if you want a model that reads text, this would be a very bad augmentation to use, because now it's like you're looking at the text through a mirror and you can't read it properly. But for everyday objects it's usually pretty good, because most objects are symmetrical, so this property actually works pretty well. And then you could also imagine, if you're looking at images from a microscope or from overhead, that you could also do a vertical flip, and that would make sense.
[00:53:53] But for everyday objects, vertical flipping actually doesn't really make sense, because a cat is almost always seen in this position. But maybe if you had a data set where cats were in all different orientations, you could imagine that flipping or rotating or all these things would make sense for your data set. Another type of augmentation is this resizing and cropping idea. What ResNets and many different image models in deep learning do is basically take a random crop of the image and then resize that to be your image size; they might even take another crop afterwards. So the most common strategy is you pick the length of what is basically the short side of your image.
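A tiny sketch of the flipping augmentation (numpy; the function name and the probability parameter are mine), with vertical flips off by default to match the point about everyday objects:

```python
import numpy as np

def random_flip(img, rng, p=0.5, horizontal=True, vertical=False):
    """Apply each enabled flip with probability p. Vertical flips are
    off by default, since they rarely suit everyday objects."""
    if horizontal and rng.random() < p:
        img = img[:, ::-1]   # mirror left-right
    if vertical and rng.random() < p:
        img = img[::-1]      # mirror top-bottom
    return img
```

For overhead or microscope imagery, where orientation carries no meaning, you would pass `vertical=True` as well.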
[00:54:41] So if your model's input image size is 224 x 224 pixels, you would first pick a value L larger than this (these are commonly used values), and, sorry, you don't crop, you resize the image to that scale. So say this is an 800 x 600 image and we used 256 here: we resize the short side, so 600 would become 256 and 800 would be scaled correspondingly. We scale the short side to L, and then we crop a random patch of 224 x 224 pixels from that image. So you're scaling the image by first preserving the relative resolution, making it smaller or larger to fit this L, and then you take a random crop of that.
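A sketch of this resize-short-side-then-crop recipe using the lecture's 800 x 600 example (numpy; function names are mine, and the nearest-neighbor index resize stands in for a real image-resize routine):

```python
import numpy as np

def resize_short_side(img, L):
    """Scale so the short side equals L, preserving aspect ratio.
    Nearest-neighbor resize via index sampling (a stand-in for a
    proper image-resize routine)."""
    h, w = img.shape[:2]
    scale = L / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    rows = np.clip((np.arange(nh) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(nw) / scale).astype(int), 0, w - 1)
    return img[rows][:, cols]

def random_crop(img, size, rng):
    """Take a random size x size patch."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

rng = np.random.default_rng(0)
img = np.zeros((600, 800, 3))          # the 800 x 600 example, L = 256
resized = resize_short_side(img, 256)  # short side 600 -> 256, long side -> 341
patch = random_crop(resized, 224, rng)
```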
[00:55:35] And this is by far the most commonly used; "random resized crop" is what it's called in most libraries. It's used in most problems because it works pretty well and it preserves the relative resolution of your images. And then there's another neat trick you can do with augmentation, called test-time augmentation. If you really just want to get the best performance possible, you can take a bunch of these different crops and resizes, run them all through your model, and then average your predictions at the end. For ResNets, people will often try a bunch of different scales, a bunch of different crop locations, and maybe even flip the image. Usually you'll start getting diminishing returns, but you can actually get a pretty good 1 to 2% performance boost by using this sort of test-time augmentation.
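A sketch of the test-time-augmentation loop just described (numpy; the function name is mine, and `model` is assumed to be any callable mapping an image patch to a vector of class probabilities):

```python
import numpy as np

def predict_tta(model, img, n_crops, crop, rng):
    """Average predictions over several random crops and their
    horizontal flips; `model` is any callable mapping an image patch
    to a vector of class probabilities."""
    preds = []
    h, w = img.shape[:2]
    for _ in range(n_crops):
        top = rng.integers(0, h - crop + 1)
        left = rng.integers(0, w - crop + 1)
        patch = img[top:top + crop, left:left + crop]
        preds.append(model(patch))
        preds.append(model(patch[:, ::-1]))   # horizontally flipped view
    return np.mean(preds, axis=0)
```

Adding more views (extra scales, the full set of corner crops) follows the same averaging pattern, with diminishing returns as noted.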
[00:56:17] So if you're in a setting where it really matters and you're trying to eke out every last bit of percentage points, then this is actually a really great trick you can use for almost any computer vision problem. Okay. So, a final few augmentations. One is color jitter. Here we're specifically randomizing the contrast and brightness and scaling the image correspondingly, so maybe the colors look more muted or brighter. These are very traditional image processing techniques. And usually, with all these different augmentations, you'll try different values and see which ones make your images still look in-distribution and normal to you as a human.
[00:56:59] And that's a pretty good way to judge what values you should pick for how much jitter you should have, how much brightness variance, etc. So normally, when I'm starting a problem, I'll try a bunch of these different augmentations and see what makes the data look different from the original data but still recognizable to me and still very easy to recognize. That's generally a good set of augmentations to use. The final one: you can imagine just cropping out parts of the image, where you're basically putting a black or a gray box over it. I think this one's maybe less commonly used, but it shows you how you can get creative with the augmentations depending on your problem. Say you're in a setting where things will get covered, where the camera will be occluded, so it won't be able to see the objects fully.
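A sketch of the box-erasing augmentation described here (numpy; the function name is mine, and it is the same idea as the "cutout" / random-erasing family):

```python
import numpy as np

def cutout(img, box, rng, fill=0.0):
    """Blank out a random box x box square of the image, simulating an
    occluded view; returns a copy, leaving the original untouched."""
    out = img.copy()
    h, w = img.shape[:2]
    top = rng.integers(0, h - box + 1)
    left = rng.integers(0, w - box + 1)
    out[top:top + box, left:left + box] = fill
    return out
```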
[00:57:40] This could be a really neat trick to make your model more resilient to stuff blocking parts of your objects. So you could almost imagine, for your given setting: what augmentations make sense? In what ways can you transform your input data so that it's still recognizable to you as a human, but it's harder for the model to memorize the training examples? Okay. So the final set of topics here is basically extremely practical. When you're, say, doing a project or training a model for your course project, I think you should basically do the exact things we're going to be describing in the coming slides. But this also applies outside the course, to any computer vision domain you could be practicing in. So in practice, many times we don't actually have that much data. You know, ImageNet, the original version, had a million images.
[00:58:30] Maybe you don't have a million images for your problem, and almost none of us do, unless you've been collecting vast amounts of data with a huge team. So if you don't have a lot of data, can you still train CNNs? The short answer is yes, you can, but you need to be a little bit smart about how you do it. So I think maybe it was last lecture where we showed how the different filters in your CNN are extracting different types of features. This goes back to what someone asked about the hierarchy of features in convolutional neural networks: at the beginning it's more just edges or patterns or really small shapes.
[00:59:09] And then at the highest level, you can imagine that if we put an image into our CNN and we get this final vector right before the class scores, and we compare that to other images in our data set, you'll actually see that these vector values are really close for similar images. So you can think of this as sort of like the nearest-neighbors thing we did before, but instead of using the pixels of the image, we're looking at the vector at the very end of your CNN, right before the classification layer. So this would be, like, the 4096- or the 2048-dimensional layer. And the difference we look at here is the L2 distance.
[00:59:53] You'll find that for a given image, if you put it into your model and you look at the other images that are close to it in this vector space, right here after you go through all the layers except the last one, the images are extremely close to each other when the items are in the same category. Intuitively, what this basically means is that these features are actually really good: you could build a linear classifier on top of them, or a k-nearest-neighbor classifier, and be able to classify objects extremely well. So how could you use this in practice? What you would do is first train your model on ImageNet, or just grab a model someone else has trained on ImageNet or one of these really large internet-scale data sets, and you can just freeze all of these layers.
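A minimal sketch of the nearest-neighbor lookup in feature space (numpy; the function name is mine, and the features stand in for the penultimate-layer vectors described above):

```python
import numpy as np

def nearest_neighbors(query_feat, bank_feats, k=5):
    """Return the indices of the k feature vectors in `bank_feats`
    closest to `query_feat` in L2 distance, e.g. using the 2048-d
    vector taken right before the classification layer."""
    d = np.linalg.norm(bank_feats - query_feat, axis=1)
    return np.argsort(d)[:k]
```

The only change from the pixel-space nearest neighbors seen earlier in the course is what vector you feed in.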
[01:00:42] So you don't train any of them; you keep them exactly the same as before, and you replace this final layer. Instead of it outputting, in the case of ImageNet, a thousand classes, you replace it with the number of classes you have in your data set. And then, when you're training the model, you only train this layer here. So if we think about it, we talked about how, in the old paradigm of computer vision, you had feature extractors, which were a predefined set of operations to get stuff like color histograms and other predefined features. You can almost think of the frozen model as doing this: it's a predefined feature extractor that we're not changing in any way, but we're using it to calculate features that we then train a model on top of. It's actually extremely similar under that paradigm, because you're not training it here.
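A sketch of training only that final layer on frozen features, i.e. a linear probe (numpy; the function name and the plain gradient-descent softmax regression are my stand-in for whatever optimizer you'd actually use):

```python
import numpy as np

def train_linear_probe(feats, labels, n_classes, lr=0.1, steps=200):
    """Train only a new final layer on frozen backbone features:
    softmax regression by gradient descent. The backbone that
    produced `feats` is never updated."""
    n, d = feats.shape
    w = np.zeros((d, n_classes))            # the single layer we train
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ w
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        w -= lr * feats.T @ (p - onehot) / n          # cross-entropy gradient
    return w
```

If the pretrained features cluster by category, as the L2-distance observation suggests, even this simple probe separates the classes well.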
[01:01:28] And if you have a larger data set, what tends to work best in practice is to actually train the whole model, but initializing it from these values that were pre-trained, say, on ImageNet or some other really large internet-scale data set. So I think for pretty much all of the problems I ever work on, I'm doing this step three here, because I have maybe a million or 10 million training examples. So I'll start with a model that was trained on billions of examples that I don't have the compute for, and then I'll fine-tune the model on my relatively smaller data set, and I'll get better performance than if I just tried to train a model myself, because the model has basically seen more data. That's created a better feature extractor, and then when I fine-tune the whole thing, it can still be specific enough to my problem.
[01:02:14] You're basically taking, say, let's use a very concrete case where we're training a model on ImageNet. We're taking this model and we're replacing the final layer, so that it's no longer outputting a thousand classes; it's outputting, you know, the number of classes in your data set. And we're initializing this randomly, using the Kaiming initialization we talked about before, but the rest of these layers are maintaining the values that they had before. So we're not changing these values, and during gradient descent we're never changing these values. So these values are unchanged.
[01:02:47] We basically take our image, we pass it through our model, and now it's almost like you're just training a linear classifier, where your inputs are these 4096-dimensional vectors for each image that we calculate by passing it through the whole model. Then we have our vector of 4096, and we're just mapping that to the number of classes, and we're only training this mapping at the end. Yeah. So the question is, will you have some bias in your model because it's trained on ImageNet? The answer is definitely. If you do this strategy two, this way of training, then it will do best on data sets that look very similar to ImageNet.
[01:03:28] So these would be like pictures of everyday things like laptops, or maybe a classroom, or a person, things like this, where ImageNet is everyday objects; but if it was, say, photos of Mars, it would do a lot worse. So there's definitely bias based on the training data of the pre-trained model, and you want to get something that is in the same type of distribution, where you're seeing the same kinds of objects or locations or things like that. So the question is, what do you do when your data set is out of distribution? I actually have a slide here to cover some of that, so it's a great question. If you have a very similar data set but very little data, you can use the linear-classifier strategy we just mentioned. If you have a similar data set and quite a lot of data, you'll get the best performance by fine-tuning all the layers.
[01:04:09] These are strategies two and three on the slide that I mentioned earlier. But what about when you have a very different data set? If you have a lot of data, you might just want to start from scratch, or you might get better performance if you still initialize from the pre-trained weights; you would test, but there's no guaranteed way to know whether performance would be better or worse. And then, yeah, if you have very little data and a very different data set, you probably want to try to find a model that's trained on something close. There are specific techniques that researchers have looked into for out-of-domain generalization, this basic idea that you train a model on one domain and you're trying to learn a new domain that's different in some ways.
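The four cases just described can be summarized as a rule-of-thumb lookup (a paraphrase of the slide's quadrants, not its exact wording):

```python
def transfer_strategy(similar_data: bool, lots_of_data: bool) -> str:
    """Rule of thumb for transfer learning, keyed on how similar the new
    data set is to the pre-training data and how much of it you have."""
    if similar_data and not lots_of_data:
        return "freeze the backbone; train a linear classifier on top"
    if similar_data and lots_of_data:
        return "fine-tune all layers, initialized from pre-trained weights"
    if not similar_data and lots_of_data:
        return "train from scratch, or test fine-tuning anyway and compare"
    return "hardest case: look for a pre-trained model on closer data"
```

The upper-right quadrant (different data, little of it) is the one the lecture flags as having no reliably good answer.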
[01:04:47] So this is an active area of research, but I wouldn't say there's a general technique that always works; it's a bit problem-dependent in that setting. Whereas for everything except the upper-right quadrant here, this works pretty well in practice. So there are actually techniques for this, and it's a pretty active area of research, and certain models generalize better; like, I think language models are pretty good at learning a lot of different domains, for example. But yeah, it's definitely the worst scenario to be in, where you have a completely different problem than anyone's ever worked on before and you don't have a lot of data; it's by far the hardest setting to train a model in. So the question is, do you ever do anything between training one final layer and all layers?
[01:05:22] Yeah, people have actually done a lot of work looking into training a subset of the layers. There's also a technique called LoRA, which we might go into in the transformers lecture; I'm not sure if it'll make it this year, but the basic idea is to fine-tune all the layers in a way where you're not changing all the values exactly; instead you're learning basically low-rank differences between the layers, where you're sort of fine-tuning the differences from the original layers rather than fine-tuning the layers themselves. So, yeah, there are techniques: you could use LoRA, and it would need more explanation, but the basic idea is that instead of fine-tuning the actual values, you're fine-tuning these differences between the layers. Sort of like how in a ResNet, you're learning the difference.
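A minimal sketch of that low-rank idea (illustrative only, not the original LoRA paper's or any library's implementation; the layer width `d` and rank `r` are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 512                               # layer width (arbitrary for the sketch)
W = rng.normal(size=(d, d))           # frozen pre-trained weight matrix
r = 8                                 # LoRA rank, much smaller than d

A = rng.normal(size=(r, d)) * 0.01    # trainable
B = np.zeros((d, r))                  # trainable; zero init makes the
                                      # initial update exactly zero

def effective_forward(x):
    # Acts like (W + B @ A) @ x without ever materializing B @ A.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
# Before any training, the low-rank branch contributes nothing,
# so effective_forward(x) equals W @ x.

full_params = d * d                   # fine-tuning W directly: 262144
lora_params = 2 * d * r               # fine-tuning A and B:      8192
```

Here only `A` and `B` would be trained, roughly a 32x reduction in trainable parameters for this layer while still letting every output of the layer change.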
[01:06:07] LoRAs are like that, but you do it with a very small number of parameters. I think the question is, how did they basically decide how many layers to pick? Why did they pick a large number of layers; specifically, why are there two convolution layers of each size instead of one? So it's actually really similar to the example we showed earlier with VGG, where if you have three of these 3x3 convolutions, you're able to have the same receptive field as a 7x7 convolution, but you're able to model more nonlinear relationships, because you have three activation functions rather than just one activation on the 7x7 filter. So basically the stack of 3x3s is more expressive, but you're still looking at the same set of values, as long as you have enough of them. So a larger set of smaller filters is more expressive than a smaller set of larger filters. Okay. Um, okay.
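The receptive-field arithmetic behind that answer is easy to check: for stride-1 convolutions, each k×k layer widens the receptive field by k−1. A tiny helper (illustrative, not from the lecture):

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions:
    each k x k layer widens the field by k - 1."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Three 3x3 convolutions cover the same 7x7 window as one 7x7 conv,
# but with three nonlinearities and 3*(3*3) = 27 weights per channel
# pair instead of 7*7 = 49.
three_small = receptive_field([3, 3, 3])  # 7
one_big = receptive_field([7])            # 7
```

Same window on the input, fewer weights, more nonlinearity: that is the VGG argument in one line of arithmetic.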
[01:06:58] So we'll go on. Um, yeah, basically: try to find a large data set that has similar data, get a model that was trained on that, and then fine-tune it on your own data. Some good links: PyTorch Image Models has a bunch of models that are trained on ImageNet and other data sets, and also just in the PyTorch vision GitHub repo you'll find some too. Okay. I'll talk very briefly at the end about hyperparameter selection. So if you're having difficulty training your model and it's not working right away, I think the best thing you can do is to overfit on a small sample. This is like the default debugging strategy in deep learning, where you just have one data point and you want to see your training loss basically go to zero.
[01:07:36] Your model should be able to memorize one training example, and if it's not able to do that, you have a bug somewhere in your code, or you're not picking the right kind of model to model your problem. So this is a really good training problem, and it'll also tell you what learning rates work and which ones don't, and you'll get a rough idea of the neighborhood of learning rates you should explore. So this is a good way to make sure your model is correct, make sure your learning rate is reasonable, and also make sure you don't have any other bugs that could be impacting your code. So this is always step one if you're having issues just running some code; this is how you debug. The second thing you would want to do after you get this is maybe to try a very coarse grid of hyperparameters.
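That sanity check can be demonstrated on the smallest possible "model": a linear regressor and a single (input, target) pair. Everything here (sizes, learning rate) is an arbitrary stand-in for your real network and a batch of size one:

```python
import numpy as np

# The "overfit one example" sanity check, in miniature.
rng = np.random.default_rng(0)
x = rng.normal(size=3)        # one training input
y_target = 2.0                # its label
w = np.zeros(3)               # model parameters

losses = []
for _ in range(1000):
    pred = w @ x
    loss = (pred - y_target) ** 2
    losses.append(loss)
    grad = 2.0 * (pred - y_target) * x   # exact gradient of the loss
    w -= 0.05 * grad                     # plain gradient descent

# A correct model/loss/gradient drives this loss to ~0. If it does
# not, suspect a bug before touching any other hyperparameter.
```

If the loss plateaus far from zero here, the problem is the code (model, loss, gradient, or a wildly wrong learning rate), not the data.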
[01:08:16] So I would first try different learning rates and see, if you train the model with different learning rates, what the training losses look like. You want the one that has the most sustained decrease in the training loss over maybe one epoch; that's a pretty good estimate, but you can train for longer. Once you get a good set of learning rates, you could then look into other hyperparameters too. And specifically, besides the loss, you'll also want to look at the accuracy curves. So you have your training accuracy and your validation accuracy. If they're both still going up, it means you want to keep training; pretty reasonable. But you might have a scenario where the training accuracy is going up while your validation accuracy is going down: this is overfitting. So we need to either increase the regularization, or, if we can get more data, that could also work.
[01:09:01] But you need to do one of the two in order to improve your performance further beyond the peak right here, which I guess would be the best model you have so far. If you're seeing very little of a gap here, then you can probably train the model for longer, because generally you do want to get to the point where your validation accuracy has been maximized. So if you could just keep training, you know, you could keep training. So even if there's not a significant gap here or anything, if you see the validation accuracy is similar to the training accuracy, you can probably keep training until your training accuracy starts diverging from your validation accuracy, and you can basically repeat this process over and over again.
[01:09:45] Um, one final note in terms of hyperparameter search: normally people think, you know, you have two or more hyperparameters that you're searching over; should you try every combination of the hyperparameters, or what is the best way to do it? I think in practice, a random search over the hyperparameter space works a lot better than a grid search where you're trying every combination of a predefined set. And the reason is mainly because you can imagine you have one axis which is an unimportant hyperparameter, where depending on the value your performance will be roughly the same, versus an important one. If you use random values across all of these, you'll actually search the space of your important hyperparameter much more thoroughly, whereas with the grid search you're sort of wasting time rechecking multiple values of an unimportant hyperparameter.
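The grid-versus-random contrast can be made concrete: with the same budget of nine runs, a 3×3 grid tests only three distinct values of each hyperparameter, while random sampling tests nine. The ranges below are illustrative, sampled log-uniformly since these hyperparameters span orders of magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid search with a budget of 9 runs: only 3 distinct values of each
# hyperparameter ever get tried.
lrs = [1e-4, 1e-3, 1e-2]
wds = [1e-5, 1e-4, 1e-3]
grid_trials = [(lr, wd) for lr in lrs for wd in wds]

# Random search with the same 9-run budget: 9 distinct values of each,
# so the important axis gets probed far more thoroughly.
random_trials = [(10 ** rng.uniform(-4, -2),    # learning rate
                  10 ** rng.uniform(-5, -3))    # weight decay
                 for _ in range(9)]

distinct_grid_lrs = len({lr for lr, _ in grid_trials})      # 3
distinct_random_lrs = len({lr for lr, _ in random_trials})  # 9
```

If the learning rate matters and the weight decay barely does, the random trials have effectively spent all nine runs exploring learning rates, while the grid spent them on three.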
[01:10:26] So in practice, you should define the ranges you want to try and then just randomly sample hyperparameter values from those ranges; that's probably the best way to do it, and you just keep running until you get the best model. Okay, that's it. So we talked about layers in CNNs, activation functions, CNN architectures, and weight initialization: how you actually define and build these models. And then we talked about how you actually train them: how you change your data to be input to the model, how you augment it; transfer learning, which is a really neat trick for improving performance; and then how you pick the best hyperparameters. So, yeah, we covered a lot in lecture today. Thank you all so much. Uh, yeah.
================================================================================ LECTURE 007 ================================================================================ Stanford CS231N | Spring 2025 | Lecture 7: Recurrent Neural Networks Source: https://www.youtube.com/watch?v=kG2lAPBF7zA --- Transcript [00:00:05] So hello everyone, welcome to lecture 7. Um, I also wanted to go over some clarifications from last time. When I gave lecture last time, there were two Ed posts that I think were good that you all might want to check out, but in case you haven't seen them, I'll just go through them really quickly. I think when describing dropout, how to scale probabilities at test time, there was a bit of confusion during lecture, and basically what I said and the slide had a sort of mismatch. So in each forward pass for dropout, we have this hyperparameter p, which is either the fraction of neurons you're dropping out or the fraction of neurons you're keeping, depending on which implementation of dropout you're using.
[00:00:48] Generally it's the fraction you drop out; in most libraries, that's what p means. But the basic idea is that at test time you want the expected output to be the same as at training time. So this means that if you dropped 25% of your activations during training, at test time you would scale by 0.75 so that the expected output is the same. And I think there was a bit of confusion because in this slide, the implementation uses p as the probability of keeping a unit active, so there's a bit of a mismatch there; just to clarify. There was also a question in class from last time about how normalization can be useful and maybe resolve the issues that arise when you have weights that are initialized incorrectly. We had this toy setting where we have 2D inputs to our model and a two-layer neural network with ReLU.
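The 25%-drop / 0.75-scale arithmetic can be checked numerically. A sketch using the convention from the lecture, where p is the fraction dropped (note that most modern implementations instead use "inverted dropout", scaling by 1/(1−p) at training time so that test time needs no scaling at all):

```python
import numpy as np

rng = np.random.default_rng(0)

p_drop = 0.25                     # fraction of activations dropped
acts = rng.random(100_000)        # toy non-negative activations

# Training time: zero out a random 25% of the activations.
keep_mask = rng.random(acts.shape) >= p_drop
train_out = acts * keep_mask

# Test time: keep every unit but scale by the keep probability 0.75,
# so the expected output matches training time.
test_out = acts * (1.0 - p_drop)

# Averaged over many units, the two agree up to sampling noise.
gap = abs(train_out.mean() - test_out.mean())
```

The point of the scaling is exactly this equality of expectations: without the 0.75 factor, every unit's test-time output would be about a third larger than what the next layer saw during training.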
[00:01:41] It's outputting basically this quadrant function: if the point lies in the top right, it'll output one, or two or three or four, depending on which quadrant the point lies in. And we plot the different training losses and test losses for good initialization, using the Kaiming initialization we discussed last time, and bad initialization, where the standard deviation is too high. The blue plot here represents bad initialization, and the green represents bad initialization with layer norm. So you can see it actually does resolve a lot of the issues, but to get the best performance you still need good weight initialization, which is what the two lines afterwards show. So you can go dive in; and also, whether or not layer norm helped depends on the problem. So in this quadrant example, you can imagine that you don't need to know the exact 2D position of each point.
[00:02:26] So layer norm was actually helping, but for some of the other functions that are in the code you can check out, where you need to know the exact coordinate in order to get the right output, layer norm actually hurts performance, because you lose some information about the exact spatial location of your input when you're doing this subtraction of the mean and dividing by the standard deviation. So just some notes here: basically, at a high level, it does help with the issue, but a gap remains, so you can't get past this weight initialization issue with just normalization. And as I mentioned, it may not always make sense, depending on what you're trying to model. So I think, just to recap from last time, we've been mainly talking about these sort of vanilla, standard, non-recurrent neural networks so far. So this is a fixed-size input and a fixed-size output.
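The earlier point about losing exact spatial location can be seen in two lines. This toy applies the normalization directly to a 2D input (the lecture's layer norm sits inside the network, but the information-loss mechanism is the same): subtracting the mean and dividing by the standard deviation maps distinct inputs to the same output.

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    # Normalize one example across its feature dimension.
    return (v - v.mean()) / (v.std() + eps)

# Two points in *different* quadrants of the plane...
a = layer_norm(np.array([1.0, 2.0]))    # quadrant 1
b = layer_norm(np.array([-1.0, 2.0]))   # quadrant 2

# ...normalize to (almost exactly) the same vector [-1, 1]: the
# relative order of the coordinates survives, but the absolute
# positions do not.
```

This is why normalization can hurt on tasks where the exact coordinates carry the answer, while it is harmless (or helpful) when only relative structure matters.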
You have this one-time setup where [00:03:14] you set your activation functions. You [00:03:16] do data pre-processing according to some [00:03:18] fixed mean and standard deviation for [00:03:20] the image channels. You [00:03:22] have your weight initialization and [00:03:25] normalization functions that you use, as [00:03:28] well as transfer learning. So if you [00:03:31] pre-train on one data set like ImageNet [00:03:33] or some other large-scale internet data [00:03:35] set, you can get better results if you [00:03:36] initialize your weights to those values. [00:03:38] We also talked about training dynamics, [00:03:40] how you can babysit the learning process [00:03:42] by choosing a good learning rate, how [00:03:44] you want to update your different [00:03:45] hyperparameters and also how to optimize [00:03:47] those based on the validation [00:03:49] performance, as well as test-time [00:03:51] augmentation to improve performance [00:03:53] further.
So, a really good tool for points two and three here is [00:03:57] something I use in basically all my [00:03:59] projects, called Weights & Biases. So, [00:04:00] you might find this useful. It's a [00:04:02] really neat way that you can [00:04:04] essentially look at different runs you set [00:04:07] with different [00:04:08] hyperparameters. In this case, they [00:04:10] show a dropout column here. So, these [00:04:12] are all the different values of dropout. [00:04:14] The color coding is really nice. So, [00:04:15] you can see that generally the lower [00:04:17] values of dropout will achieve higher [00:04:19] accuracy. And so you can visualize [00:04:22] these different hyperparameters based on [00:04:26] validation set performance, and you [00:04:29] can, based on [00:04:32] many runs, get an idea of which [00:04:34] hyperparameters work best.
So I always use this; I think it's great, especially [00:04:37] if you have the compute where you can [00:04:38] just run something over and over again [00:04:40] to improve performance more. This is a [00:04:41] really neat way of visualizing it. I [00:04:42] think they do it well. There are other [00:04:43] tools like TensorBoard, but this is [00:04:46] personally the one that I like. [00:04:49] Okay. So for the rest of lecture [00:04:52] today, we'll be discussing sequence [00:04:54] modeling. So this is in contrast to a [00:04:57] fixed-size input to our [00:05:00] model. What if we have a sequence of [00:05:01] variable length? And also we'll be [00:05:04] discussing the sort of simple [00:05:06] neural networks that people used before [00:05:08] the era of transformers, which mainly [00:05:10] consist of RNNs and some variants of [00:05:12] RNNs. And then I'll also relate, in one [00:05:15] slide, how RNNs actually are similar to [00:05:18] and inspired a lot of the modern [00:05:21] type of language models that you see [00:05:23] called state space models.
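As a rough illustration of the kind of sweep such a tool visualizes, here is a toy sketch. The `val_accuracy` function is a made-up stand-in for a real training run (its lower-dropout-scores-higher shape just mirrors the pattern described above); in practice you would call `wandb.log` inside each training run rather than collecting a list by hand.

```python
def val_accuracy(dropout):
    # Synthetic stand-in for a real training run: here, lower dropout
    # happens to score higher, mirroring the dropout column described above.
    return 0.9 - 0.3 * dropout

# Try several dropout values and record (hyperparameter, score) pairs --
# this is the table a tool like Weights & Biases would plot and color-code.
runs = [{"dropout": d, "val_acc": val_accuracy(d)}
        for d in [0.0, 0.1, 0.3, 0.5, 0.7]]

# Pick the configuration with the best validation performance.
best = max(runs, key=lambda r: r["val_acc"])
```
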
So you might have heard of Mamba; there are some other [00:05:26] ones too that we'll talk about in the [00:05:28] slide, but the basic ideas, the key [00:05:30] concepts from RNNs, are still being used [00:05:32] today. They're not just a thing of the past, [00:05:34] and they have a lot of nice advantages [00:05:36] over transformers that we'll go into. [00:05:39] Cool. So to specifically formulate this [00:05:42] sequence modeling task, you can imagine [00:05:44] we have a vanilla neural network where [00:05:46] we have one fixed-size input to one [00:05:48] fixed-size output, which is what we've [00:05:50] discussed in the course so far. In [00:05:52] contrast, you could have a one-to-many [00:05:54] sequence modeling task. So here we [00:05:57] still have a fixed-size input, like say [00:05:58] an image, but we want to output a [00:06:00] sequence of variable length. So one [00:06:02] common example is image captioning.
So we input an image and we want to output [00:06:06] a sequence of words or characters or [00:06:08] however you're modeling the language [00:06:11] or encoding it, but the goal is to have a [00:06:13] variable-length caption output for [00:06:15] what's happening in the image. You could [00:06:18] also have a many-to-one sequence [00:06:20] modeling task. So here we could imagine [00:06:22] our inputs are, say, a video, and we're [00:06:25] trying to classify what this is a video [00:06:27] of. So we give it a sequence of video [00:06:30] frames and the output is one single [00:06:32] class label, similar to the image [00:06:34] classification case, but now we have [00:06:35] multiple frames as input rather than [00:06:37] just a single image. So this is an [00:06:39] example of many-to-one. Then you also [00:06:41] have many-to-many. So the number of [00:06:46] inputs and outputs in the sequences [00:06:48] don't need to match.
So your input could be a variable number of [00:06:52] frames and your output in this case [00:06:54] could be a caption of variable length, [00:06:56] and they don't necessarily need to [00:06:57] match. But they could match. So you [00:06:59] could have, for every single input, [00:07:01] one output. And for discussing [00:07:03] RNNs, we'll mainly be focusing on this [00:07:05] setting on the far right, but there are [00:07:06] basically a lot of small changes you can [00:07:09] make to reformulate the [00:07:11] problem to apply to the other settings. [00:07:13] But this is sort of the most [00:07:14] straightforward one: every time there's [00:07:15] an input, there's an output. And we'll be [00:07:17] using it for the beginning of class to [00:07:19] talk about how RNNs work, [00:07:22] and a canonical example problem here [00:07:24] would be video classification where [00:07:25] you're classifying every single frame. [00:07:28] Okay, so what is an RNN? The basic [00:07:31] idea is you have an input sequence X and [00:07:35] an output sequence Y.
And what makes an RNN an RNN is this recurrent nature. So [00:07:41] often people will diagram it by this [00:07:43] sort of arrow that's feeding back into [00:07:45] the block. This is how you know it's [00:07:47] sort of like a recurrent layer when [00:07:48] you're reading different diagrams. But [00:07:50] what it actually means is that RNNs [00:07:52] have this internal state, or hidden [00:07:55] state as it's often called, that is [00:07:57] updated as a sequence is processed. So [00:07:59] every time there's a new input to the [00:08:01] model, we process that and we calculate [00:08:03] a new hidden state or internal state. So [00:08:06] there's a hidden state; it updates, and [00:08:07] it depends on the new inputs as well as [00:08:10] the previous internal or hidden state. [00:08:13] I think this diagram is sometimes a [00:08:16] bit confusing when you're trying to [00:08:17] think about how the gradients are [00:08:18] actually calculated and what the [00:08:19] order of operations is. So people will [00:08:21] often do this diagram of an unrolled [00:08:25] RNN.
And so here it's basically the same as before, but we're explicitly showing [00:08:30] that the current hidden state [00:08:32] calculation is dependent on our input at [00:08:34] that time step as well as the previous [00:08:37] RNN state. So we're more explicitly [00:08:40] modeling what is exactly needed to [00:08:42] calculate each output and each RNN state, and you [00:08:44] move backwards in the computational [00:08:46] graph. [00:08:48] So I've been speaking in words so far. [00:08:51] So let's formulate this with [00:08:52] mathematical equations now. So the basic [00:08:54] idea is we're trying to process the [00:08:58] sequence of vectors x, and we're applying [00:09:00] this recurrence formula at every single [00:09:02] time step. So we have our new hidden [00:09:05] state as a function of the old hidden [00:09:08] state and the input vector at some time [00:09:10] step, and we have a function, [00:09:13] normally with an activation function, [00:09:14] along with some parameters W.
So you can think of this as very similar to [00:09:19] the sort of initial neural network [00:09:21] layers we were learning, where it's a [00:09:22] weight matrix multiply, and then [00:09:25] you follow it up with an activation [00:09:27] function. This is the same thing here. [00:09:29] The only change is that it's now a [00:09:30] recurrence formula. So we're using [00:09:34] the same set of Ws and the same [00:09:38] activation function each time we're [00:09:40] computing the hidden state. [00:09:43] So basically, as I mentioned, this [00:09:46] is a recurrence formula. And to get [00:09:49] the actual output, how do we calculate [00:09:50] this blue block? We have a separate [00:09:53] function that depends on a separate set [00:09:56] of parameters that converts our [00:09:59] hidden-dimension state into the dimension of [00:10:03] our output, and it also is a set of [00:10:05] weights to convert the hidden state to [00:10:06] the output. So this does two things in one.
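Written out, the two functions described above are usually given as follows (using the conventional names W_hh, W_xh, and W_hy for the three weight matrices, and σ for the activation function; this is the standard vanilla-RNN notation rather than something stated verbatim here):

```latex
h_t = f_W(h_{t-1}, x_t) = \sigma\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right),
\qquad
y_t = W_{hy}\, h_t
```

Here the same W_hh and W_xh are reused at every time step of the recurrence, and the separate matrix W_hy maps each hidden state to an output.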
It sort of changes the dimension of our [00:10:10] vectors from the dimension size of our [00:10:11] hidden state, which can be whatever we [00:10:12] want, to the dimension size of our output, [00:10:15] and then also it provides a [00:10:16] transformation there. So, W_hy is a [00:10:20] weight matrix that you will multiply by [00:10:22] your hidden state to get the output. So it does [00:10:25] two things. It converts your hidden [00:10:26] state to the dimension of your output. [00:10:28] So, your hidden state and output could [00:10:29] be different dimensions. And then also [00:10:31] it's a weight matrix [00:10:34] that you learn. So, not only does it do [00:10:36] this dimension change, but also it [00:10:37] applies a transformation to your hidden [00:10:40] state. So, it's how you convert your [00:10:41] hidden states to your outputs. That's what W_hy [00:10:44] is. So the previous slide was how we [00:10:46] calculate the new hidden state.
So it's essentially the same idea, [00:10:51] where you're doing this recursively [00:10:53] with the same set of parameters, but we [00:10:54] have one set of parameters and one [00:10:56] function for calculating the hidden [00:10:57] state. We have another set of parameters [00:10:59] and another function for calculating the [00:11:00] output, depending on what type of task it [00:11:02] is and how we want to model the RNN. [00:11:06] Yeah. So they still share the same [00:11:08] weights for each time step. But [00:11:10] there are two different things here. One [00:11:11] is to calculate, and [00:11:13] maybe it'll be more clear as we go [00:11:14] through more concrete examples, but [00:11:16] how do you actually calculate the new [00:11:18] hidden state, which is this internal [00:11:20] state of the RNN, and then how do you [00:11:21] convert that hidden state to the output, [00:11:23] which is this slide. [00:11:26] Okay. So looking through this [00:11:28] unrolled diagram here, we can see [00:11:32] that you sort of need to initialize your [00:11:35] hidden state to some value.
So we usually call this h0, and you can [00:11:40] initialize it to whatever you want in [00:11:42] principle; usually this is a learned [00:11:45] input vector. But now we'll [00:11:48] specifically go into each step of this [00:11:51] unrolled RNN and actually [00:11:53] go through a concrete example for what [00:11:55] it looks like when you're doing the [00:11:56] forward pass. [00:11:58] So one thing to note, that already [00:12:00] came up with some of the questions, is [00:12:02] that we're processing the sequence of [00:12:04] vectors x and we're applying this [00:12:06] recurrence formula at each time step. So [00:12:09] really do notice how the same [00:12:11] function and the same set of parameters [00:12:12] are used at every time step when [00:12:14] computing the hidden state, and a [00:12:16] separate function and a separate set of [00:12:17] parameters are always used at each time [00:12:19] step when predicting the output from the [00:12:21] hidden state. Yes. So can old values of [00:12:24] y affect the new hidden state?
Under some formulations, yes. And we'll [00:12:27] actually go through one example of why [00:12:29] that's used. It's most commonly used if [00:12:30] you want to predict the next value, like if [00:12:34] you're doing a language modeling or [00:12:36] autoregressive modeling task where you're [00:12:37] trying to predict one value given the [00:12:39] previous values. People will just use [00:12:41] the previous values as the input. So [00:12:43] that's generally how people do that [00:12:45] explicit formulation of how y can [00:12:47] affect the next hidden state. What is [00:12:49] the difference between h and x at the [00:12:51] first time step? So they use [00:12:54] basically different weights. So [00:12:58] h0 is using the [00:13:01] weights that are used to update every [00:13:03] hidden state to the next one. Whereas [00:13:06] we'll go through exactly what the [00:13:07] weights look like, but basically [00:13:09] they're using different weights, is the [00:13:10] short answer.
Okay, so when people say vanilla RNN, they usually are almost [00:13:15] exactly referring to this type of model, [00:13:18] where we have our hidden state h_t, [00:13:21] which uses tanh, or hyperbolic [00:13:24] tangent, as an activation function. This [00:13:26] is nice because it's bounded between one [00:13:28] and negative one. So as you do the [00:13:31] operation over and over again, your [00:13:33] values will stay within this range. [00:13:35] So this is a nice property to have. It's [00:13:37] also zero-centered, and you can represent [00:13:38] both positive and negative values. This [00:13:40] is why people use tanh. Also, we [00:13:45] sometimes have an output function f_y [00:13:48] here, but in the simplest case your [00:13:50] output y_t could just be a matrix [00:13:51] multiply with your hidden state. [00:13:53] So this is really the most simple [00:13:55] formulation of an RNN. And what we'll [00:13:58] specifically go through in our concrete example [00:14:00] today in lecture is this idea of just [00:14:04] manually creating a recurrent [00:14:06] neural network.
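A minimal NumPy sketch of this vanilla RNN forward pass (the dimension sizes and the Gaussian weight scale here are arbitrary illustration choices, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h, D_out = 4, 3, 2           # input, hidden, output sizes (arbitrary)

# One set of weights, reused at every time step, as emphasized above.
W_xh = rng.normal(size=(D_h, D_in)) * 0.1   # input -> hidden
W_hh = rng.normal(size=(D_h, D_h)) * 0.1    # hidden -> hidden (recurrence)
W_hy = rng.normal(size=(D_out, D_h)) * 0.1  # hidden -> output (readout)

def rnn_forward(xs, h0):
    """Process a sequence of input vectors, emitting one output per step."""
    h, ys = h0, []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)   # recurrence: new hidden state
        ys.append(W_hy @ h)                # simplest readout: matrix multiply
    return ys, h

xs = [rng.normal(size=D_in) for _ in range(5)]
ys, h_final = rnn_forward(xs, np.zeros(D_h))
```

Because tanh is bounded, every entry of the hidden state stays in (-1, 1) no matter how long the sequence is, which is the stability property mentioned above.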
So, we're not going to [00:14:07] learn this through gradient descent or [00:14:09] all these different methods. I'm just [00:14:11] going to show you how you [00:14:13] could construct one by hand, and we'll go [00:14:15] through it and you'll understand the [00:14:17] forward pass, what each of the different [00:14:19] weight matrices is doing, as well as how [00:14:21] the output is calculated. [00:14:23] So, in this really toy example, because [00:14:25] it needs to be pretty simple if we're [00:14:26] just going to be going through all [00:14:27] the different weights, you're given a [00:14:29] sequence of zeros and ones, and [00:14:32] your goal is to output a one when [00:14:34] there are two repeated ones in a row. So [00:14:36] you're basically detecting repeated [00:14:38] ones, and you'll output a zero [00:14:40] otherwise. So you can see this input [00:14:42] sequence coming in: 0 1 0 1. So far [00:14:45] there have been no repeated ones. But now [00:14:47] we have a repeated one.
Then we have another repeated one, because there are two [00:14:50] in a row here, and so on. So this is the [00:14:53] type of model we're building; it's [00:14:54] trying to do this task. This is [00:14:56] specifically the many-to-many sequence [00:14:58] modeling task, where we have one output [00:14:59] for every input. [00:15:02] And so we've kind of been talking [00:15:03] at a high level so far, but if you're trying [00:15:05] to create an RNN to do this, what [00:15:07] information should be captured in the [00:15:10] hidden state? So you have this internal [00:15:12] state of your model; what information [00:15:14] needs to be captured there in order to [00:15:15] do this task? [00:15:17] Yeah. So the input at the previous time [00:15:19] step. And if our output is only dependent [00:15:21] on the hidden state, what else do we [00:15:23] need to know? And the current value, yeah. [00:15:24] Yeah, exactly. So this is the [00:15:27] information that we need to capture in [00:15:28] our hidden state: the previous input [00:15:31] and the current value for x, so 0 or [00:15:33] 1.
And the way I'll do this is I'll just set the hidden state h_t to be a [00:15:37] three-dimensional vector. The reason why [00:15:39] it's three is this one will come in [00:15:40] handy when we're trying to do the [00:15:42] output-stage calculation, but you could [00:15:45] probably construct one without a one [00:15:47] here. This is just to make the math [00:15:49] easy and simple for the purposes [00:15:51] of the lecture today. And the other [00:15:52] information is the current value. So [00:15:54] this will either be zero or one, along [00:15:56] with the previous value, 0 or 1, and [00:15:58] we'll initialize it to be 0 0 1, so that [00:16:01] we're basically assuming it's [00:16:03] seen two zeros in a row before [00:16:04] this point. Yeah, so these are [00:16:07] the type of [00:16:11] variables we're trying to track in our [00:16:13] hidden state, and this is how we'll [00:16:14] initialize h0. So I talked about how you [00:16:15] can initialize it with various different [00:16:17] strategies, or you could learn it. This [00:16:19] is what we'll initialize it to.
Okay, [00:16:22] now let's walk through the code, and I'll do it step by [00:16:25] step. So, I'm just putting it on screen [00:16:27] here right now. [00:16:30] I guess, sorry, one other thing I [00:16:32] missed on this slide is that we're [00:16:33] setting our activation function to be [00:16:34] ReLU, just to make the math easy. So, [00:16:36] it'll just be the max of zero or whatever [00:16:39] the value is. We're only dealing [00:16:40] essentially with zeros and ones in this [00:16:41] case, so it makes it pretty simple to [00:16:43] think about. Yeah, you probably could [00:16:46] construct it so that it works with tanh, [00:16:48] but this is just something that I [00:16:49] created as an example for how to run it. [00:16:52] And so, just to make the math really [00:16:53] easy, we'll just do ReLU. But yeah, [00:16:55] you could conceivably make a model that [00:16:57] could do this with tanh. Yeah, [00:17:02] cool. So we have ReLU. We have two [00:17:05] specific weights here.
We have the first weight which um converts our uh previous [00:17:12] weight which um converts our uh previous hidden state. Uh it it applies a [00:17:15] hidden state. Uh it it applies a transformation to the previous hidden [00:17:16] transformation to the previous hidden state onto the sort of to calculate the [00:17:19] state onto the sort of to calculate the next one. And then we have this weight [00:17:21] next one. And then we have this weight here which um converts our input x to [00:17:25] here which um converts our input x to the dimension of our hidden state as [00:17:26] the dimension of our hidden state as well as applies a transformation. So we [00:17:29] well as applies a transformation. So we are setting this second one. So our our [00:17:31] are setting this second one. So our our our current hidden state is a function [00:17:32] our current hidden state is a function of the previous hidden state as long [00:17:35] of the previous hidden state as long along with the current time step. And so [00:17:37] along with the current time step. And so when we're trying to calculate this uh [00:17:40] when we're trying to calculate this uh hidden state at time step t, we're [00:17:43] hidden state at time step t, we're looking to calculate this current value [00:17:46] looking to calculate this current value first. So we'll use the x value here. [00:17:49] first. So we'll use the x value here. 
We'll set the weight to be a 3x one [00:17:52] We'll set the weight to be a 3x one column vector um with values 1 0 0 such [00:17:56] column vector um with values 1 0 0 such that when x is zero and we do the matrix [00:18:00] that when x is zero and we do the matrix multiply we get 0 vector and when x is 1 [00:18:04] multiply we get 0 vector and when x is 1 we'll get 1 0 0 and we'll add this to [00:18:07] we'll get 1 0 0 and we'll add this to another term but basically this is going [00:18:08] another term but basically this is going to be calculating what is the current [00:18:10] to be calculating what is the current value here. So it'll be either uh zero [00:18:13] value here. So it'll be either uh zero on top or a one on top and it's [00:18:15] on top or a one on top and it's calculated based on this first operation [00:18:18] calculated based on this first operation here. [00:18:20] here. Okay. Um, so that's how we're [00:18:22] Okay. Um, so that's how we're calculating the current value based on [00:18:25] calculating the current value based on the uh input. Now we'll talk about uh [00:18:29] the uh input. Now we'll talk about uh you know how are we doing this hidden [00:18:31] you know how are we doing this hidden state transformation. So we want to just [00:18:33] state transformation. So we want to just use the current value for this top value [00:18:35] use the current value for this top value here. So in our weight matrix we'll just [00:18:37] here. So in our weight matrix we'll just have zeros in the top row. This means [00:18:39] have zeros in the top row. This means that when we multiply it with the [00:18:40] that when we multiply it with the previous hidden state we'll get a zero [00:18:42] previous hidden state we'll get a zero value here for the top. So it'll be 0 [00:18:44] value here for the top. So it'll be 0 plus whatever value the right hand side [00:18:46] plus whatever value the right hand side contains. 
So that's how we're going to [00:18:48] contains. So that's how we're going to maintain this not changing based on the [00:18:50] maintain this not changing based on the previous hidden state. And we'll set it [00:18:52] previous hidden state. And we'll set it to be 1 0 0 for the next row. Why we do [00:18:56] to be 1 0 0 for the next row. Why we do this is you can imagine we have the [00:18:58] this is you can imagine we have the hidden state from the previous time step [00:19:00] hidden state from the previous time step here. And we want to set the uh now [00:19:04] here. And we want to set the uh now previous to be the former current time [00:19:06] previous to be the former current time step. So we have a 1 0 0. What this will [00:19:08] step. So we have a 1 0 0. What this will do is it'll multiply by htus one. We'll [00:19:11] do is it'll multiply by htus one. We'll set the current value over to now the [00:19:14] set the current value over to now the previous value for this time uh step. So [00:19:17] previous value for this time uh step. So basically this term will be a zero on [00:19:19] basically this term will be a zero on top and it will be whatever the previous [00:19:21] top and it will be whatever the previous time step uh input value was as the [00:19:24] time step uh input value was as the second term and then this final bit here [00:19:26] second term and then this final bit here just maintains the one so that we're [00:19:28] just maintains the one so that we're keeping this one across all [00:19:29] keeping this one across all calculations. Um so just to recap we [00:19:32] calculations. Um so just to recap we have zeros here because we want the [00:19:34] have zeros here because we want the right hand side term to be tracking this [00:19:37] right hand side term to be tracking this uh current value. We have a one here to [00:19:40] uh current value. 
We have a one here to copy over the current from the former [00:19:42] copy over the current from the former time step to be the previous uh sorry [00:19:44] time step to be the previous uh sorry the the to to copy the current of the [00:19:46] the the to to copy the current of the former time step to be the previous of [00:19:48] former time step to be the previous of the current time step. Uh so we're just [00:19:50] the current time step. Uh so we're just doing you know h uh maybe it's easy in [00:19:53] doing you know h uh maybe it's easy in the code but uh you know ht previous is [00:19:56] the code but uh you know ht previous is equal to ht then we want to also move [00:19:58] equal to ht then we want to also move the corresponding value down one here [00:20:01] the corresponding value down one here and then this is just a copy of the one. [00:20:03] and then this is just a copy of the one. Um so how do we actually get our output [00:20:05] Um so how do we actually get our output now? So we tal we basically talked about [00:20:07] now? So we tal we basically talked about how we can track these values given the [00:20:10] how we can track these values given the weight matrices I talked about. So whhh [00:20:13] weight matrices I talked about. So whhh and w xh. So if we have a weight matrix [00:20:16] and w xh. So if we have a weight matrix to convert our hidden state into the [00:20:19] to convert our hidden state into the output dimension we want it to be uh [00:20:22] output dimension we want it to be uh 1x3. So it's uh single value that's [00:20:25] 1x3. So it's uh single value that's being output when we have this hidden [00:20:26] being output when we have this hidden dimension as input. And this is sort of [00:20:29] dimension as input. And this is sort of like a dotproduct between the values [00:20:31] like a dotproduct between the values here and the values here. So what this [00:20:33] here and the values here. 
So what this will correspond to is the current plus [00:20:36] will correspond to is the current plus the previous minus1 minus one because we [00:20:39] the previous minus1 minus one because we multiply the minus1 here. This is where [00:20:40] multiply the minus1 here. This is where the one became useful and uh the current [00:20:45] the one became useful and uh the current associated here with a one and then also [00:20:47] associated here with a one and then also the the previous associated here with a [00:20:49] the the previous associated here with a one as well. Um so that's how we [00:20:52] one as well. Um so that's how we actually do it. And if you if you think [00:20:53] actually do it. And if you if you think about it um this general formula will [00:20:57] about it um this general formula will work. So uh if we have say we're looking [00:21:00] work. So uh if we have say we're looking here we have the current plus the [00:21:02] here we have the current plus the previous is 2 minus one is one for this [00:21:05] previous is 2 minus one is one for this uh left hand term inside the ru so the [00:21:07] uh left hand term inside the ru so the max of one and 0 is one and if these are [00:21:10] max of one and 0 is one and if these are both zero you'll have a minus one so [00:21:12] both zero you'll have a minus one so we'll get zero these are a one and a [00:21:14] we'll get zero these are a one and a zero then you'll still get zero so these [00:21:17] zero then you'll still get zero so these are how you can construct these weight [00:21:19] are how you can construct these weight matrices but I actually wanted to pause [00:21:21] matrices but I actually wanted to pause briefly um and talk about if there were [00:21:24] briefly um and talk about if there were any questions about any step among this [00:21:26] any questions about any step among this calculation [00:21:27] calculation because this is the only example we'll [00:21:30] because this is the only 
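As a sanity check, the whole toy construction above fits in a few lines of numpy. The weight values are exactly the ones just described; the variable names (including calling the output weight W_hy) are my own labels, not from the slide:

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

# Hidden state layout: [current value, previous value, constant 1]
h = np.array([0.0, 0.0, 1.0])           # h0: as if two zeros were seen

W_xh = np.array([[1.0], [0.0], [0.0]])  # puts the input bit in the top slot
W_hh = np.array([[0.0, 0.0, 0.0],       # top row zeros: current comes only from x
                 [1.0, 0.0, 0.0],       # copy last step's current -> previous
                 [0.0, 0.0, 1.0]])      # carry the constant 1 along
W_hy = np.array([[1.0, 1.0, -1.0]])     # output = current + previous - 1

ys = []
for x in [1, 1, 0, 1]:
    h = relu(W_hh @ h + W_xh @ np.array([float(x)]))
    y = relu(W_hy @ h)                  # fires only when both bits are 1
    ys.append(int(y[0]))

print(ys)  # [0, 1, 0, 0]
```

The output is 1 exactly at the step where the current and previous input bits are both 1, matching the hand calculation above.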
[00:21:30] This is the only example we'll go through in class where we're literally doing all the matrix and vector multiplications; the rest will be higher-level explanations of how people tend to put these layers together. So I just want to pause and see if there are questions about how the matrices and vectors are tracked, multiplied, and updated.

[00:21:48] Yeah — so the question is how you go about constructing the weight matrices, which is a really great question, and I thought to put it on the slide here. How would you actually do this? The same way we always find the weight matrices in this class: gradient descent. We'll talk about how you do gradient descent when you have multiple time steps, and maybe losses computed at each time step as well. That'll be a lot of what we go into next, so it's a great question and very relevant to the lecture.

[00:22:16] This is just an example so you can see how all of the weight matrices are multiplied. If you were to initialize with these weights and then train on another task, that would be something like transfer learning. But in practice I don't think it would work very well at all, because this hidden state is really small and people normally use much larger hidden states; I just wanted something I could fit on the slide.

[00:22:44] Okay, I'll go over the second row again. Imagine h_{t-1} as a column vector here. When you do the matrix multiply to get the left-hand term, the second row is, in effect, rotated and dotted with that vector, so the second entry of the result equals the top entry of h_{t-1}. That is the step that moves the current value down into the previous slot: the end result of this matrix multiply is that the second value is the current value from time t-1. And note that this operation and this operation both give vectors the size of our hidden state, so we add them together.

[00:23:44] Yeah — the left term does the carryover of the previous value, and the right term handles the current input. That's also how it works for RNNs beyond this toy example: one weight matrix is multiplied by the current input, and the other is multiplied by the previous hidden state. So that's what these weight matrices represent more generally, not just in this specific problem.
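Beyond the toy example, that generic step — one matrix for the previous hidden state, one for the current input, the same weights reused at every step — is commonly written with a tanh nonlinearity. A minimal sketch (the sizes and random weights here are made up for illustration):

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh):
    """One vanilla RNN step: carry over the previous hidden state,
    mix in the current input, squash with tanh."""
    return np.tanh(W_hh @ h_prev + W_xh @ x)

rng = np.random.default_rng(0)
H, D = 4, 2                              # hidden size, input size (arbitrary)
W_hh = 0.1 * rng.normal(size=(H, H))
W_xh = 0.1 * rng.normal(size=(H, D))

h = np.zeros(H)                          # one of several init strategies
for x in rng.normal(size=(10, D)):       # a sequence of 10 input vectors
    h = rnn_step(h, x, W_hh, W_xh)       # same W_hh, W_xh at every step
```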
[00:24:06] Okay, so how do you actually compute the gradients? Let's look at the computational graph, drawn a little more explicitly than before. We have x1 coming in, then x2 — a whole sequence of x's. We're calculating a hidden state at each time step, and we're specifically using the same weight matrices for each of these calculations, which we need to keep in mind when we think about how the gradients are computed.

[00:24:35] Let's start with the many-to-many scenario, where we have an output for each input. In this scenario you can often also calculate a loss for each output — how correct the output is at that stage. So in this setting you have a loss at each step, and you can sum them all together to get your total loss across the entire input sequence.

[00:24:59] When we do backprop, once we compute this final loss we can work with the loss per time step as well, depending on the formulation: if you're calculating a loss per time step you can treat them independently, and sometimes you have an overall loss built from the per-step losses. We can also get the final gradients for each of these W's: you calculate the gradient for each time step separately, and then you sum them all together. That's how it works in practice. If there were different W's at each time step, you could probably see how the computational graph would be structured so that you calculate a different gradient for each of them. So for computational purposes we're essentially treating our single W as a set of different W's, but at the end we merge all the gradients together, because it's really the same weight matrix being multiplied each time. Conceptually, you calculate the gradient for each time step — almost treating it in your head as if a different W were used — and then, because the weights are shared, you just sum all the per-step gradients together.

[00:26:17] In the many-to-one scenario, you'll just have a single loss calculated here. Sometimes you'll only use the final hidden state to calculate the value, depending on the problem setting. For something like video classification, it may make sense to use the hidden state from every step, since there can be relevant information throughout the entire course of the video; you can do some pooling, like average pooling or max pooling, to compute your y value.
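That "treat the shared W as separate copies, then sum the per-step gradients" rule can be checked numerically on a stripped-down recurrence. Everything here is a simplification of mine — a scalar weight, no activation, and a loss equal to the sum of the hidden states — chosen so the backward pass fits in a few lines:

```python
import numpy as np

def forward(w, xs, h0=0.0):
    """Scalar linear recurrence h_t = w*h_{t-1} + x_t; loss = sum of h_t."""
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    return hs, sum(hs[1:])

w, xs = 0.7, [1.0, -2.0, 0.5, 3.0]
hs, loss = forward(w, xs)

# Backward pass: at each step add that step's contribution dL/dh_t * h_{t-1}
# to a single shared gradient for w, then push dL/dh through the recurrence.
grad_w, grad_h = 0.0, 0.0
for t in range(len(xs), 0, -1):
    grad_h += 1.0                 # h_t appears directly in the loss
    grad_w += grad_h * hs[t - 1]  # this time step's share of dL/dw
    grad_h *= w                   # flow into h_{t-1} via h_t = w*h_{t-1} + x_t

# The summed gradient matches a numerical derivative of the unrolled loss.
eps = 1e-6
num = (forward(w + eps, xs)[1] - forward(w - eps, xs)[1]) / (2 * eps)
print(abs(grad_w - num) < 1e-6)  # True
```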
[00:26:49] Then, if you have this kind of mapping — as in image captioning — there was a question about how you could incorporate the previous y's. You still need an input to your f_W, because there are essentially two of these weight matrices: one expecting the input vector x and the other expecting the previous time step's hidden state. You could imagine putting a lot of different values in there: you could just put zeros, or you could put the previous output.

[00:27:26] Okay, so I explained at a high level how you do the backpropagation. But there are some very practical issues you'll run into with this conceptual framework: running out of GPU memory, which is the cause of basically all the issues when you're trying to train a neural network, and, I guess, NaN losses during training. When you're computing, say, a loss at each time step and you have an extremely long input sequence, it's really easy to see the problem: you need to keep the activations and the gradients for each time step in memory and then sum them all together, and this gets extremely large as your input sequence grows. So what can you do practically to resolve this issue?

[00:28:12] By the way, this is called backpropagation through time: the same weight matrix is applied at multiple time steps, and you sum the gradient from each time step together. What you can do is truncated backpropagation through time. You fix a time window and pretend that this window is all the model has been trained on so far. We start with our h0; from the input at time step one and the previous h value we calculate the hidden state h1, then use that to calculate our output and its loss, and we run this for each of our examples. You can imagine how in this setting it's relatively easy to treat the beginning of the sequence as if it were all we were seeing during training.

[00:29:02] Moving to the next block, you now start your h0 as the output of the previous block's final step. So we're initializing the hidden state with whatever that output was, but the gradients no longer carry over: we're basically batching the computational graph so that we only look at the loss in a neighborhood of time steps at a time, with a fixed window size that you set. That's how you get around this relatively common issue, especially with really long input sequences. You're batching it out, and you can just keep doing this for the entire input sequence.
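The chunking scheme just described can be sketched with the same kind of scalar stand-in (the window size, learning rate, and input sequence are invented for illustration). The key move is that the hidden state's value crosses the chunk boundary while its gradient does not:

```python
import numpy as np

def run_chunk(w, h0, xs):
    """Forward one window of h_t = w*h_{t-1} + x_t with loss = sum of h_t,
    then backprop only inside the window (gradients stop at h0)."""
    hs = [h0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    loss = sum(hs[1:])
    grad_w, grad_h = 0.0, 0.0
    for t in range(len(xs), 0, -1):
        grad_h += 1.0
        grad_w += grad_h * hs[t - 1]
        grad_h *= w
    return hs[-1], loss, grad_w

w, lr = 0.5, 1e-3
seq = list(np.sin(0.1 * np.arange(100)))  # one long input sequence
window = 10                               # fixed truncation window

h = 0.0
for start in range(0, len(seq), window):
    # carry the value of h into the next chunk, but not its gradient
    h, loss, grad_w = run_chunk(w, h, seq[start:start + window])
    w -= lr * grad_w                      # update after each window
```

In autograd frameworks this same pattern shows up as detaching the carried hidden state from the graph at each window boundary.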
Um so you can still calculate the gradients at each [00:29:53] still calculate the gradients at each time step but you will no longer have [00:29:55] time step but you will no longer have this um uh loss that's uh dependent on [00:30:00] this um uh loss that's uh dependent on the time step uh itself uh the output of [00:30:03] the time step uh itself uh the output of the time step itself rather you'll be [00:30:04] the time step itself rather you'll be relying on upstream gradients. So you [00:30:06] relying on upstream gradients. So you can imagine we're looking at the far [00:30:08] can imagine we're looking at the far right of the diagram here and we have [00:30:10] right of the diagram here and we have our loss that we calculate based on the [00:30:12] our loss that we calculate based on the output at the final time step. um we can [00:30:14] output at the final time step. um we can calculate what is the gradient uh with [00:30:16] calculate what is the gradient uh with respect to our current uh hidden state [00:30:18] respect to our current uh hidden state at the end. And then we have our whh [00:30:22] at the end. And then we have our whh matrix to help us understand how did the [00:30:25] matrix to help us understand how did the uh how did the previous hidden state [00:30:27] uh how did the previous hidden state contribute to the uh final hidden state [00:30:30] contribute to the uh final hidden state and we can use that to uh calculate the [00:30:33] and we can use that to uh calculate the gradient and understanding based on the [00:30:36] gradient and understanding based on the previous hidden state and the weight [00:30:37] previous hidden state and the weight matrix. how can we change this [00:30:39] matrix. 
how can we change this transformation matrix whh such that um [00:30:42] transformation matrix whh such that um we would be changing our loss and uh [00:30:45] we would be changing our loss and uh then you can just you basically just [00:30:47] then you can just you basically just applying the gradient rule to whh over [00:30:50] applying the gradient rule to whh over and over again here and you're only [00:30:51] and over again here and you're only looking at how the hidden state changed [00:30:52] looking at how the hidden state changed the next hidden state and how that [00:30:54] the next hidden state and how that contributed to the loss. So you look at [00:30:57] contributed to the loss. So you look at the final example here. This tells you [00:30:59] the final example here. This tells you how changing the hidden state depends on [00:31:01] how changing the hidden state depends on loss and then you know how the previous [00:31:03] loss and then you know how the previous hidden states how they change how that [00:31:05] hidden states how they change how that affected uh the current hidden state [00:31:07] affected uh the current hidden state which is given by this whh matrix. So [00:31:10] which is given by this whh matrix. So using the W's at each time like using [00:31:12] using the W's at each time like using different W's at each time step would [00:31:14] different W's at each time step would essentially mean that you're um no [00:31:16] essentially mean that you're um no longer modeling it as a recurrence [00:31:18] longer modeling it as a recurrence relation. So basically you have uh you [00:31:20] relation. So basically you have uh you can think of it as one layer for each [00:31:22] can think of it as one layer for each different possible time step. Um so you [00:31:26] different possible time step. 
[00:31:28] So you would probably see worse performance, because you're no longer modeling it as a sequence recursively. Imagine you train a neural network where you have a series of inputs and each one goes through its own separate weight, independently. That would make sense for a problem that isn't a sequence modeling problem, where you just have a set of things you want to classify. You would also need to know the length of the sequence ahead of time. So I think it could work if it's not a sequence, but for sequences of variable length I think it would not work very well; it's sort of like you're training one neural network for each time step, which is not the right way to formulate it.
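The chain-rule walk described above can be made concrete with a small sketch. This is not the course's code; it is a minimal NumPy illustration (all sizes and names are made up) of backpropagation through time for a vanilla RNN whose loss depends only on the final hidden state, showing how gradient contributions to the shared W_hh accumulate at every time step.

```python
import numpy as np

# Illustrative sketch of BPTT for a vanilla RNN; loss is on the final hidden state only.
rng = np.random.default_rng(0)
D, H, T = 3, 4, 5                      # input dim, hidden dim, number of time steps
Wxh = rng.normal(0, 0.1, (H, D))       # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))       # hidden-to-hidden weights (shared across time)
xs = rng.normal(0, 1.0, (T, D))        # a toy input sequence
hs = [np.zeros(H)]                     # h_0 = 0
for t in range(T):                     # forward: h_t = tanh(Wxh x_t + Whh h_{t-1})
    hs.append(np.tanh(Wxh @ xs[t] + Whh @ hs[-1]))

target = np.ones(H)
loss = 0.5 * np.sum((hs[-1] - target) ** 2)   # toy loss depends only on the final h_T

dh = hs[-1] - target                   # dLoss/dh_T: the "upstream gradient"
dWhh = np.zeros_like(Whh)
dWxh = np.zeros_like(Wxh)
for t in reversed(range(T)):           # walk backwards, reusing the same Whh each step
    draw = dh * (1.0 - hs[t + 1] ** 2) # backprop through tanh
    dWhh += np.outer(draw, hs[t])      # the shared matrix collects a term at every step
    dWxh += np.outer(draw, xs[t])
    dh = Whh.T @ draw                  # pass the gradient to the previous hidden state
```

A numerical finite-difference check on any single entry of dWhh agrees with the accumulated value, which is a quick way to convince yourself that applying the chain rule to the same W_hh over and over is right.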
[00:32:16] So how does this work with chunking? We understand that up to this point, at the point right here with the red dot, we can calculate the gradient of the loss with respect to our final hidden state. If we can do that, then we can calculate the gradient of our loss with respect to the second-to-last hidden state, because we know the final hidden state depends on the previous hidden state times this weight matrix W. We can keep going backwards until here, and at that point all we need to save is this final step here.
[00:32:54] So what is the gradient of the loss with respect to this initial hidden state of our truncated batch? (Maybe I'm overusing the word final.) When we're calculating backwards, we just use that value to calculate all the previous time steps. So that's the overall process: you're only looking at how the hidden state transforms to form the new hidden state, and that's the only value getting updated here. Oh, and also how the input changes the hidden state, so you're really looking at two values: how the input affects the next hidden state, and how the previous hidden state does. So the learning still occurs for all the batches.
[00:33:47] You have your loss with respect to each of your parameters in W here, and when you're calculating it for the previous time step, you basically keep this one value: if you change the initial hidden state here, how does that change the loss? You can calculate that, and then you can see how all the variables feeding into it, namely the original hidden state and the current input, affect it. But when you're actually moving to the next chunk over, you only need to look at how this hidden state here affects the hidden state in the next chunk. So you're looking at this division boundary, and the one variable you need to carry over is the gradient of the loss with respect to the hidden state that comes right after the chunk.
[00:34:29] And then you can use that to calculate the gradient of the current hidden state, which depends on the input x and on the previous hidden state. There are different ways you can formulate it, but you can imagine we just apply the update to all the weights here and zero out the memory; the only thing we're tracking is the gradient right here. So you can do a gradient-apply step, where you apply all the gradients to the weights depending on the learning rate, your optimizer, and so on, and then you move on to calculating the next batch. The reason this isn't a perfect calculation is that you're computing these chunks independently rather than all at once, so you end up with three different updates rather than a single one. But you're still calculating the gradient for each step here.
[00:35:15] You keep one thing in memory, which is: how does this hidden state, the first one in the batch, need to change to affect the loss? And we throw out all the other ones. You have the weights in memory, you apply the gradient: you do your learning-rate multiply and apply it to the weights. You'll also see a similar thing if you do distributed learning: if you have a gradient calculated on each GPU separately, they will all be applied to the same set of weights, even though they were calculated independently. I think we have a lecture on distributed learning coming up. So it's a similar idea, where you're not tracking everything in the same memory at the same time, and you're applying updates to the weights one at a time. Yeah, it would be better if you could fit it all in memory.
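The chunked procedure just described can be sketched roughly as follows. This is illustrative NumPy with made-up sizes, and a toy per-step squared-error loss stands in for the real one: the hidden state is carried forward across chunk boundaries, but gradients are computed within each chunk and the weights get one update per chunk rather than a single exact update for the whole sequence.

```python
import numpy as np

# Illustrative sketch of truncated backprop through time (all names made up).
rng = np.random.default_rng(1)
D, H = 2, 3
Wxh = rng.normal(0, 0.1, (H, D))
Whh = rng.normal(0, 0.1, (H, H))
xs = rng.normal(0, 1.0, (12, D))       # a length-12 toy sequence
targets = rng.normal(0, 1.0, (12, H))  # toy per-step regression targets
lr, chunk = 0.05, 4

h = np.zeros(H)
for start in range(0, len(xs), chunk):
    x_c, y_c = xs[start:start + chunk], targets[start:start + chunk]
    hs = [h]                                   # forward pass through this chunk only
    for x in x_c:
        hs.append(np.tanh(Wxh @ x + Whh @ hs[-1]))
    dWxh, dWhh = np.zeros_like(Wxh), np.zeros_like(Whh)
    dh = np.zeros(H)
    for t in reversed(range(len(x_c))):        # backward pass stops at the chunk start
        dh = dh + (hs[t + 1] - y_c[t])         # toy per-step loss: 0.5 * ||h_t - y_t||^2
        draw = dh * (1.0 - hs[t + 1] ** 2)
        dWxh += np.outer(draw, x_c[t])
        dWhh += np.outer(draw, hs[t])
        dh = Whh.T @ draw                      # gradient flowing to earlier steps
    Wxh -= lr * dWxh                           # one (approximate) update per chunk
    Whh -= lr * dWhh
    h = hs[-1]                                 # carry the hidden state, not the gradients
```

The key line is the last one: the hidden state crosses the chunk boundary, but the gradient computation does not, which is exactly the information loss mentioned above.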
Yeah. Yeah. It would be better if you fit it [00:36:05] Yeah. It would be better if you fit it all in memory. I mean this is mainly for [00:36:07] all in memory. I mean this is mainly for for this one it's essentially the same [00:36:09] for this one it's essentially the same but in this setting u maybe it's more [00:36:11] but in this setting u maybe it's more clear how you're explicitly losing [00:36:12] clear how you're explicitly losing information. [00:36:14] information. So, um, here you're only looking at some [00:36:17] So, um, here you're only looking at some of the outputs at a time. Um, so you [00:36:21] of the outputs at a time. Um, so you it's really clear how we're not looking [00:36:23] it's really clear how we're not looking at the entire set of the losses when [00:36:26] at the entire set of the losses when we're calculating because there's losses [00:36:28] we're calculating because there's losses at each time step. So, you lose [00:36:30] at each time step. So, you lose information here, but in this case, you [00:36:32] information here, but in this case, you wouldn't lose information. Uh, I think [00:36:35] wouldn't lose information. Uh, I think one more practical example where we [00:36:37] one more practical example where we can't fit the whole RNN on the slide is [00:36:39] can't fit the whole RNN on the slide is this idea of a character level language [00:36:41] this idea of a character level language model. And it's really funny because [00:36:43] model. And it's really funny because these were shown to be quite effective [00:36:46] these were shown to be quite effective uh 10 years ago. Um and you can it's [00:36:49] uh 10 years ago. 
[00:36:50] It's really funny because you can see how the current wave of language models is sort of a buildup of this really simple approach of just predicting characters with RNNs. Usually when you do a model like this, you will input your characters with what people call a one-hot encoding, where you have a one in your vector and zeros in every other location, so it's sort of like an index; you can encode the character as an index. Then we can use these as inputs and calculate our hidden layers based on the previous hidden layer as well as the current input. Then we have our output layer, where now we can look at the output for the corresponding correct value, which is taken to be the character at the next time step. So we want the output, for example, to be E, and we map it over here.
[00:37:45] You can imagine this is something like softmax, and we have the logits; these are the scores. 2.2 is lower than 4.1, so this is maybe not so great of an output at this time step, and so on and so forth. So you can really view this as a time-step-wise classification problem, and that's exactly what these language models are doing in general: time-step-wise classification based on softmax. At test time, the basic idea is that we also need to sample characters one at a time and feed each one back into the model, so it sees what it generated at the previous time step, and so on, repeating until we generate the words.
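The per-time-step classification view can be sketched with toy numbers (not the slide's exact model): a one-hot character encoding, a softmax over the vocabulary, and a cross-entropy loss on the correct next character.

```python
import numpy as np

# Toy sketch of one time step of character-level classification (made-up values).
vocab = ['h', 'e', 'l', 'o']
char_to_ix = {c: i for i, c in enumerate(vocab)}

def one_hot(c):
    """Vector with a one at the character's index and zeros everywhere else."""
    v = np.zeros(len(vocab))
    v[char_to_ix[c]] = 1.0
    return v

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Suppose the model emitted these logits (scores), and the correct
# next character is 'e'.
logits = np.array([1.0, 2.2, -3.0, 4.1])
probs = softmax(logits)
loss = -np.log(probs[char_to_ix['e']])  # cross-entropy on the target character
# 'o' gets the highest score here, so the loss on the correct class 'e' is
# large, and a training step would push the 'e' logit up.
```

This is one time step; summing the same loss over every position in the sequence gives the full training objective.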
[00:38:35] So you can actually create RNNs to do this basic language modeling task by operating at a character level, and it works quite well. One thing to note about this input layer is that usually we don't actually input one-hot encodings into the model; instead we'll have something called an embedding layer, which is essentially just a giant matrix of dimensions V by d, where V is the number of different inputs your model can see. You can imagine this as a matrix multiply where we grab the first row, or in this case the second row, of our embedding matrix based on what our input sample is here. And we just use this as a matrix multiply. This is incorrect, actually: this one should be higher probability.
[00:39:26] It's funny; we've had these slides for quite a few years, and I guess no one noticed it. Good question. Anyway, so we have E here as our target character, and in this case you're correct that the model is actually getting it wrong, so we will want to penalize it heavily this time. Yeah, it was a good question. One of the nice things about this implementation is that it's really simple, like 112 lines of Python code, and you can train these models on a variety of different tasks. This is like the pre-LLM era of what you could do: you can train it on sonnets by William Shakespeare. And as I mentioned, there's a blog post by a former instructor of this course, Andrej Karpathy, back in 2015, about how these RNNs are sort of unreasonably effective at what they do in generating text. Yeah. Could you explain why you use an embedding layer?
[00:40:14] Oh, yeah. So the basic idea for an embedding layer is that generally it's better to have vectors as input to our models, and you can learn what these embedding layers are, too. We tend to favor spread-out weights in general when we're trying to learn these, so you can initialize your embedding layer to very small near-zero values with something like the Kaiming initialization we talked about, and then you're just looking at one row of it at a time as your input vector, rather than the input being a bare number. To represent a bare number you would basically need a one with a bunch of zeros, and optimization-wise the embedding works better. Okay, so yeah, you can do it in 112 lines of Python code, which is pretty neat. You can train it on sonnets by William Shakespeare and it'll actually output reasonable text.
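The embedding-layer answer above can be sketched in a few lines (illustrative sizes, not the course's code): multiplying a one-hot vector by a learned V-by-d matrix is exactly the same as selecting one row of it, which is why implementations do a row lookup instead of a matrix multiply.

```python
import numpy as np

# Sketch: an embedding layer is a learned V x d matrix; feeding a one-hot
# vector through it is equivalent to selecting one row. Sizes are made up.
rng = np.random.default_rng(0)
V, d = 4, 8                                  # vocabulary size, embedding dimension
E = rng.normal(0, np.sqrt(2.0 / V), (V, d))  # small random init (Kaiming-style scale)

idx = 2                                      # the input character's index
one_hot = np.zeros(V)
one_hot[idx] = 1.0

via_matmul = one_hot @ E                     # matrix-multiply view of the input layer
via_lookup = E[idx]                          # row-lookup view: identical, much cheaper
assert np.allclose(via_matmul, via_lookup)
```

During training, only the looked-up row receives a gradient at each step, so the lookup view is also what makes updates cheap.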
[00:41:01] We'll go through some examples. One of the cool things is that as you train the model more, it becomes more and more coherent. At the beginning it's basically just gibberish, because it hasn't learned proper values for W. As you train it more, by stage three it kind of looks like English, at least some of the words right there. And as you train even more, it actually starts working really well, which I guess was a bit of foreshadowing for what was to come in the era of AI, which is pretty cool. You can see it learns things about the style: how you should have someone's name, and something that seems fairly plausible.
[00:41:42] As you have it generate more and more, it starts making less and less sense, but it's pretty cool to see. You can also train it on code. I think in this example they trained it on Linux, just the source code for Linux: they trained one of these character-level RNNs, and you can see it generating C code that looks pretty good. I don't know if this would compile, but it looks reasonable at a glance. And this idea has really taken off over the past few years.
[00:42:11] I mean, I'm sure you all know, especially since a lot of you work in computer science or coding, or you're students in this area, but there are all of these different programming tools now built on language models that were essentially trained on a similar task. They've consumed a bunch of training data that's just existing code, and instead of trying to predict the next character, they're trying to predict the next token, which is a group of characters; how they define tokens depends on the model, and there are a lot of details we could get into there. But at a high level, it's a really similar thing: they're just predicting groups of characters autoregressively, one after the next. And it's really seen a blow-up in recent years with all these existing tools. Yeah. What is the input to the model? Is it like a trigger? Oh, what, like for this? Yeah.
[00:42:56] You could have the input be... yeah, maybe you start with a random character; that could be one way to do it, but you would need some initial input. Usually language models have a start token, a predetermined symbol that is always what you see at the start of your sequence, so you could do a similar thing with RNNs. I don't know what they did in this exact scenario; maybe they just used a character, but it's hard to know. So the question is: how does labeling work with language models? The neat thing about these pure language models is that all they're doing is predicting the next token. You don't need to label anything; you just need to give the model a lot of text. That's why these models are so good: they scrape the internet for essentially all available text and then train a model on all of it.
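The no-labels point can be seen in a tiny sketch: for next-character prediction, the training targets are just the input text shifted by one position, so raw text supplies its own supervision.

```python
# Sketch: the "labels" for next-character prediction come for free from raw text.
text = "hello"
inputs = list(text[:-1])    # characters the model sees: ['h', 'e', 'l', 'l']
targets = list(text[1:])    # what it should predict next: ['e', 'l', 'l', 'o']
pairs = list(zip(inputs, targets))
# Each (input, target) pair is one time-step classification example;
# no human labeling is involved anywhere.
```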
[00:43:41] That's why they're so good: it's just generating the next token, and you don't need to label it. So the next question is: if we're always taking the maximum-probability output at each time step, won't we always just be generating the same thing over and over again? And the answer is yes, actually. If you just took the maximum probability at each time step (I guess this example is not so good, but imagine the probabilities are correct here), you would always get the same output given the same input. In practice, people don't do this; it's called greedy decoding, always picking the maximum probability. Instead, they sample based on a distribution, the one given by the probabilities output by your softmax.
[00:44:24] So you won't always pick the max probability; you would pick, say, this output with probability 0.84, or this other output variable with probability 0.13, and then you would do that for each step of the sequence. And there are a bunch of different ways you can do it, too. You can search ahead, which is called beam search, where you try different continuations and see which one has the highest overall probability for the sequence. So this is a whole active area of research: how do you sample from these models? But the simple answer is you don't always pick the highest probability. Yes. The question is: in the case where we have many-to-one outputs, are we outputting something at each time step, or do we have something to look at here?
[00:45:01] So I think in practice, to save compute, you wouldn't want to output something that's never used, but you could feasibly output it at each time step, and depending on your problem it might be interesting to look at it and understand whether the output is converging over the course of training. So it might be useful to look at, but generally people wouldn't do it, just to save compute. It could help you understand the way your model works, whether there are certain triggers or things that help it predict the correct answer. Cool, good questions. Okay, so we'll keep on chugging along. We talked about these RNNs and how good they are at generating characters, and we related them to some of these modern coding tools, which are really neat.
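As a quick aside, the greedy-versus-sampled decoding discussed earlier can be sketched in a few lines of NumPy. The probabilities below are made up to echo the .84/.13 example, and `greedy_decode`/`sample_decode` are illustrative names, not anything from the lecture's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy_decode(probs):
    """Greedy decoding: always take the argmax. Deterministic, so the
    same input always yields the same next token."""
    return int(np.argmax(probs))

def sample_decode(probs):
    """Sample the next token from the softmax distribution instead,
    e.g. pick index 0 with probability .84, index 1 with .13, etc."""
    return int(rng.choice(len(probs), p=probs))

# Illustrative softmax output over a 4-token vocabulary.
probs = np.array([0.84, 0.13, 0.02, 0.01])

print(greedy_decode(probs))                      # always 0
print([sample_decode(probs) for _ in range(5)])  # mostly 0, sometimes others
```

Beam search would extend `sample_decode` by keeping several partial sequences alive and comparing their total (log) probabilities rather than committing to one token at a time.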
[00:45:43] One of the cool things about RNNs is that you can look at the activation values, and they'll actually sometimes tell you interesting things about what the model is tracking. In our little toy example, we looked at the output activations, and you would see the current value and the previous value; that was what the RNN states, or cells, were tracking. What you can also do is give the model a sequence. In the models I'll show in these slides, it's using a tanh activation, so values run from minus one to one: minus one is visualized as red here, and values very close to one are blue, so we get the whole spectrum. You can then look at, for each character coming in, the activation of a given cell at that time step, and that's how they're color-coding these plots here.
[00:46:30] This one's not really showing anything; it's random. A lot of them won't be interpretable, but some of them track pretty cool things. For example, this one's a quote detector: it turns on basically as soon as the quote starts, and it turns off when the quote ends. So this is something in the RNN tracking the fact that we need an end quote at some point; exactly when to put it is something the model is trying to figure out, but it's tracking it. Another cool one is the line-length tracking cell: it starts at a very high value and drops to a very low value as you near where the model thinks there will be a newline character. So this is also a neat way to look at these values.
[00:47:16] And these are, again, just single activations in a layer of this model that we're looking at and mapping to each character, so it's highly interpretable. There's also this sort of if-statement cell, where anything within an if statement is being tracked, which is pretty cool, and even things like detecting quotes or comments, because the model needs to know to output the end-of-comment character; it's something it needs to track, so you get this nice interpretable cell as well. And finally, this code-depth cell: as you nest deeper in your code, it activates more and more at each step into the indentation of your code hierarchy. So this is pretty neat: you can actually look at the activations and directly map them onto the inputs without needing to do any fancy tricks.
[00:48:05] Which is actually pretty incredible, if you think about how interpretable some of these hidden states are in the RNN. It's somewhat similar to what we were doing when we assigned the state manually, but the RNN is internally doing a very similar process. Cool. So I'll now talk about some of the trade-offs: why you might want to use an RNN, and when it's helpful. The nice thing is that RNNs can process any length of input. A lot of modern language models that rely on transformers have something called a context length, or maximum context window. RNNs don't have this: they can take a sequence of essentially infinite length, as long as you can keep running the model on it, so there's no context-length limit.
[00:48:51] The computation for time step t can, in theory, use information from many steps back, if it's captured in the hidden state. So if your model effectively captures all of the dynamics of your input sequence in the hidden state, in theory it can use values from extremely long ago, though in practice there are some issues with this, which we'll go into. Also, the model size does not increase for a longer input. We had an example earlier asking what if you just had a different layer for each input time step; you don't have that issue here, which is nice. And we're applying the same weights at each time step, so the update rule for how we calculate the outputs is the same every single time.
[00:49:32] So there's some nice symmetry here, and when you think conceptually about the problem, you're always doing the same thing at every single time step, which is nice conceptually and also helps with implementation. So what are the main disadvantages? You need to compute the previous hidden state to compute the next one, every single time. This can be slow: each hidden state is conditioned on all the previous ones, so this recurrent computation can end up taking a lot of time.
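The sequential recurrence being described can be sketched as follows (shapes and names are illustrative). The key point is the loop-carried dependence on `h`: step t cannot start until step t-1 finishes, yet the same small weight matrices serve a sequence of any length:

```python
import numpy as np

rng = np.random.default_rng(2)
H, D, T = 16, 4, 100                  # hidden size, input size, sequence length
Whh = rng.normal(0, 0.1, (H, H))
Wxh = rng.normal(0, 0.1, (H, D))
xs = rng.normal(size=(T, D))

# Vanilla RNN recurrence: h_t = tanh(Whh @ h_{t-1} + Wxh @ x_t).
# Each iteration needs the previous h, which is why training is hard
# to parallelize across time steps.
h = np.zeros(H)
for x in xs:
    h = np.tanh(Whh @ h + Wxh @ x)

print(h.shape)  # (16,) -- fixed-size state no matter how long the sequence
```

Changing `T` to 10 or 10,000 changes nothing about the model's parameters, which is the "model size does not grow with input length" advantage; the cost is that the loop above cannot be collapsed into one parallel batch the way a transformer's attention can.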
[00:50:08] Although this is not an issue during inference time (a transformer has the same situation, where you output the next token or character one step at a time), at training time it's actually difficult to batch all of these together, because in order to calculate the loss you need to calculate the previous hidden states. So this can pose challenges for scaling up to a lot of data. And in practice, it's actually difficult to access information many time steps back, because we have a fixed-size hidden state and we're trying to cram all the information into it, so you'll eventually lose some information as your sequence grows longer and longer. Cool. I'll talk about some applications more specific to computer vision where RNNs have seen success.
[00:50:58] One of them is image captioning, which we talked about. The basic idea here is that there's this start token, or start character, which begins the sequence, and you terminate when you get this end character or end token; in this case it looks like it's word-level tokens. The most basic way to do it: you have a CNN, or some visual encoder, that encodes the image, and we use that as input to our recurrent neural network, along with the previous text that was generated. So we have two stages here. More concretely, how would you combine the CNN and the RNN? You can imagine you have this test image; it comes in, and your model runs downwards here, starting at the first layers at the top and then moving downwards.
[00:51:47] You can imagine this is something that was trained on, say, ImageNet. We're not going to use the class labels; we're going to use the second-to-last layer. This is the common strategy we saw for transfer learning, and for getting good visual representations of images in general. So we take the second-to-last layer and use it as input to our hidden state. Now our hidden state is also a function of this W value here, so we don't just have a plain hidden state; we're also tracking the visual components. But I won't spend too much time on this, because it won't be in any of the assignments.
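A rough sketch of this two-stage setup, with all sizes, weight names, and the greedy decoding loop assumed for illustration (this is not the exact architecture from the slides, which injects the image feature slightly differently):

```python
import numpy as np

rng = np.random.default_rng(3)

F, H, V = 512, 256, 1000              # CNN feature dim, hidden dim, vocab size
START, END = 0, 1                     # special token ids (hypothetical)
Wih = rng.normal(0, 0.01, (H, F))     # projects the image feature into the state
Wxh = rng.normal(0, 0.01, (H, V))
Whh = rng.normal(0, 0.01, (H, H))
Why = rng.normal(0, 0.01, (V, H))

def caption(image_feature, max_len=20):
    """Greedy caption generation: seed the hidden state with the CNN's
    second-to-last-layer feature, then unroll the RNN until END."""
    h = np.tanh(Wih @ image_feature)  # image information enters the state
    token, out = START, []
    for _ in range(max_len):
        x = np.zeros(V); x[token] = 1.0
        h = np.tanh(Whh @ h + Wxh @ x)
        token = int(np.argmax(Why @ h))
        if token == END:              # sampling END tells us when to finish
            break
        out.append(token)
    return out

print(caption(rng.normal(size=F)))    # a (meaningless) token id sequence
```

With random weights the output is of course gibberish; the structure is the point: the CNN runs once, and its feature conditions every step of the RNN's unrolling.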
[00:52:34] But just to give you a flavor, here's how RNNs were used historically with CNNs: we take a CNN pre-trained on ImageNet and fold that information into the hidden state as well. We use the sampling process, either greedy sampling or some other version, to produce tokens at each time step, and we end whenever we sample the end token; that's how we know when to finish. These models actually worked very well for the time; they had a lot of great successes. You can see here a lot of nice examples where the model outputs very reasonable captions based on the input image, but the models would struggle in a lot of scenarios, too. A lot of these failures have to do with the distribution of where these images are commonly seen in the training data.
[00:53:21] For example, someone holding something with their hands cupped like this looks very much like how they might hold a mouse, but obviously we can tell this is a phone, because it's a flat object they're holding and their hand is facing up, not downwards. So this sort of thing is interesting to see. Also, I guess the model thinks the woman is holding a cat when she's just wearing some fur clothing, and it sees a beach, so it assumes there's a surfboard. This type of hallucination, I would say, is still extremely common with vision-language models today: the model thinks there are objects present that are commonly found in a given kind of scene but aren't in the particular scene you're looking at. Also things like a bird supposedly perched in the tree, or a man described as throwing a ball when he's actually catching one.
[00:54:11] These are all based on bias in the dataset: essentially, the model learns during training that a certain object is most probably present, or a certain action is most probably being performed, when in the actual image that's not the case. In the dataset there's a high co-occurrence of these actions or objects with the particular scene, so the model learns to associate them, but it doesn't learn to disentangle them. In this scene, we know they're not throwing, because the glove is here and the ball is going into the glove, not leaving the other hand; but you need to explain it like that, and the way we train these models, we're training them just to output the caption, so there's no explanation involved. That's part of the reason you see this co-occurrence issue. Okay.
[00:54:54] So visual question answering is another really common task where RNNs were used, and there are two formulations that were commonly used. One: say you have a captioning model and you want to see how well it can answer questions. One thing you could do is give it the question, have it output text, and look at the probabilities of each of the answer sequences. You have a probability for each character or token, and you can multiply them together to get the probability of the overall answer. That's one way you could use one of these RNN-style models to do question answering. A more common way people did it is to give the question as input to the model.
[00:55:42] Multiple candidate answers are also given as separate inputs to your model, and it then outputs essentially a probability per answer. So in this case it would be a four-way classifier, where you have four classes (answer one, answer two, answer three, answer four) and you're just outputting the probabilities. There are a lot of different ways you can formulate it, but this is a very common task in computer vision where you need to use language and where sequence modeling helps. There's also visual dialogue. At the time, these were all considered very separate tasks;
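The first VQA formulation mentioned above, scoring a candidate answer by multiplying its per-token probabilities, can be sketched like this. The distributions here are made-up stand-ins for an RNN's softmax outputs, and the product is computed in log space for numerical stability:

```python
import numpy as np

def answer_log_prob(step_probs, answer_tokens):
    """Score a candidate answer: sum of log-probabilities of its tokens,
    equivalent to multiplying the per-token probabilities together."""
    return sum(np.log(step_probs[t][tok])
               for t, tok in enumerate(answer_tokens))

# Two hypothetical softmax outputs over a 3-token vocabulary.
step_probs = [np.array([0.7, 0.2, 0.1]),
              np.array([0.1, 0.8, 0.1])]

scores = {ans: answer_log_prob(step_probs, ans)
          for ans in [(0, 1), (2, 2)]}
best = max(scores, key=scores.get)
print(best)   # (0, 1): 0.7 * 0.8 beats 0.1 * 0.1
```

The four-way-classifier formulation replaces this per-token scoring with a single softmax over the candidate answers.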
[00:56:13] these days you have sort of one model that can do almost all of them. But how can you have a chat about an image? We've really seen an explosion in the capabilities of these kinds of models in the last two years. Maybe one other task that RNNs were commonly used for is visual navigation: you have images coming in, and you want to output a sequence of directions to move through some 2D floor plan to reach the target destination. This is another application to be aware of where these sequence models were used. Okay. One thing I want to note that I didn't explicitly mention before: just as we can have multi-layer CNNs, or multiple dense or fully connected layers, you can also have multi-layer RNNs.
[00:57:07] And in practice, most of the RNNs I showed were multi-layer RNNs. The main difference is that you treat each layer separately: the hidden state of, say, layer 1 depends on the hidden state of layer 1 at the previous time step. So in the depth dimension, each of these layers only looks at the hidden states from that same layer at previous time steps. And in the time dimension, the first layer takes the actual input x, but the second layer takes as its input the output y from the previous layer. So you can stack these up, and it forms a grid where each layer operates with respect to the previous hidden states only within that layer.
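The grid structure just described can be sketched as follows (layer count and shapes are illustrative). Along time, layer l only sees its own previous hidden state; along depth, layer l's input at each time step is layer l-1's output at that same step, with layer 0 receiving the actual input x_t:

```python
import numpy as np

rng = np.random.default_rng(4)
L, H, D, T = 3, 8, 4, 5               # layers, hidden size, input size, steps
Wxh = [rng.normal(0, 0.1, (H, D if l == 0 else H)) for l in range(L)]
Whh = [rng.normal(0, 0.1, (H, H)) for l in range(L)]
xs = rng.normal(size=(T, D))

h = [np.zeros(H) for _ in range(L)]   # one hidden state per layer
for x in xs:
    inp = x
    for l in range(L):
        # time direction: layer l's OWN previous state h[l];
        # depth direction: input comes from the layer below.
        h[l] = np.tanh(Whh[l] @ h[l] + Wxh[l] @ inp)
        inp = h[l]                    # passed upward to the next layer

print(len(h), h[-1].shape)            # 3 (8,)
```

Note that the top-right cell of the grid (last layer, last time step) depends on every cell below and to the left of it, which is the training-cost point made above.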
[00:58:04] But passing inputs and outputs, that happens between layers. And so you can see that to calculate this top-right value, we need to calculate all of the different hidden states in this entire computational graph beforehand. So you can get a feel for how, as you start training, this becomes a very involved and not very efficient process. Okay. I'll talk about one of the key variants of RNNs, actually proposed a while ago, in the 1990s, which saw a lot of success for quite some time until the transformer revolution: LSTMs. You won't need to know the details of how LSTMs operate, but what I hope you take away is that RNNs have some key disadvantages that LSTMs seek to alleviate, and a lot of the more modern state-space models also try to alleviate some of these same issues that RNNs face.
[00:59:00] So we talked about how, by default, tanh is a really commonly used activation function, and we also talked about how you have this Whh matrix that converts your previous hidden state to the new one, and it's summed with this Wxh matrix that converts your input vector xt at the current time step into your hidden state dimension. Then you sum these together. You can also formulate this as: we have our weights here and weights here, and you're stacking the vectors like this. And so sometimes, for shorthand, people will just combine both of these W's together to form one big W.
[00:59:42] But you should note that these are like two blocks that are diagonally positioned together; if you formulate it like this, there are a lot of zeros in this W, because Whh is not interacting with xt at all. But this is a shorthand way to notate it that makes thinking about it and writing down the math easier. So you will see all three variants here. This one is maybe the most explicit about where the actual non-zero values in the weight matrices lie. And so one way to think of it is: you stack these vectors together, which is shown here, we multiply by this W, and then we pass it through tanh. This gives us our output ht, which we pass to the next RNN step. You can imagine these are stacked.
[01:00:22] And then you may also have either the output directly, yt, or we have this layer where it's a weight matrix times ht, with an activation function around it too. Yeah, question? (Student asks how the weights are shared.) Yeah, so the weights are shared within a layer for a multi-layer RNN. All of these hidden state updates will use the same weights, and then each layer, which you stack vertically in this diagram, will have a separate set of weights. Okay. So this is the way that it works. And then when you do backpropagation, we talked about how, if you don't have a loss for each time step's y, you only need to calculate your loss based on the loss of your output ht.
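The update just described, h_t = tanh(Whh · h_{t-1} + Wxh · x_t), and the combined-W shorthand from the slides can be sketched in a few lines of numpy. This is a minimal illustration with arbitrarily chosen sizes, not code from the course; it also verifies numerically that the stacked form is equivalent:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 4, 3                        # hidden size, input size (arbitrary)
Whh = rng.standard_normal((H, H))  # hidden-to-hidden weights
Wxh = rng.standard_normal((H, D))  # input-to-hidden weights

def rnn_step(h_prev, x):
    """One vanilla RNN update: h_t = tanh(Whh @ h_{t-1} + Wxh @ x_t)."""
    return np.tanh(Whh @ h_prev + Wxh @ x)

# Shorthand form: combine the two matrices into one big W and stack the
# two vectors into one; the result is identical.
W = np.hstack([Whh, Wxh])          # shape (H, H + D)

h_prev = rng.standard_normal(H)
x = rng.standard_normal(D)
h1 = rnn_step(h_prev, x)
h2 = np.tanh(W @ np.concatenate([h_prev, x]))
assert np.allclose(h1, h2)         # both formulations agree
```

Writing it as one big W is exactly the notational shorthand mentioned in the lecture: the block structure of W makes clear which entries touch the hidden state and which touch the input.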
[01:01:21] And so when you do this backpropagation, you're multiplying by W and you're also taking the derivative of tanh, and both of these can actually have issues. So, mathematically: if we change each component of our hidden state at time t minus 1, how will that affect ht? This is what this gradient is calculating. We need the derivative of tanh, because this is our activation function, and then we have Whh, which is the multiply here for converting the previous hidden state to the next one. So this is how we calculate the gradient, and here we can run into issues. If we're calculating the loss at each time step, then for the total loss we just sum it for each of the weights.
[01:02:11] So the total loss is just the sum of the loss at each time step with respect to this reused W matrix. And so you end up getting this product of terms: to calculate the loss at the final step, LT, with respect to ht, you need to calculate each of the intermediate hidden states and how each affects W, in order to calculate this final loss using the chain rule. We mentioned an example here, and just to point out why this is an issue: if we look at these individual terms, and hone in on how changing the current hidden state changes the next one, which is the majority of the calculations contained in this product term here,
[01:03:04] we get that it's the same thing we mentioned earlier, where you have this derivative of tanh multiplied by your Whh. And so why is this an issue? Well, first of all, this is the derivative of tanh plotted here. Its maximum value is one, so you're almost always getting less than one, and you can have vanishing gradients from this term alone. But even if we assume there's no nonlinearity, or we pick some activation function that doesn't have this issue, if we look at this weight matrix that we're multiplying at each time step, it can have a large singular value. A singular value tells you the maximum amount a unit vector coming in could be stretched by the matrix.
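The argument above can be made concrete with a small numeric sketch (random weights chosen for illustration, not from the lecture): accumulating the per-step Jacobian diag(1 − tanh²) · Whh over many time steps makes the gradient norm collapse whenever the largest singular value of Whh is below one.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8
Whh = rng.standard_normal((H, H))
# Rescale so the largest singular value is 0.9 (< 1): each backprop step
# can then only shrink the gradient.
Whh *= 0.9 / np.linalg.svd(Whh, compute_uv=False)[0]

h = rng.standard_normal(H)
J = np.eye(H)                       # accumulated Jacobian d h_t / d h_0
norms = []
for t in range(50):
    pre = Whh @ h                   # pre-activation (no input term, for simplicity)
    h = np.tanh(pre)
    # One step's Jacobian: diag(1 - tanh(pre)^2) @ Whh, chained onto J.
    J = np.diag(1.0 - np.tanh(pre) ** 2) @ Whh @ J
    norms.append(np.linalg.norm(J))

# The gradient norm decays geometrically with the number of time steps.
assert norms[-1] < 0.1 * norms[0]
```

The same code with the rescale factor set above 1 would show the opposite failure mode, exploding gradients.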
[01:03:59] So if it's very large, you can have these gradients explode, or if it's very small, you can have this vanishing gradients issue. And if you have exploding gradients, we have a fix, which is scaling the gradient: you can just divide it, or clip it somehow, so that too big a gradient is not too much of an issue. But this really small gradient, the vanishing gradient issue, is actually the main reason why people don't just use really long RNNs in practice: because of tanh, and because under many scenarios your weight matrix has this property where it's either expanding your activations or reducing them. So I think these are the main reasons that motivated a change in RNN architectures, and a lot of the reasons why people don't use RNNs. This is one of the main issues.
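The exploding-gradient fix mentioned here, rescaling (clipping) the gradient when its norm gets too big, is commonly implemented roughly as below. This is a minimal sketch; deep-learning frameworks ship equivalent norm-based clipping utilities, and the threshold of 5.0 is just an example value:

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so its L2 norm never exceeds max_norm.

    Small gradients pass through untouched, which is why clipping helps
    with exploding gradients but does nothing for vanishing ones.
    """
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])           # norm 50: an "exploding" gradient
clipped = clip_gradient(g, max_norm=5.0)
# Norm is now exactly 5.0 and the direction of g is preserved.
```

Note that clipping changes only the magnitude, never the direction, of the update, which is why it is considered a fairly benign fix.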
[01:04:50] So how do you resolve this? The way people did it was the creation of the LSTM, and the high-level idea, which I won't go into in too much detail because it's actually quite complicated, is that you have four different gates tracking different values. Instead of just having one hidden state, you have multiple values you precompute to determine how to change your hidden state, and also what information to pass through a different pathway. So you have the regular hidden state pathway, and you have a different pathway where it's easier to pass information. This is the basic idea. At a high level, there's what they call the gate gate: what are you actually writing to the hidden state of the cell? The input gate, which decides whether or not you write information to the cell.
[01:05:37] The forget gate: how much to forget from previous time steps. As well as the output gate, which is how much you actually output to your hidden state. So you can see this is really involved, a lot of design choices here, and they put it all together into this, I would say, fairly complicated diagram. But the basic idea is: this part is the same, where we're doing this weight multiply, but now we have four different values we're computing instead of just the ht. We have the input gate and the gate gate to determine how much to write here, and we have our output that's passed to the next hidden state. You can think of this top section here as a highway, where the goal is to not have any activation functions. So no tanh, and we avoid the issues we had where tanh made the gradients vanish. All we're applying is this forget gate.
[01:06:28] So as long as we're not basically forgetting all the information at each time step, we're able to pass information more easily. This is the high-level explanation, and, more importantly, in practice people saw that this worked very well. Again, you won't be implementing this for the course at all, but I think this is still a really commonly used baseline in some deep learning papers. So it's good to know about, but you can think of it through this lens: people are trying to construct these things to make up for the issues that RNNs had, which are vanishing gradients and also the lack of information being captured. You need to cram everything into this hidden state, right? So if you have really long-term dependencies, those are lost. So they created a separate pathway to pass this more long-term information over, through the top here.
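For reference, one LSTM step with the four gates described here (input i, forget f, output o, and the "gate gate" g) can be sketched as follows. This is the standard textbook formulation with arbitrarily chosen sizes, not course code; note how the cell state c is the "highway" pathway with no tanh applied along it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
H, D = 4, 3                                     # hidden / input sizes (arbitrary)
W = rng.standard_normal((4 * H, H + D)) * 0.1   # one big matrix -> four gates

def lstm_step(h_prev, c_prev, x):
    """One LSTM update. The cell state c is only scaled by the forget
    gate and added to -- no activation function squashes the pathway
    itself, which is what eases gradient flow across time steps."""
    z = W @ np.concatenate([h_prev, x])
    i = sigmoid(z[0*H:1*H])   # input gate: write to the cell?
    f = sigmoid(z[1*H:2*H])   # forget gate: how much old cell info to keep
    o = sigmoid(z[2*H:3*H])   # output gate: how much of the cell to reveal
    g = np.tanh(z[3*H:4*H])   # gate gate: candidate values to write
    c = f * c_prev + i * g    # highway update: no tanh on the c pathway
    h = o * np.tanh(c)        # hidden state actually output
    return h, c

h, c = np.zeros(H), np.zeros(H)
x = rng.standard_normal(D)
h, c = lstm_step(h, c, x)
```

As long as the forget gate f stays near 1, information in c survives many steps, which is the intuition the lecture gives for why LSTMs ease the vanishing-gradient problem without fully solving it.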
[01:07:14] So do LSTMs solve the vanishing gradient problem completely? It definitely helps. It makes it easier for the RNN to preserve information over many time steps by using this top pathway. In contrast, it's much harder for vanilla RNNs to learn a recurrent weight matrix that preserves info in the hidden state across every single time step, if we're always doing the same operation and we're not able to just pass information directly without an activation function. So it doesn't guarantee it, but it makes it significantly easier, helps improve learning of long-term dependencies, and works very well empirically. So people generally don't train plain RNNs so much; they'll more often train LSTMs if they're going to go with this recurrent modeling route.
[01:08:09] But I think in general these have also, I would say, significantly fallen out of fashion. Still, this gives you a sense of the way people have tried to design RNNs to account for the issues they face. So, one other thing that would be kind of cool to tie in to something you learned earlier in this course: this idea of directly adding outputs, and skipping some activation functions or other layers, is actually highly related to the idea we discussed with ResNets, where you have these skip connections where the value is just copied over and added later in the layer block.
[01:08:44] So in ResNets you have multiple of these convolution layers stacked together, and then you add a skip connection where the value just gets added here. And you can think of this in a similar light for LSTMs: it's skipping over some of these layers, and it helps, except in this case, instead of a very large depth of the model, it's very long sequences of time steps. So it's a parallel, but a little different, because one is the number of layers and the other is the number of time steps. Okay. I think the final slide for today's lecture is just a little tie-in for how these RNNs have made a bit of a resurgence in the last year or two, which is kind of funny, because I think if we had taught the course maybe a year or two ago, I would have been much more willing to cut RNNs entirely.
[01:09:33] But there are actually a lot of nice advantages they have. The main one is unlimited context length. One of the main issues with transformers is that they have a limited context length, and as people really push the boundaries of what these models are capable of, this context length is becoming more and more of an issue. There have been various workarounds in the transformer space; people do things like RoPE and some other techniques to try to extend the context length, but it's a pretty significant limitation of the model. The other thing is that during inference for RNNs (and during training too), the compute scales linearly with the sequence length: as you add more and more steps to your sequence, you just need to recompute the same operation over and over again.
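The scaling point can be seen from a rough cost count (an illustration of the asymptotics, not a benchmark): a recurrent model does a fixed amount of work per token, so total work grows linearly in sequence length T, while self-attention compares every token to every other token, so total work grows quadratically.

```python
def recurrent_cost(T, per_step=1):
    # One fixed-size hidden-state update per token -> O(T) total work.
    return T * per_step

def attention_cost(T, per_pair=1):
    # Every token attends to every token -> O(T^2) total work.
    return T * T * per_pair

# Growing the sequence 10x grows recurrent cost 10x, attention cost 100x.
assert recurrent_cost(10_000) == 10 * recurrent_cost(1_000)
assert attention_cost(10_000) == 100 * attention_cost(1_000)
```

This gap is exactly what the linear-time sequence models mentioned next (RWKV, Mamba) are built around.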
[01:10:16] So there's no operation that looks across the entire input sequence the way you have for transformers. These are really big advantages, and there have been a couple of papers. To shout out a few: there's this RWKV model, you can check out the arXiv link here, and also Mamba. Both mainly highlight this idea of achieving linear-time sequence modeling: as you scale up your input sequence, the compute also scales linearly, as opposed to quadratically with transformers, so it's better for long-context problems, and sometimes in terms of compute it works better. So people will try to get the best of both worlds, and there's been a lot of research in this area: how can you get the performance of transformers with the scaling of RNNs? Okay, so that's all for today in class.
[01:11:03] We basically talked about how there are a lot of different ways you can design architectures with RNNs. Vanilla RNNs are simple, but they don't work that well, and there have been more complex variants that people have proposed that introduce ways to selectively pass information. This backward flow of gradients in RNNs can either explode or vanish depending on the activation function you use and the properties of your weight matrix. You often need backpropagation through time to actually compute the gradient as well. And finally, these better architectures are a hot topic of research right now, as are new paradigms for reasoning over sequences in general. So yeah, I think that's it for today. Next time we'll talk about attention and transformers.
================================================================================ LECTURE 008 ================================================================================ Stanford CS231N | Spring 2025 | Lecture 8: Attention and Transformers Source: https://www.youtube.com/watch?v=RQowiOF_FvQ --- Transcript [00:00:04] All right, welcome back everyone to lecture 8. Today we're going to talk about attention and transformers, and I think this is a really fun one. As a quick recap, last time we were talking about recurrent neural networks. Recurrent neural networks were this new kind of neural network architecture meant for processing sequences, and in particular we saw how neural networks, by processing sequences, let us attack a whole set of different kinds of problems than we could with convolutional networks before. In particular, usually we had been thinking about these one-to-one problems, where you input one thing, like an image, and then output one thing, like a classification for what's in that image.
[00:00:40] But once you have the ability to move beyond images and towards sequences of data, it lets us tackle a lot of new kinds of problems: one-to-many problems like image captioning, where maybe we want to input an image and output a textual description of that image, which is going to be a sequence of words; many-to-one, where we input a sequence of frames and output a classification for those frames; and a bunch of other problems along this vein. So now we're seeing that moving into these more sophisticated neural network architectures is both more interesting architecturally and also lets us tackle new problems than we could with more traditional feed-forward neural networks. So today we're going to build on that and talk about two new things in today's lecture.
[00:01:19] The first is going to be attention, which is a brand new neural network primitive that fundamentally operates on sets of vectors. The second thing we're going to talk about is the transformer. The transformer is a neural network architecture that has self-attention at its core. And spoiler alert: transformers are basically the architecture that we use for almost all problems in deep learning today. Any of the largest applications that you're seeing out there in the wild today, whether it's classifying images, generating images, generating text, classifying text, or working with audio: basically any kind of large neural network today that is state-of-the-art, trained on a lot of data, and deployed by a big company, almost all of them are going to be transformers.
[00:02:02] So it's really exciting that we get to get you up to speed on the latest and greatest architectures that people are using now. But even though transformers are the state-of-the-art architecture that everyone is using for everything today, they have a relatively long history. It's kind of interesting watching these fields develop, because looking back on it, the moment that transformers came out feels like it ought to have been a big moment, a big sea change, this new architecture. But it actually didn't feel that way at the time, because even though there was one moment where the transformer architecture was born, these ideas around self-attention, around using attention in various ways, had already been around in the field for several years.
[00:02:42] In particular, these ideas around attention and self-attention actually developed out of recurrent neural networks, so we're going to start there to motivate them; this will roughly mirror the historical development of these ideas. For that reason, in order to introduce transformers, we're going to roll back and recap a little bit of the idea of recurrent neural networks that we saw in the last lecture. As a motivating problem, let's think about the sequence-to-sequence problem of translation. We want to input one sequence, which is a sequence of words in English, and output another sequence, which is a sequence of words in a different language, Italian.
[00:03:21] And we can't assume that there's any word-for-word correspondence between those sequences: the number of words in the English sentence might be different from the number of words in the Italian sentence, and the order of those words might be totally different. So this is a perfect application for the kind of sequence processing architectures that we saw with recurrent neural networks. Indeed, the idea of handling these sequence-to-sequence problems with recurrent neural networks goes all the way back to 2014, even a bit earlier than that, and people had been processing sequences with recurrent neural networks for more than a decade at that point.
[00:03:56] The basic architecture for processing sequence-to-sequence problems with recurrent neural networks is that typically you start with an encoder. Your encoder is a recurrent neural network. The recurrent neural network, recall, is a function that gets applied recursively on two inputs: x_t, your input at the current time step, and h_{t-1}, your hidden state at the previous time step. Your recurrent neural network unit will then spit out the next hidden state at the next time step. Then we can apply that same recurrent unit over time to process a sequence of potentially variable length. In this case, we're using a recurrent neural network encoder that inputs the input sequence in English.
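The recurrence just described can be sketched in a few lines of NumPy. The tanh nonlinearity and the names Wx, Wh, b are the usual vanilla-RNN conventions, chosen here for illustration rather than taken from the lecture's slides:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One tick of a vanilla RNN: combine the current input x_t
    with the previous hidden state h_prev to get the next state."""
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

# Toy sizes: 3-dim inputs, 5-dim hidden state (illustrative values).
rng = np.random.default_rng(0)
Wx = rng.normal(size=(3, 5))
Wh = rng.normal(size=(5, 5))
b = np.zeros(5)

h = np.zeros(5)                      # initial hidden state
for x_t in rng.normal(size=(4, 3)):  # a length-4 input sequence
    h = rnn_step(x_t, h, Wx, Wh, b)  # same unit applied at every step
```

Because the same unit is reapplied at each step, the loop works unchanged for any sequence length.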
[00:04:37] For the input sequence (you've got to use relatively short sentences to fit on slides and still have all the boxes show up), we're using a kind of short, silly sentence: "we see the sky." Each word in that sentence gets processed via one tick of the recurrent neural network. The idea of this encoder recurrent neural network is that it should process all of the words in the input sequence and somehow summarize the content of that input sentence so that we can translate it into our output target language. More concretely, after processing all the words in the input sequence, we're going to summarize the entire content of that input sequence into a single vector called the context vector.
[00:05:21] There are a couple of different ways that people would typically do this in recurrent neural networks, and I don't think the details are too interesting, so you can just think of that context vector as basically the last hidden state of the encoder recurrent neural network. The idea is that, because of the recurrent structure of our network, the last hidden state incorporates information from the entire input sequence. So we can think of that last hidden state as summarizing, or encoding, all of the information in the entire input sequence. That gives us one vector that summarizes the whole input sequence, to do whatever we want with.
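Under that simplification (context vector = last encoder hidden state), the encoder is just a loop over the input. A minimal sketch with toy dimensions of my own choosing:

```python
import numpy as np

def encode(xs, Wx, Wh, b):
    """Run a vanilla RNN over the input sequence and return the last
    hidden state, used here as the context vector c."""
    h = np.zeros(Wh.shape[0])
    for x_t in xs:
        h = np.tanh(x_t @ Wx + h @ Wh + b)
    return h  # one fixed-length summary, regardless of sequence length

rng = np.random.default_rng(0)
Wx, Wh, b = rng.normal(size=(3, 5)), rng.normal(size=(5, 5)), np.zeros(5)

short = rng.normal(size=(4, 3))   # four tokens, like "we see the sky"
long = rng.normal(size=(500, 3))  # a 500-token paragraph
# Both inputs collapse into a context vector of the same fixed size.
c_short, c_long = encode(short, Wx, Wh, b), encode(long, Wx, Wh, b)
```

Note that the output size is set by the hidden dimension alone; that fixed size is exactly the bottleneck the lecture turns to shortly.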
[00:05:56] In this case, what we want to do with it is translate that input sequence into an output sequence in a different language. To do that, we're going to use a second recurrent neural network called the decoder, which usually has the same architecture but potentially a different set of learned weight matrices. So the decoder is a different recurrent neural network with different learnable weights, but the same basic idea. This recurrent unit is going to take three inputs at every time step: y_{t-1}, the token of the output sequence at the previous time step; s_{t-1}, the previous decoder hidden state; and c, the context vector summarizing the entire input sequence.
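One simple way to wire up those three inputs is to give each its own weight matrix and sum the contributions before the nonlinearity. This is a sketch of one common choice, not the exact formulation on the slides; the names Wy, Ws, Wc are hypothetical:

```python
import numpy as np

def decoder_step(y_prev, s_prev, c, Wy, Ws, Wc, b):
    """One decoder tick: the previous output token embedding y_prev,
    the previous decoder state s_prev, and the context vector c all
    feed into the next decoder state. (One possible wiring.)"""
    return np.tanh(y_prev @ Wy + s_prev @ Ws + c @ Wc + b)

rng = np.random.default_rng(0)
d_tok, d_hid = 3, 5                 # toy embedding / hidden sizes
Wy = rng.normal(size=(d_tok, d_hid))
Ws = rng.normal(size=(d_hid, d_hid))
Wc = rng.normal(size=(d_hid, d_hid))
b = np.zeros(d_hid)

y0 = rng.normal(size=d_tok)         # embedding of the start token
s0 = np.zeros(d_hid)                # initial decoder state
c = rng.normal(size=d_hid)          # context vector from the encoder
s1 = decoder_step(y0, s0, c, Wy, Ws, Wc, b)
```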
[00:06:41] Then we unroll that output sequence just as we saw in the last lecture, and produce words one at a time in the output sequence. I don't speak Italian, so I'm not going to try to pronounce these, but there are some Italian words on the screen that you can see, and I'm assuming that it indeed translates to "we see the sky." Hopefully that's correct. The idea is that we're going to tick this recurrent neural network one tick at a time, and it's going to output words one at a time. This is basically a summary of what we saw last lecture, so it should not be too surprising. But there's a potential problem here: there's a communication bottleneck between the input sequence and the output sequence.
[00:07:25] The only way the input sequence communicates with the output sequence is via that context vector c. And c is a fixed-length vector, because its size is fixed when we set the size of our recurrent neural network. Maybe that's fine: c might be a fixed-length vector of, say, 128 floats or 1024 floats, but its size is not going to change as our input and output sequences grow or shrink. And that's a potential problem. For short sequences like "we see the sky," it seems pretty plausible that we can summarize everything we need to know about the sequence in a fixed vector of 1024 floats. But what if we're not trying to translate four words? What if we're trying to translate a whole paragraph, or a whole book, or an entire corpus of data?
[00:08:10] In that case, we're going to run into a bottleneck: at some point, as we scale up the input sequence, it's just not sensible to ask the network to summarize the entire input sequence into a single fixed-length vector. So that's going to be a problem. The solution is: let's not bottleneck the network through one fixed-length vector. Instead, let's change the architecture of our recurrent neural network. Intuitively, we don't want to force a fixed-length-vector bottleneck between the input and the output. Instead, as we process the output sequence, we're going to give the model the ability to look back at the input sequence: every time it produces an output vector, the network gets the opportunity to look back at the entire input sequence.
[00:08:56] If we do this, there's no bottleneck, it will scale to much longer sequences, and hopefully the model architecture will work much better. That's the motivating idea that led to attention and transformers and all this great stuff that we see today. One way of telling the story is that it all came from trying to solve this bottleneck problem in recurrent neural networks. So let's see how we can actually implement this intuition and endow our recurrent neural network with the ability to look back at the input sequence on every time step. We start with the same setup: our encoder neural network remains the same, no changes there. We still need to set some initial hidden state for the output sequence, so we set some initial decoder state s_0 in some way.
[00:09:40] But now, once we have that decoder hidden state, we're going to look back at the input sequence. The way we do that is by computing alignment scores: a scalar value for each step in the input sequence that says how much that initial decoder state s_0 matches each token of the input sequence. In this case there were four tokens in the input sequence, so we want to compute four alignment scores, each of which is just a single number giving the similarity between a token of the input sequence and this initial decoder state s_0.
[00:10:24] Now, there are a lot of ways we could implement alignment scores, but a simple one is to use a simple linear layer that we're calling f_att. That linear layer concatenates the decoder hidden state s with one of the encoder hidden states h into a single vector, then applies a linear transform that squashes it down into a scalar. That's just a linear operator that can be put into a computational graph and learned jointly via gradient descent, just like all the other parameters of the network. So at this point we've got a scalar alignment score for each step in the input sequence. Now we want to apply a softmax function, because these scalar alignment scores are totally unbounded.
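A sketch of that scoring function: concatenate the two states, then one linear map down to a scalar. The weight vector w and bias b are hypothetical stand-ins for the learned parameters of f_att, and the state values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
H = rng.normal(size=(4, d))  # encoder hidden states h1..h4 (toy values)
s0 = rng.normal(size=d)      # initial decoder state

# Hypothetical learned parameters of the alignment layer.
w = rng.normal(size=2 * d)
b = 0.0

def f_att(s, h):
    """Alignment score: concatenate [s, h], then a linear map to a scalar."""
    return np.concatenate([s, h]) @ w + b

e = np.array([f_att(s0, h) for h in H])  # one alignment score per input token
```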
[00:11:04] They're arbitrary real values from minus infinity to infinity, and we want to put some structure on them to prevent things from blowing up. One way to do this is to apply a softmax function. We've got four scalar values telling us the alignment of that decoder hidden state with each of the encoder hidden states; now we apply a softmax over those four values to give us a distribution over them. Remember, the softmax function that we saw a few lectures ago takes a vector of arbitrary scores and converts it into a probability distribution, which means each entry of the output will be between 0 and 1, and they will sum to one.
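A minimal NumPy softmax makes those two properties concrete; the max-subtraction is a standard numerical-stability trick, not something the lecture dwells on:

```python
import numpy as np

def softmax(e):
    e = e - e.max()     # subtract the max for numerical stability
    p = np.exp(e)
    return p / p.sum()  # normalize so the entries sum to one

scores = np.array([2.0, -1.0, 0.5, 3.0])  # unbounded alignment scores
a = softmax(scores)
# Every entry of a lies in [0, 1], and the entries sum to 1.
```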
[00:11:43] So whenever we run a vector through a softmax, we can think of what we get out as a discrete probability distribution over those input scores. At this point, after we take those alignment scores and run them through a softmax, what we've essentially done is predict a distribution over the input tokens given that decoder hidden state. Now we want to take that distribution over the input tokens and use it to compute a vector summarizing the information in the encoder. The way we do that is to take our attention scores, which recall are these numbers a_{1,1}, a_{1,2}, a_{1,3}, a_{1,4}, all between 0 and 1 and summing to one, and take a linear combination of the encoder hidden states h_1, h_2, h_3, h_4.
[00:12:35] We take a linear combination of those encoder hidden states weighted by our attention scores. This gives us a new context vector that we're calling c_1, shown here in purple, which summarizes the information in the encoder sequence in a way that's modulated by those attention weights. At this point, c_1 is some linear combination of the encoder states h_1 to h_4, and things look basically the same as in the non-attention case: we have our context vector, we concatenate it with the first token of the output sequence y_0, and we pass that to our recurrent unit to get both the next hidden state of the decoder recurrent neural network and the first output token from the decoder.
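The weighted sum itself is one line of NumPy. Both the encoder states and the attention weights below are made-up illustrative values, with the weights concentrated on the first two states:

```python
import numpy as np

# Four toy encoder hidden states h1..h4 as rows (5-dim each);
# values are random placeholders, not from the lecture.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 5))

# Attention weights out of the softmax: nonnegative, summing to one.
# These particular numbers are made up for illustration.
a = np.array([0.45, 0.40, 0.10, 0.05])

c1 = a @ H  # context vector: linear combination of encoder states
```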
[00:13:27] So basically the structure of that decoder RNN did not change; all we did is compute the context vector in a different way, using this attention linear combination mechanism. But now, crucially, the intuition is that this context vector attends to, or looks at, different parts of the input sequence, modulated by whatever the output RNN wants to look at at this moment in time. For example, the input sequence has these two words, 'we see'. In trying to produce the one word in Italian that corresponds to 'we see', the network probably wants to go back and look at those two words in the input sequence in order to know what output word to produce.
[00:14:14] So we might intuitively expect that when trying to produce the word 'vediamo', the network will want to look back at the words 'we see' and put higher attention weights on those, and it doesn't really care about 'the sky' because those words are not necessary for producing the 'vediamo' output. That's the intuition: we're giving the network the ability to look back at the relevant parts of the input sequence for the word it's trying to predict at this moment in time. The other thing to keep in mind is that this is all differentiable. We don't need to supervise the network; we don't need to tell it which words in the input sequence were required for each word in the output. Instead, this is just a big computational graph composed of differentiable operations.
[00:14:58] All of this can be learned end to end via gradient descent. At the end of the day, we still have this cross-entropy softmax loss where the network is trying to predict the tokens of the output sequence, and in the process of trying to predict the right tokens, it's going to learn for itself how to attend to different parts of the input sequence. That's really critical, right? If we had to go in and supervise the network with the alignment between the two, it would be very difficult to get training data for this kind of thing. The question is: how do we initialize the decoder? We've got to be careful; we're using the word 'initialize' in a slightly overloaded way here. One sense is that the decoder is itself a neural network that has weights.
[00:15:37] When we start training that network, we need to initialize those weights in some way, so we will typically initialize the weights of the decoder randomly and then optimize them via gradient descent, just as we do with any other neural network weights. But there's a second notion of initialization: when the network is processing a sequence, whatever the current values of the weights are, we need some way to set the initial hidden state at the time we start processing an output sequence. In that case, we need some rule for setting that initial hidden state of the decoder. There are a couple of different mechanisms for this. One thing you'll sometimes do is initialize it as the last hidden state of the encoder.
[00:16:20] You might have a linear transform, some learned projection from the last encoder state to the first decoder state. Or sometimes people will even initialize the first hidden state of the decoder to be all zeros. Any of those will work, as long as you train the network to expect that kind of input. The question is: negations and XORs, would this cause a problem? Maybe. This is a hard problem, and you need a lot of data and a lot of flops to hope the network can disentangle it. But basically, the recurrent unit takes three things as input.
[00:16:52] In the decoder, it takes the previous decoder hidden state, the current context vector, and the current token in the output sequence; from those we produce the next hidden state, and from the next hidden state we go and predict the output token. That's actually the same setup as in the non-attention case. There's an implicit connection from s0 to s1 that we're not drawing; there should have been another arrow from s0 to s1, and I think I just dropped it, so sorry about that. We're basically letting the network decide for itself to look back at any part of the input sequence that it thinks might be relevant for the task at hand.
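A minimal sketch of one decoder tick with those three inputs, assuming a plain tanh recurrence; the weight names and sizes are illustrative, and a real model might use an LSTM or GRU cell instead.

```python
import numpy as np

def decoder_step(s_prev, c_t, y_prev, Wh, Wc, Wy, Wout):
    """One tick of an attention-equipped decoder RNN.

    Inputs: previous decoder hidden state, current context vector,
    and the embedding of the previous output token. Returns the next
    hidden state and logits over the output vocabulary.
    """
    s_next = np.tanh(Wh @ s_prev + Wc @ c_t + Wy @ y_prev)
    logits = Wout @ s_next  # scores used to predict the output token
    return s_next, logits

rng = np.random.default_rng(1)
D, C, E, V = 4, 4, 3, 5  # hidden, context, embedding, vocab sizes
Wh = 0.1 * rng.standard_normal((D, D))
Wc = 0.1 * rng.standard_normal((D, C))
Wy = 0.1 * rng.standard_normal((D, E))
Wout = 0.1 * rng.standard_normal((V, D))

s0 = np.zeros(D)  # one of the initialization choices mentioned above
s1, logits = decoder_step(s0, rng.standard_normal(C),
                          rng.standard_normal(E), Wh, Wc, Wy, Wout)
```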
[00:17:30] The reason we think this mechanism is plausible and might be helpful for the network is that we know that in a language task there often is some kind of correspondence between words in the output and words in the input, and we want to let the network look back and pick out the relevant bits of the input for producing this bit of the output. But again, we're not directly supervising it; we're not telling it how to use these attention scores. The intuition is just that we think that's a plausible thing it might choose to do, given this mechanism. Okay, so that's sort of one tick of the output. And now basically we do it again: we do this whole process again every time we tick the decoder RNN, right? Remember, the problem we were trying to solve is that previously the decoder was bottlenecking through a single vector.
[00:18:13] Now, instead of bottlenecking through a single vector, we're going to repeat this whole process and compute a new context vector for the second time step of the decoder, letting it go back and look at the whole input sequence yet again. So now, given our s1, which is the first computed hidden state in the decoder, we take s1, go back, and use our attention mechanism to compute similarity scores between s1 and all of the hidden states in the encoder. We compute those similarity scores using the exact same linear projection that we used at the first time step.
[00:18:52] We compute these alignment scores again, cram them through a softmax to get a new distribution over the input sequence for the second decoder time step, and now compute a new linear combination of the encoder hidden states, weighted by this new distribution computed at the second time step. This gives us a new context vector c2, which is a different summarization of the input sequence, now computed as a new linear combination of the input encoder hidden states. And then the whole thing iterates, right? We have a new context vector; we use it to run another tick of our decoder RNN unit, which now does include that mysterious missing arrow that wasn't there on the previous time step.
[00:19:37] So then, given our new context vector, given the next token of the output sequence, and given the s1 hidden state of the decoder, we compute a new decoder state s2, and from that compute another token of the output sequence. And again, remember that in this case it's producing 'il', which maybe is 'the' according to the slide; I hope that's true. In this case there's maybe a one-to-one correspondence between the word the network is trying to produce for this sequence and one of the words in the input. So we might expect that the network should put relatively high attention weight on just one of the words in the input sequence and relatively low attention weight on all the other words. But again, we don't supervise this.
[00:20:18] The network is deciding for itself how to make use of this mechanism, all driven by gradient descent on our training task. And we're just going to repeat that whole process for every tick of the decoder RNN. So now this basically solves our problem, right? We are no longer bottlenecking the input sequence through a single fixed-length vector. Instead, we have this new mechanism where at every time step of the decoder, the network looks back at the entire input sequence, re-summarizes it to generate a new context vector on the fly for this one time step of the decoder, and then uses that to produce the outputs. This is a pretty cool mechanism, and it's called attention because the network is attending to, or looking at, different parts of the input sequence at every moment in its output.
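Putting the pieces together, the repeated per-step attention can be sketched as a loop. This is a simplified toy under stated assumptions (the token input to the recurrence is omitted, and all sizes and weights are made up), not the exact model from the slides.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(2)
T_enc, T_dec, D = 4, 3, 5
H = rng.standard_normal((T_enc, D))        # encoder states h1..h4
W_att = 0.1 * rng.standard_normal((D, D))  # shared alignment projection
W_rec = 0.1 * rng.standard_normal((D, 2 * D))

s = np.zeros(D)   # initial decoder state s0
attn_rows = []    # one attention distribution per decoder tick

for t in range(T_dec):
    # Fresh alignment scores at every step: compare the current decoder
    # state against every encoder hidden state via the same projection.
    scores = H @ (W_att @ s)
    a = softmax(scores)        # distribution over the input tokens
    c = a @ H                  # new context vector for this step
    # Simplified recurrence: next state from previous state + context.
    s = np.tanh(W_rec @ np.concatenate([s, c]))
    attn_rows.append(a)

attn_matrix = np.stack(attn_rows)  # shape (T_dec, T_enc)
```

Each row of `attn_matrix` is a fresh summary of the input sequence, so no single fixed-length vector has to carry everything.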
[00:21:04] So we talked about these attention weights, and we said that the network was learning for itself how to set them based on its training data and its training task. Another really cool thing about attention is that it also gives us a way to introspect and see what the network is looking at as it tries to solve this problem. We never told it what the alignment was between the input sequence and the output sequence, but by looking at the attention weights the network predicts when trying to solve this task, we get a sense of what it was looking at while trying to solve the problem. That gives us a way to interpret the processing of the neural network in some way.
[00:21:45] One thing we can do is go and look at, in the process of processing a particular sequence, the attention weights the network predicted when trying to do this task, and we can visualize these in a two-dimensional grid. Here we're looking at an example of English-to-French translation. Across the top we have our input sequence: 'The agreement on the European Economic Area was signed in August 1992.' Running down the rows is the output sequence, which is in French, which I will not attempt to pronounce. Remember, the way this attention mechanism worked is that each time the network produced one of these words in the output sequence, it predicted a probability distribution over the entire input sequence.
[00:22:32] We visualize that in the first row. If you look at the first row of this matrix, we're visualizing the predicted probability distribution over the entire input English sentence. We see that when trying to predict the first word of the French sentence, it puts a lot of probability mass on the English word 'the' and basically no probability mass on any of the other words. Then, when predicting the second word of the output sequence, it goes back and predicts a new distribution over the entire input sequence, and that's the second row in this matrix. You can see that for 'accord', it puts a lot of probability mass on 'agreement' and no probability mass anywhere else.
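That kind of grid can be inspected programmatically too. The matrix below is invented to mimic the pattern just described (it is not data from the paper); each row is the predicted distribution over the input words for one output word.

```python
import numpy as np

# Rows: output (French) words; columns: input (English) words.
# Hypothetical attention weights, each row summing to one.
attn = np.array([
    [0.90, 0.05, 0.03, 0.02],  # output word 1 attends to input word 1
    [0.04, 0.88, 0.05, 0.03],  # output word 2 attends to input word 2
    [0.03, 0.05, 0.07, 0.85],  # reordering: attends to a later word
    [0.05, 0.04, 0.86, 0.05],
])

# For each output word, the input word with the most attention mass.
alignment = attn.argmax(axis=1)
# Runs where `alignment` increases by one give diagonal structure
# (one-to-one, in-order correspondence); inversions, like the last two
# rows here, reveal word-order differences between the languages.
```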
[00:23:11] So that gives us some sense that the network actually did figure out the alignment between the input words and the output words when doing this translation task. And some interesting patterns pop up when we see diagonal structures in this attention matrix: that means there was a one-to-one, in-order correspondence between the input sequence and the output sequence. In particular, we see that 'The agreement on the', the first four words of the input sequence, correspond to this diagonal structure in the attention matrix. That means the network has decided for itself that these first four words of the input sequence align, or match up, with the first four words of the output sequence. And the same thing holds for the last several words.
[00:23:55] Again, we see this diagonal structure at the end of the sequence, which means that 'in August 1992' corresponds to these last couple of words in the French sequence; again, there's a one-to-one correspondence between words in the output and words in the input. But we see some other interesting stuff in the middle. In the middle we have 'European Economic Area', but in the French we see words that look kind of like those, in a slightly different order. Good question: how does it figure out the grammar? That's the mystery of deep learning. Basically, we didn't tell the network anything about grammar. We supervised it with a lot of input-output pairs: here's an input sequence in English, here's an output sequence in French.
[00:24:38] Here's a mechanism for processing this; learn via gradient descent to set the weights of this architecture so as to produce this output from this input. We never told it anything about grammar. But because we as human designers have the intuition that there ought to be some correspondence between some of the words, we bake in a mechanism that we think might be helpful for solving this problem, and the network figures out for itself, in the process of doing the end-to-end task, how to make use of that mechanism to solve the problem we set for it. It's pretty amazing that it works. But in this case, it kind of figured out some of the grammar for itself.
[00:25:19] We see this non-diagonal, sort of backward-diagonal structure in the attention matrix here, which means the network figured out for itself the different word order between English and French. Or, in the middle, you see a little 2x2 grid, and that corresponds to a situation where there might not have been a one-to-one correspondence between the English words and the French words; there might have been two French words that corresponded to two English words, and they didn't disentangle perfectly. The network just figures all of this out for itself over the process of training on a lot of data and putting a lot of compute through it, and that's pretty cool.
[00:25:55] Okay, so that's the basic idea, and this actually was the initial usage of attention in machine learning; it came from these machine translation problems. This was from a paper back in 2015, "Neural Machine Translation by Jointly Learning to Align and Translate." That paper just won the runner-up Test of Time award at ICLR 2025, so that's pretty cool; this has been a really impactful paper over time. But it turns out that there's actually a more general idea here, a more general operator hiding here. We approached this problem from the perspective of trying to fix our recurrent neural networks.
[00:26:35] But it turns out that the mechanism we used to fix the recurrent neural networks is actually something general, interesting, and really powerful in its own right. So now we want to pull out this idea of attention and divorce it from the recurrent neural networks. It turns out that attention will be a very useful and powerful computational primitive for neural networks in its own right, even if we then cut away the recurrent neural network part and are just left with attention as the core primitive in our architecture; that's where we're going. So now what we want to do is take this idea of attention as we saw it in recurrent neural networks, try to generalize it, and carve out an independent operator that can be used on its own. So let's think about what this attention mechanism was doing.
[00:27:20] Basically, this attention mechanism had a bunch of query vectors. Well, maybe it makes sense to talk about these in the other order. There are data vectors, which are data that we want to summarize; these are the hidden states of the encoder RNN. We have this input sequence, and we've summarized it into a sequence of vectors, and that sequence of vectors is the data we think is relevant for the problem we're trying to solve. Now, in the process of making use of that data, we want to produce a bunch of outputs, and for each output we have a query vector: a vector that we're using to produce some piece of output. In this case, the query vectors are the hidden states of the decoder RNN.
[00:28:05] And we have this property that for each query vector, we want to go back, look at the data vectors, and summarize the information in the data vectors into a context vector. Okay, from the perspective of attention this gets a little bit weird. The outputs of the attention operator are the context vectors we just talked about for the RNN. So if we're thinking about just what the attention operator does: its outputs were the context vectors that we feed into the RNN. Then what is the attention operator doing? It is taking a query vector, going back to the input data vectors, and summarizing the data vectors in some new way to produce an output vector. That's what the attention operator is doing.
[00:28:49] Does that kind of make sense as a generalization of the attention mechanism we just saw? Yeah. I'll repeat it again, because it's tricky; there's a lot of stuff flying around here, a lot of boxes, and we're changing the words we use to define the boxes, so I get it, there's a lot happening. What the attention operator is doing: there's a bunch of data vectors, which are the encoder hidden states. Then we have a bunch of query vectors, which are the things we're trying to produce output for. Now, in the process of processing a query vector, we go back to the data vectors, summarize them in a new, custom way for each query vector, and that produces an output vector, which is the context to be fed into the next tick of the RNN. Right?
[00:29:34] So our query vectors are these guys in green. For each query vector, we go back to the data vectors, summarize them, and produce a new output vector, which is one of the contexts that we then feed into the rest of the network. This is kind of tricky, because we're trying to go into this architecture and carefully cut the attention part out from the RNN. So we're going to walk through this again from the perspective of just the attention operator. From that perspective, we start with just one query vector at first, which is one of the states in our RNN. We also have a bunch of data vectors, which are the encoder hidden states in the RNN.
[00:30:15] Now, the computation we want to perform: first, compute similarities between that query vector and all of the data vectors. This is the exact same thing we just saw, just written in a different way. We use this f_att function to compute similarity scores between each data vector and our one query vector. Then, once we have those similarities, we squash them through a softmax to get attention weights, and this will be a distribution over the data vectors that has been computed on the fly for this one query vector. Then we want to produce an output vector, and this output vector is a linear combination of our data vectors, where the linear combination weights are the attention scores we just computed. So this is the output of the attention layer.
[00:30:59] And then, in the context of the larger RNN that we saw, the output of the attention layer, or the attention operator, will become an input to the next tick of the decoder RNN. But we're trying to deprecate the RNN, so we don't want to talk about that; we just want to focus on the computation happening inside the attention layer. So this is basically the operator we saw in the RNN, right? We did this process over and over: take a query vector, use it to compute similarity scores, get attention weights, get an output vector. Then we got a new query vector. Where did that query vector come from? The attention operator doesn't care. Get a new query vector, go back, summarize the data vectors, get a new output vector. That's the core of the attention operator.
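The single-query procedure just described (similarity scores, softmax, weighted sum) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the course's reference code; here a plain dot product stands in for the general f_att similarity function, and all names and numbers are made up for the example.

```python
import numpy as np

def softmax(s):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(s - s.max())
    return e / e.sum()

def attend_one_query(q, X):
    """One query vector q (shape (d,)) attends over data vectors X (shape (n, d))."""
    scores = X @ q             # similarity of q with each data vector, shape (n,)
    weights = softmax(scores)  # attention weights: a distribution over the n data vectors
    return weights @ X         # output: linear combination of the data vectors, shape (d,)

q = np.array([1.0, 0.0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = attend_one_query(q, X)   # one output vector for this one query
```

Note that a new query just means calling `attend_one_query` again with the same data vectors; the operator doesn't care where the query came from.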
[00:31:40] So now let's try to generalize this and make it an even more powerful computational primitive. Yeah, so in principle this f_att doesn't have to be anything in particular; it could be any function of two vectors that outputs a scalar. But in practice we're actually going to make it simpler in a couple of slides. In principle, yes, you could slot in any function you wanted there. Okay, so the first generalization we're going to do is actually the opposite of what you just suggested: make that similarity function simpler. We said that in principle it can be any function that takes two vectors and gives a similarity score. What's the simplest possible function that inputs two vectors and gives us a scalar similarity score? It's a dot product. So we want to make things simpler and also generalize at the same time.
[00:32:23] And it turns out that a dot product is a good enough similarity score to be used for this purpose. So the first thing we're going to do is use only dot products to compute similarity. But it turns out there's a slight problem with dot products, and this one is kind of subtle, because there's a weird interaction between the dot product and the softmax. It has to do with what happens when the dimension of those vectors scales up or down. The motivating example: if you scale up the dimension of the vector, say a constant vector of all ones of dimension 10 versus a constant vector of all ones of dimension 100, then as we go to the higher-dimensional vector, when we compute the sum inside that softmax, we're going to be dividing by a larger number.
[00:33:10] So we'll end up with more squashed probability scores as we go to higher-dimensional vectors. That can lead to vanishing gradients, as we saw in the previous lecture, and prevent this whole thing from learning. So, as kind of a slight hack to prevent that, and to make this architecture more generalizably scalable up and down to vectors of different dimensions, what we're going to do is not use the pure dot product, but scale the dot product down by the square root of the dimension of the vectors we're looking at. This is just a way to prevent vanishing gradients and give nicer gradient flow through the softmax for a wider range of vector dimensions.
[00:33:49] And this turns out to be very important, because as we make these networks bigger and bigger over time, we want higher-dimensional vectors, because that gives us more compute, more capacity. So we always want to think about how our architectures will scale as the parts of those architectures get bigger and bigger. So this scaled dot product is actually really important for preventing vanishing gradients here. Yeah, the question was whether we're limited to data and query vectors of the same size; we'll actually fix that. So our first generalization was to use scaled dot-product similarity as our similarity measure. Now, if we go back and look at the shapes of these things, we have one query vector of dimension d_q, and we have data vectors of shape n_x by d_q as well; because it's a dot product, they need to match.
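The squashing effect can be seen numerically with a small sketch (made-up numbers, purely for illustration): with random vectors, the raw dot product grows in magnitude with the dimension d, so the softmax saturates toward putting nearly all its weight on one data vector, while dividing by sqrt(d) keeps the scores, and hence the softmax, in a comparable range across dimensions.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())   # shift by the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
for d in (10, 100, 1000):
    q = rng.standard_normal(d)
    X = rng.standard_normal((5, d))   # five random data vectors
    raw = X @ q                       # raw dot products: std grows like sqrt(d)
    scaled = raw / np.sqrt(d)         # scaled dot products: std stays around 1
    print(d, softmax(raw).max(), softmax(scaled).max())
```

As d grows, the largest raw-softmax probability drifts toward 1 (an almost one-hot distribution, where gradients through the softmax vanish), while the scaled version stays spread out.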
[00:34:30] But there's actually a next generalization we're going to do: have multiple query vectors. Maybe we don't want to process just one query vector at a time; we want the ability to process a whole set of query vectors all at once. This kind of happened in the RNN: we did end up with a bunch of query vectors. And it's useful for the attention operator to be able to process not one query vector at a time, but a whole set of query vectors in parallel, performing the exact same computation for each of them. So in this case, we've now generalized: Q is a matrix of shape n_q by d_q, so we have n_q query vectors, each of dimension d_q. Our data vectors are a matrix of shape n_x by d_q.
[00:35:18] And now the computation changes a little bit, because when we compute these alignment scores, these similarities, we basically want to compute all pairs of similarities between all of the input data vectors and all of the input query vectors. And each one of those similarities is a dot product; well, a scaled dot product. So what's a very efficient, easy, and natural way to compute dot products between two sets of input vectors? That turns out to be exactly a matrix multiply. Because remember, when you do a matrix multiply, each entry in the output matrix is the inner product of one of the rows of your first matrix and one of the columns of your second matrix.
[00:36:04] So by computing a matrix multiply between our query vectors Q and our data vectors X (you need to get a transpose in there to make the rows and columns match up in the right way), we can compute all the similarities between all the data vectors and all the query vectors in one simple matrix multiply. Now, we still need to compute the attention weights. Remember, for each query vector we want to compute a distribution over the data vectors. Our similarity scores are no longer a single vector of scores; they're a matrix of scores giving all the similarities. But we still want to compute a distribution over the data vectors for each query vector independently.
[00:36:45] So now we need to compute the softmax over just one of the axes of that matrix of similarity scores. This is basically the exact same computation we just saw; we're just doing it in parallel for a set of query vectors all at once. Now we need to compute the output vectors. Remember, the output vectors are going to be a weighted combination of the data vectors, where those weights are the values in the softmax. And it turns out that this is also something matrix multiply does. Another way to think about a matrix multiply of two matrices is that it takes a linear combination of, oh man, am I going to get the rows and the columns the right way?
[00:37:28] But I think you get it: a matrix multiply takes linear combinations of the rows of one of your input matrices, weighted by the values in the other input matrix. So that's another interpretation of matrix multiplication. If you work through the indices and draw some little pictures to prove to yourself what's going on: what we now want is to compute many linear combinations of the data vectors, where each linear combination is given by the probabilities in one of the rows of the attention matrix. We can compute all of these at once with another matrix multiply, between the attention matrix A and the data vectors X. And again, you need to get the transposes in the right order to make this work out.
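The two matrix multiplies described here, one computing all pairwise similarities and one taking the weighted combinations, can be sketched as follows. This is an illustrative sketch with made-up shapes, not the course's reference implementation:

```python
import numpy as np

def attention(Q, X):
    """Scaled dot-product attention for a set of queries.

    Q: (n_q, d) query vectors; X: (n_x, d) data vectors.
    Returns (n_q, d): one output vector per query.
    """
    d = Q.shape[1]
    S = Q @ X.T / np.sqrt(d)                 # (n_q, n_x): all pairwise scaled similarities
    S = S - S.max(axis=1, keepdims=True)     # stabilize the softmax
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)     # softmax over the data axis, per query
    return A @ X                             # weighted combinations of the data vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # four queries of dimension 8
X = rng.standard_normal((6, 8))   # six data vectors
Y = attention(Q, X)               # shape (4, 8)
```

Each row of A is a distribution over the six data vectors, and each row of Y is the corresponding weighted average, so the whole set of queries is processed with just two matrix multiplies.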
[00:38:08] But basically, this is the exact same operation that we just saw; we're now doing it for a set of query vectors all at once, and it turns out we can do it all with just a couple of matrix multiplies. The next way we'll generalize this: notice that in this equation, the data vectors X actually enter in two different places in the computation. The first place we use the data vectors X is to compute similarities with the query vectors. In that computation we're saying: hey, data vector, how much do you line up with each query vector, as measured by an inner product? But then we're also using the data vectors again to compute the output vectors.
[00:38:50] The output vectors are now a linear combination of the data vectors, weighted by our attention weights. And it maybe seems a little bit weird to reuse the data vectors in those two different contexts. So now what we want to do is separate those two usages of the data vectors, and let the network figure out for itself two different ways to use the data vectors in those two contexts. To do that, we'll introduce this idea of keys and values. We had a set of data vectors, and now, for each data vector, we're going to project it into two vectors: one is a key vector, one is a value vector.
[00:39:32] The idea is that the key vectors are going to be compared with the query vectors to compute the alignment scores, and the value vectors are what we're going to compute linear combinations of in order to compute the output from the layer. The way we implement this is to add two learnable weight matrices, the key matrix and the value matrix, which are linear projections that project the data vectors into key vectors and value vectors. Remember, we have N data vectors, each of dimension D_X. The key matrix is a linear transformation that projects from D_X into D_Q, because we're going to compare the key vectors with the query vectors, so they need to have the same dimension as the query vectors.
[00:40:17] So applying the matrix multiply K = X W_K projects each data vector into a key vector of dimension D_Q. Then we'll separately have another weight matrix that projects from D_X to D_V, the dimension of the value vectors, which in principle could be different from the query vector dimension. We separately project each data vector into a value vector, again with a matrix multiply. And the intuition here is that it's kind of like a search engine: you want to separate what you're looking for from the answer you want in response to that query, right?
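The key and value projections described here can be sketched by extending the same NumPy setup. Again a minimal sketch: the dimensions (dx, dq, dv) and names are illustrative, and dv is deliberately chosen different from dq to show they are independent:

```python
import numpy as np

def softmax(s, axis=-1):
    # numerically stable softmax along one axis
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n, m = 5, 2                     # n data vectors, m query vectors
dx, dq, dv = 4, 3, 6            # value dimension dv may differ from query dimension dq
X = rng.normal(size=(n, dx))    # data vectors
Q = rng.normal(size=(m, dq))    # query vectors

Wk = rng.normal(size=(dx, dq))  # learnable key matrix: projects dx -> dq
Wv = rng.normal(size=(dx, dv))  # learnable value matrix: projects dx -> dv

K = X @ Wk                      # (n, dq) keys: compared against the queries
V = X @ Wv                      # (n, dv) values: linearly combined for the output
A = softmax(Q @ K.T, axis=1)    # (m, n) attention weights, one distribution per query
Y = A @ V                       # (m, dv) output vectors
```

Note that the keys must share the query dimension dq (for the inner products), while the values only need to agree with the output dimension.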
[00:40:56] So you go to Google, or these days ChatGPT, and you type in something like "what is the best school in the world." That's your query, and it needs to be matched against the keys in the back end. But the value, the data you want to get back, is actually different from the query you typed in. Your query has to go match against all the different strings on the internet, and then the value you want back is "Stanford," which is a different value from the query you put in. So that's another intuition for separating the queries, the keys, and the values in this way. The query is what I'm looking for.
[00:41:36] The key: in the back end we have some record of all the data in the data vectors, but when we query, we want to match against potentially just part of the data vector, and the thing we want to get back from the data vector is the value. So we're separating the usage of the data vectors into those two different notions of keys and values. Then we can visualize this in a different way. Now we're finally throwing away the RNN and looking at attention just as an operator on its own. So we can step through this operation again. We've got our query vectors coming in, and we've got our data vectors coming in. From the data vectors, we project each data vector into a key and a value. Then we compare each key with each query to get our similarity scores. Right?
[00:42:21] So this is a matrix of scalars giving the similarities between each key and each query. Once we have this matrix of similarity scores, we want to compute, for each query, a distribution over the data vectors. That means we need to run softmax over this matrix of alignment scores; we compute the softmax over each row. Then what we want to do is reweight the value vectors by the attention scores in the softmax. Oh, actually, sorry: we want each column to be a distribution, because for each query we want a distribution over the keys, which means we want softmax over the columns, the way the picture is aligned here.
[00:43:10] So then, for query one, we've predicted this distribution over all of the keys from this computation. Then we take a linear combination of the value vectors, weighted by these attention weights, to produce our first output vector y1. And the same thing happens over here: our second query got compared with all the keys, we computed a distribution over those alignment scores to get a distribution over the keys for the second query, and then we use those weights to linearly combine the values to produce our second output vector. So this is now the attention operator standing on its own, divorced from the recurrent neural network. The question is: how do you divide the data vector into keys and values?
[00:43:52] The beautiful part is that we don't have to say how. We just give the neural network the capacity to split it by itself, by giving it this mechanism to project separately into keys and values, but we're not going to tell it how to do it. The key matrix and the value matrix are just going to be learnable parameters of the model that will be learned via gradient descent along with everything else. Just as we did not tell it how to align the English and the French sentences, and all of that was learned via gradient descent, the model will learn for itself how to separately project into keys and values in a way that's helpful for the problem it's trying to solve.
[00:44:29] You might think of the keys and values as some kind of filter. The data vector might have a lot of stuff in it, but for the task at hand we might want to filter the data vector in various ways: only try to match our queries against part of it, and only care about retrieving information from a different part of it. So you could think of those as filtering the information in the data vector in two different ways. Okay, so this is basically our attention operator. And there's no RNN here; this is just a neural network layer that could stand on its own, right? It receives two inputs, the query vectors and the data vectors. It has two sets of learnable parameters, the key matrix and the value matrix. It inputs two sequences of vectors and outputs a sequence of vectors.
[00:45:10] So this is a neural network layer in its own right that you could start to plug into your neural network architectures in various places. This is sometimes called a cross-attention layer, because it has two sets of inputs coming in: we have both data vectors and query vectors, potentially coming from two different sources. And this is sometimes useful: I have a set of queries, and for each query I want to go and summarize information from my data, which is potentially a different number of vectors, or totally different in kind, from my query vectors. So it's called a cross-attention layer because we're cross-attending between two different sets of things. But there's another version of this that happens maybe even more commonly: a self-attention layer.
[00:45:53] Here, we only have one set of things, one sequence of inputs: one sequence of vectors that we're processing. So now we no longer have this separation between data vectors and query vectors; we just have one set of input vectors that we would like to process. In a self-attention layer, we input a set of vectors X and output a set of vectors Y, the same number as the input vectors. The mechanism is basically the same attention mechanism that we just saw, and we're still going to use this notion of filtering, but rather than projecting our data vectors into keys and values as we previously did...
[00:46:35] ...now what we're going to do is take each one of our input vectors and project it into three different things: a query, a key, and a value. The equations change just a little bit, but the picture over here doesn't actually change very much. Each of our input vectors is separately projected to a query, a key, and a value, and then we have the exact same computation: we've got queries, we've got keys, we've got values. From the perspective of everything happening up here, it's all the same; it just so happened that we computed the queries, keys, and values from different linear projections of those same input vectors, but all the computation is otherwise shared. Yeah?
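Putting the three projections together, a self-attention layer might be sketched as below. This is a minimal sketch under stated assumptions: it takes the input and output dimensions equal, and it omits refinements used in practice (such as the 1/sqrt(d) score scaling and multiple heads):

```python
import numpy as np

def softmax(s, axis=-1):
    # numerically stable softmax along one axis
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One self-attention layer: queries, keys, and values all come from the same X."""
    Q = X @ Wq                    # (n, dq) queries
    K = X @ Wk                    # (n, dq) keys
    V = X @ Wv                    # (n, dv) values
    A = softmax(Q @ K.T, axis=1)  # (n, n): row i is a distribution over all inputs
    return A @ V                  # (n, dv) outputs, same count as inputs

rng = np.random.default_rng(2)
n, dx, dq, dv = 5, 4, 4, 4        # equal dimensions, as is typical in practice
X = rng.normal(size=(n, dx))
Wq, Wk, Wv = (rng.normal(size=(dx, d)) for d in (dq, dq, dv))
Y = self_attention(X, Wq, Wk, Wv)
```

The only change from the cross-attention sketch is that there is no separate Q input: a third learnable matrix Wq produces the queries from the same input vectors.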
[00:47:20] The question is: what are D-in and D-out, and how are they sized? These are architectural hyperparameters of the layer, right? Just as a learnable linear layer in a model projects from a D-in to a D-out, and those are architectural hyperparameters of that layer, the same is true for a self-attention layer: D-in and D-out are architectural hyperparameters. And in principle they could be different; there's enough flexibility in this architecture that D-in and D-out could differ, although I don't think I've almost ever seen that. In practice they're almost always the same, so I've been a little bit extra general in the notation here. Okay. So I don't know that we necessarily need to walk through this.
[00:48:00] Oh, actually, there is one important thing. I said that we separately project the inputs into queries, keys, and values. That happens via three matrix multiplies with our three learnable weight matrices: one for keys, one for values, one for queries. We separately project the input vectors X into keys, queries, and values. But in practice we can typically compute them with just one matrix multiply, because it's typically more efficient on hardware to do fewer large matrix multiplies than to do more smaller matrix multiplies.
[00:48:36] So a pretty common trick in practice is to fuse, to concatenate, these three matrices along one dimension and compute all of the keys, queries, and values for all the input vectors at once with one big matrix multiply. If you've read about transformers before, they sometimes distinguish between encoder and decoder transformers, or encoder-decoder attention. In that case, this would be the decoder-only attention, which corresponds to the decoder of the RNN example at the beginning of class. But this mechanism, this so-called decoder-only attention, is actually the most commonly used flavor of attention nowadays. So we are quite divorcing ourselves from the RNN now, right?
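The fusion trick mentioned here can be checked numerically: concatenating the three weight matrices along the output dimension turns three small matmuls into one big one, and splitting the result recovers the separate projections. A sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
n, dx, dq, dv = 5, 4, 3, 6
X = rng.normal(size=(n, dx))
Wq = rng.normal(size=(dx, dq))  # query projection
Wk = rng.normal(size=(dx, dq))  # key projection
Wv = rng.normal(size=(dx, dv))  # value projection

# three separate (small) matrix multiplies
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# one fused (large) matrix multiply: concatenate weights along the output dimension
W_qkv = np.concatenate([Wq, Wk, Wv], axis=1)            # (dx, dq + dq + dv)
Qf, Kf, Vf = np.split(X @ W_qkv, [dq, 2 * dq], axis=1)  # split the result back apart
```

The fused result splits back into exactly the same Q, K, and V; the benefit is purely hardware efficiency, since one large matmul keeps the accelerator busier than three small ones.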
[00:49:20] So this flavor of it doesn't really make sense to use in the RNN that we saw at the beginning of class, right? We've basically been doing a little bit of sleight of hand here: we introduced this architecture for the purpose of the RNN, in the very concrete case of sequence-to-sequence machine translation, but we've now generalized it into a totally different operator that can be used all on its own. And in this particular generalization into self-attention, it actually no longer can be used in that decoder in the RNN. But it's a very useful primitive that gets used in a lot of other places. The question is: what's the benefit or difference between self-attention versus cross-attention? They get used in different contexts.
[00:49:55] In some situations you naturally have two different kinds of data that you want to compare, which we saw, for example, in the machine translation setting: we have an input sentence and an output sentence, and we believe there's some natural structure in the problem, two different sets of things we want to compare. That also might happen in, say, image captioning: we have an input image and we want to produce an output sentence, so there are two different kinds of things we want to compare, pieces of the image and tokens in the words we're generating. So for some problems there's just this natural structure where you have two different kinds of things floating around. But for other problems there aren't two kinds of things, there's just one. Say you're doing image classification: there's only an image; we just want to process the
[00:50:34] So in that case we just want to compare parts of the image with itself, and that's where you use a self-attention layer. They just get used for different kinds of problems. But crucially, we want to reuse basically the same machinery and the same computational primitives across those different kinds of problems, and that's really beneficial. There are a couple of interesting things about attention that I want to get through. One is: let's consider what happens if you permute the inputs. We had a set of input vectors; what happens if you shuffle them and process them in a different order? Actually, a lot of interesting stuff happens. The keys, the queries, and the values will all end up the same, right? Because they are computed as linear projections of the input.
[00:51:14] So we'll end up getting the same keys, queries, and values; they'll just be in a different order, shuffled in the same way that the inputs were. And now, because our similarity scores were just dot products, we'll also end up with the same similarity scores, again shuffled in accordance with the way we shuffled the inputs. Same thing with the softmax: softmax doesn't care about the order of its inputs, so it's now operating on the same vector, just shuffled. So each column of our attention weights will end up the same as before, just shuffled. And then the same with the linear combinations: our outputs y will still be the same outputs as before, just shuffled. So that means there's a really interesting structure here called permutation equivariance.
[00:51:54] Remember, we saw this a couple of lectures ago with convolution. Now we see a different equivariance property of these self-attention layers: if we shuffle the inputs, then we get the same outputs, just shuffled in the same way that the inputs were shuffled. This means that self-attention doesn't actually care about the order of the inputs. If we change the order of the inputs, we get the same outputs, just shuffled in the same way; the computation of the layer does not depend on the order in which we present the inputs. So that means we can think of self-attention as not really operating on sequences of vectors; they just happen to be packed into an ordered sequence, a matrix.
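The shuffling argument above can be checked directly. Here is a minimal NumPy sketch of a single-head self-attention layer (the function name, random weights, and shapes are mine, not the lecture's notation) demonstrating that permuting the input rows permutes the output rows in exactly the same way:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention: Q, K, V are linear projections of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    e = q @ k.T / np.sqrt(k.shape[1])            # dot-product similarity scores
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a = a / a.sum(axis=1, keepdims=True)         # softmax over each row
    return a @ v                                 # weighted sums of the values

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

perm = rng.permutation(n)
y = self_attention(x, wq, wk, wv)
y_perm = self_attention(x[perm], wq, wk, wv)

# Permutation equivariance: shuffled inputs give the same outputs, shuffled.
assert np.allclose(y_perm, y[perm])
```

With the weight matrices fixed, the check holds for any permutation, which is the property being described here.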
[00:52:35] But we really think of it instead as operating on an unordered set of vectors, because the outputs that we get don't actually depend on what order we've packed those vectors into our input matrix. So we really think about this as a different kind of neural network primitive that fundamentally operates on sets of vectors rather than sequences of vectors. But this is sometimes a problem: sometimes it is useful to tell the neural network what the order of the entries is. So as a quick fix to that, we'll sometimes concatenate an additional piece of data onto each of the input vectors, called a positional embedding. That is basically some piece of data that tells the neural network this one's at index one, this one's at index two, this one's at index three, and so on. And there are a bunch of different mechanisms for that.
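As one illustration of such a mechanism (the lecture only says several exist), here is a sketch of sinusoidal positional embeddings, a common choice, concatenated onto the inputs as described above; many implementations add them to the inputs instead:

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Sinusoids of different frequencies encode each position index."""
    pos = np.arange(n)[:, None]                  # (n, 1) token indices
    freq = 10000.0 ** (-np.arange(0, d, 2) / d)  # (d/2,) frequencies
    emb = np.zeros((n, d))
    emb[:, 0::2] = np.sin(pos * freq)
    emb[:, 1::2] = np.cos(pos * freq)
    return emb

n, d = 6, 8
x = np.random.randn(n, d)
# Concatenate position info onto each input vector, breaking the
# permutation symmetry of the self-attention layer.
x_with_pos = np.concatenate([x, sinusoidal_positions(n, d)], axis=1)
assert x_with_pos.shape == (n, 2 * d)
```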
[00:53:16] The question is, is it going to train to the same result? I'm not really talking about training here; I'm talking about fixing the weight matrices and just considering the computation of the layer. Then, if I shuffle the inputs, I receive the same outputs, but shuffled in the same way that the inputs were shuffled. So the question of which vectors I compute at the output does not depend on the order of the vectors in the input, but the order in which I get those vectors at the output does depend on the order they were presented in the input. There are another couple of tricks we can do with self-attention, but I'll go through these a little bit faster.
[00:53:53] Sometimes, you know, in a full self-attention layer, we allowed every piece of the input to look at every other piece of the input. But for some problems, we might want to impose some structure on this computation and say that certain pieces of the input are only allowed to look at certain other pieces, rather than everything being allowed to look at everything. We can implement this via a notion called masked self-attention. What we're going to do is, after we compute these alignment scores E, go in and overwrite the alignment scores with negative infinity in places where we want to block the attention. And if you have a negative infinity in your alignment scores, then after you do a softmax, it's going to end up as a zero, if you walk through the softmax computation.
[00:54:32] So that means that whenever there's a negative infinity in the alignment scores, we end up with a zero in the scores after the softmax, which means that that output y will not depend on the value vector computed at that index. So this is a mechanism that lets us control which inputs are allowed to interact with each other in the course of the computation. And we might want to do this for language modeling, because now we've generalized this operator to the point where we don't need an RNN at all; we can use it for the same problem that we used to use an RNN for. So now we can use it to process a sequence of words like "attention is very" and then output "is very cool."
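The negative-infinity trick can be sketched in a few lines. This is a minimal illustration (variable names are mine) of a causal mask applied to a matrix of alignment scores, showing that the softmax turns the blocked entries into exact zeros:

```python
import numpy as np

n = 4
e = np.random.randn(n, n)                   # alignment scores E
mask = np.triu(np.ones((n, n)), k=1)        # 1 above the diagonal = "future"
e_masked = np.where(mask == 1, -np.inf, e)  # block attention to the future

a = np.exp(e_masked - e_masked.max(axis=1, keepdims=True))
a = a / a.sum(axis=1, keepdims=True)        # softmax maps -inf to exactly 0

# Row i only attends to positions 0..i: the upper triangle is all zeros,
# so output i cannot depend on value vectors at later indices.
assert np.allclose(np.triu(a, k=1), 0.0)
# Each row is still a valid distribution over the allowed positions.
assert np.allclose(a.sum(axis=1), 1.0)
```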
[00:55:10] So in this case we're doing the same language modeling task that we saw last lecture with RNNs, but we can now do it natively with this self-attention block. In this case we want to make the first output depend only on the first word, and the second output only allowed to depend on the first two words; we don't want to let the network look ahead in the sequence and cheat. So here is where we would use masking. Another thing that we'll sometimes do with self-attention is called multi-headed self-attention, where you run H separate, independent copies of self-attention in parallel. Why do you want to do this? Because it's more computation, it's more flops, it's more parameters, and in deep learning we always want more and bigger.
[00:55:47] This is another way you can make this layer bigger and more powerful. So what we're going to do is take our inputs X and route them to H independent copies of separate self-attention layers. Those will each produce their own outputs Y, which we then stack up along the output, and then we have another linear projection at the output to fuse the output data from each of the independent self-attention layers. This is called multi-headed self-attention, and it's basically the format we always see in practice: whenever you see self-attention used these days, it's almost always this multi-headed version. And in practice, it turns out you can compute this all with matrix multiplies as well, so you don't have to run a for loop.
[00:56:35] You can compute each of these H copies of self-attention all in parallel if you're clever and use batched matrix multiplies in all the right places. In fact, this whole self-attention operator seems like a lot of stuff going on, but it's really basically just four matrix multiplies. We have one matrix multiply where we take our inputs and project them to queries, keys, and values. We have another matrix multiply where we compute the query-key similarities: for each Q, we compute the similarity against all the K's, and that's one big batched matrix multiply.
[00:57:08] Then, in the multi-headed case, we have another one, the value weighting, where we take linear combinations of all the values weighted by the softmax entries, and that can be done in another big batched matrix multiply. And then finally we have an output projection to mix information across the different heads of our self-attention. So even though there are a lot of equations and a lot of vectors flying around, this whole self-attention operator is basically just four big batched matrix multiplies. And that's great, because matrix multiplies are a really scalable, powerful primitive that we can distribute and optimize, and we can make this thing highly parallel, highly scalable, highly efficient. Yeah. The question is whether the x1, x2, x3 are exactly the same.
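The four matrix multiplies described above can be sketched concretely. This is an illustrative NumPy implementation (names, shapes, and the fused QKV projection are my choices, not the lecture's notation): one matmul for the QKV projection, one batched matmul for the query-key scores, one batched matmul for the value weighting, and one for the output projection across heads.

```python
import numpy as np

def multi_head_self_attention(x, w_qkv, w_out, h):
    """Multi-head self-attention as four (batched) matrix multiplies."""
    n, d = x.shape
    dh = d // h                               # per-head dimension
    # (1) project inputs to queries, keys, values for all heads at once
    q, k, v = np.split(x @ w_qkv, 3, axis=1)  # each (n, d)
    # reshape to (h, n, dh) so every head runs in parallel
    q, k, v = (t.reshape(n, h, dh).transpose(1, 0, 2) for t in (q, k, v))
    # (2) batched matmul: query-key similarity scores per head, (h, n, n)
    e = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    a = np.exp(e - e.max(axis=2, keepdims=True))
    a = a / a.sum(axis=2, keepdims=True)      # softmax over the keys
    # (3) batched matmul: weight the values by the attention scores
    y = a @ v                                 # (h, n, dh)
    # (4) output projection mixes information across the heads
    y = y.transpose(1, 0, 2).reshape(n, d)
    return y @ w_out

rng = np.random.default_rng(0)
n, d, h = 4, 8, 2
x = rng.standard_normal((n, d))
out = multi_head_self_attention(x, rng.standard_normal((d, 3 * d)),
                                rng.standard_normal((d, d)), h)
assert out.shape == (n, d)
```

NumPy's `@` broadcasts over the leading head dimension, which is exactly the "batched matrix multiply" the lecture refers to.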
[00:57:51] Yeah, they are, but we're just going to have separate copies of the self-attention layer. Critically, they all have different weights, and those weights are initialized randomly, to different values, so the heads end up learning to process the inputs in slightly different ways. This is just a way to give extra capacity to the layer. Oh yeah, the only thing different between the different heads is the weights. The architecture is exactly the same and the computation is exactly the same, but they have different weights, and those weights are initialized to different things. Other than that, it's all exactly the same. Okay, there's some stuff there that we can skip.
[00:58:26] But now we've gotten to a really interesting place, where we have three different ways to process sequences that we've seen in this class. The first is recurrent neural networks. We saw that recurrent neural networks basically operate on 1D ordered sequences. They're really cool, they're really powerful, and people liked them for a long time, but they're fundamentally not very parallelizable, because of this recurrent structure where each hidden state depends on the previous hidden state. They're just a fundamentally sequential algorithm; there's no way to parallelize them across the sequence. And that makes them very difficult to scale, very difficult to make very big. Another primitive that we've seen is convolution, and convolution basically operates on multi-dimensional grids.
[00:59:05] We've seen it on two-dimensional grids in the case of images; you can also run it on 1D grids, 3D grids, 4D grids. Convolution is basically something that mixes information locally in n-dimensional grids. This is great: it's very parallelizable, because with this notion of sliding a kernel around a grid, each position where we might place the kernel can in principle be computed in parallel. So this is a very parallelizable primitive. But it has a hard time building up large receptive fields. If we want to summarize an entire very long input sequence or an entire very large image with convolution, we either need very large convolutional kernels, or we need to stack up many, many convolutional layers. So that still introduces some fundamental sequentiality in the way we need to process large pieces of data.
[00:59:52] And now self-attention is a separate kind of primitive that operates on sets of vectors. It naturally generalizes to long sequences; there are no bottlenecks the way there are in recurrent neural networks. There's also no need to stack up many, many layers of them to let all the vectors look at each other: in one layer of self-attention, every vector looks at every other vector, so with just one layer you can do a lot of computation. It's also highly parallelizable; as we saw, the whole operation is just four big matrix multiplies, and matrix multiplies are a great primitive that we can distribute, run on GPUs, and run in very scalable distributed ways. The only downside of attention is that it's expensive: it ends up costing O(n²) compute for a sequence of length n, and O(n²) memory as well, though later techniques reduce the memory to O(n).
[01:00:38] And if your n ends up being something like 100,000, or a million, or 10 million, then n² becomes very expensive, but you can solve that by buying more GPUs. That's basically the solution that people have come up with here. So attention has become this super awesome primitive that is super powerful for processing very arbitrary pieces of data, and you might be wondering which of these three you should use. Attention is all you need: it turns out that of the three, you can get a long way using only attention. Yeah, the question is: it's parallelizable, but what's the advantage of that? The advantage is that, in the history of computing, it gets hard to make processors faster, right?
[01:01:18] We've sort of run up against a fundamental limit in hardware: it's become very difficult to make individual processors faster. But what we can do very easily is get a lot of processors, right? So the way we've been able to marshal more computation over the last two decades is by finding algorithms that do not require running on one really fast processor, but instead can make use of 10 processors, or a hundred, or a thousand, or a million processors. I want to blanket the entire Stanford campus with processors and have all of them working together in concert to process this big thing. If we can find algorithms that do that, that's how we can scale up and get really big, powerful computations.
[01:01:58] So the benefit of parallelizability is that if you have algorithms that can trivially make use of more and more processors in parallel, then we can scale up those algorithms without having to wait for individual processors to become faster, which they may never do. Yeah, is there a trade-off with the n squared? I think the n squared is actually a good thing. It seems bad: you're taught in computer science that higher exponents on that n are bad. But in the case of neural networks, more compute can actually be a good thing, because more compute means the network is doing more computation; it has more ability to think, more ability to process. So the more compute the network does on the input sequence, maybe the better the answer it can arrive at.
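To make the quadratic cost concrete, here is a rough sketch (sizes are arbitrary, and this ignores heads and activations) showing that the attention-score matrix alone has n × n entries, so doubling the sequence length quadruples the memory just for the scores:

```python
import numpy as np

d = 64
for n in (1_000, 2_000, 4_000):
    q = np.zeros((n, d), dtype=np.float32)
    k = np.zeros((n, d), dtype=np.float32)
    scores = q @ k.T                  # (n, n) similarity matrix
    print(n, scores.nbytes)           # bytes grow as n squared

# Doubling n quadruples the score-matrix memory.
assert np.zeros((2_000, 2_000), dtype=np.float32).nbytes \
       == 4 * np.zeros((1_000, 1_000), dtype=np.float32).nbytes
```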
[01:02:38] So it means that it's more expensive, but that's not necessarily a bad thing. So basically the transformer is a neural network architecture that puts self-attention at the core of everything. Our input is going to be a set of vectors X. Then we run all those vectors through self-attention, which, as we just said, is this amazing primitive that lets all the vectors talk to each other. After that we wrap the self-attention in a residual connection, for all the same reasons that we wanted to use residual connections in ResNets just a couple of lectures ago. Then we take the output of that residual connection and pass it through a layer normalization, because, as we saw with ResNets and CNNs, adding normalization inside your architectures makes them train more stably.
[01:03:18] But now there's something interesting, because what self-attention basically does is compare all the vectors with each other. That's a very useful primitive, a very powerful thing to do. But we also want to give this network the ability to process vectors independently, one by one. So there's a second primitive inside the transformer, which is the multi-layer perceptron, MLP, also called the FFN. Basically this is a little two-layer neural network that is run independently on each one of our vectors. This works in concert with the self-attention: self-attention lets all the vectors talk to each other and compare with each other, and the FFN or MLP lets us perform computation on each vector independently.
[01:04:00] We'll also wrap the MLP in a residual connection, put in a layer normalization, and put a box around the whole thing and call it a transformer block. A transformer is just a sequence of transformer blocks. And these things have gotten much, much bigger over time. The architectures haven't changed too much since 2017, when this was introduced. The original transformer was something like 12 blocks and 200 million parameters; now people are training transformers with hundreds of blocks and trillions of parameters. So this same architecture has scaled across many orders of magnitude in compute, size, and parameters over the past eight years. They can be used for language modeling, as we've already seen, and they can also be used for images.
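The block just described can be sketched end to end. This is a hedged toy version in NumPy (post-norm, as in the original 2017 arrangement); `attn_fn` stands in for any self-attention function mapping (n, d) to (n, d), and all names and shapes are my own illustration, not the lecture's code.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each vector to zero mean, unit variance (no learned scale/shift here)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, W1, W2):
    # little two-layer network, applied to each vector independently
    return np.maximum(x @ W1, 0.0) @ W2

def transformer_block(X, attn_fn, W1, W2):
    """One post-norm transformer block:
    attention -> residual -> layer norm -> MLP -> residual -> layer norm."""
    X = layer_norm(X + attn_fn(X))       # vectors talk to each other
    X = layer_norm(X + mlp(X, W1, W2))   # per-vector processing
    return X
```

A full transformer would just stack this block many times.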
[01:04:46] And here the application is fairly straightforward. Given an image, we basically divide the image up into patches and project each of those patches separately into a vector. Those vectors then get passed as inputs to our transformer, and the output gives us one output from the transformer for every patch in the input. Now if you want to do a classification problem, you do a pooling operation on all the vectors coming out of the transformer and have a linear layer that predicts your class scores. So this same transformer architecture can be applied both to language and to images, and to a lot of other things as well. I mentioned there have been a couple of minor tweaks to transformers since they were first introduced, but we're running out of time, so I'll just leave those as extra reading.
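The image recipe just described, patchify, project, transform, pool, linear head, fits in a few lines. A hedged NumPy sketch with illustrative shapes; `block_fn` is a stand-in for the stack of transformer blocks:

```python
import numpy as np

def vit_classify(img, Wproj, Wcls, block_fn, patch=4):
    """Toy ViT-style classifier sketch.

    img: (H, W, C). Split into non-overlapping patch x patch tiles,
    flatten and project each tile to a vector, run the transformer,
    average-pool the outputs, then apply a linear classification head."""
    H, W, C = img.shape
    tiles = img.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    tokens = tiles @ Wproj          # one vector per patch
    tokens = block_fn(tokens)       # stand-in for the transformer blocks
    pooled = tokens.mean(axis=0)    # pool over all patch outputs
    return pooled @ Wcls            # class scores
```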
[01:05:30] So the summary of where we get to at the end of this lecture is basically the two things I promised at the beginning. One is that we introduced attention, which is this new primitive that lets us operate on sets of vectors. It's highly parallelizable; it's basically just a couple of matrix multiplies, so it's highly scalable, highly flexible, and it can be applied in a lot of different situations. The other is the transformer, which is a neural network architecture that uses self-attention as its main computational primitive. And the transformer is basically the neural network architecture that every application in deep learning is using these days. So that's super powerful, super interesting, super exciting. Transformers have been with us for about eight years now, and I don't see them really dying anytime soon. So that's pretty exciting.
[01:06:14] So that's basically it for today's lecture. Next time we'll come back and talk about some new tasks, detection, segmentation, visualization, and see how we can use these architectures to do new cool things.

================================================================================ LECTURE 009 ================================================================================ Stanford CS231N | Spring 2025 | Lecture 9: Object Detection, Image Segmentation, Visualizing Source: https://www.youtube.com/watch?v=PTypu6GqEd4 --- Transcript

[00:00:05] Okay, today we'll be talking about different core computer vision algorithms and tasks: detection and segmentation. We will also be covering topics around visualization and understanding. I will cover the most important ones. All right.
[00:00:32] So the previous lecture, last time, what we discussed was the topic of transitioning from sequence-to-sequence models, RNNs, to transformers. We saw that transformers were defined by having some sort of encoder: a number of layers which had multi-headed self-attention and layer norm, as well as some MLP layers. This was ultimately called something that we now refer to as an encoder, encoding the sequence. Then, if we need to decode an image or a language sequence as the output, a similar type of architecture is used for the decoder, taking the encoded tokens as input and then generating, I'm hoping that you can see my cursor too, the desired output. Justin talked quite extensively about the differences of modeling sequences
[00:01:56] with recurrent neural networks, RNNs, and their variations, which we talked about last week, I think on Tuesday, and then using convolution as another approach. But ultimately, we said, self-attention is what we work with in many of the applications these days. They work much better than the other two. They are more expensive, they do add computation and memory requirements, but that comes with much better modeling of the sequence and better results on any of the tasks. So up until here it was mostly about self-attention. We also talked a little bit about cross-attention and related topics. And then we got to the topic of vision transformers, which is one of the core models being used in modern computer vision applications.
[00:03:09] We did go through this in the last minutes of the previous lecture, and I want to revisit the topic; after that I'll stop and hear any questions or comments you may have regarding the assignments and everything I've talked about so far. We talked about the fact that what we do with transformers when we want to process images is split the image into patches, basically creating a kind of sequence, right? So the image is split into S-by-S, or in this case maybe 3x3, patches, and each of those patches is then represented by what we call a token. Tokens are often a linear projection of the reshaped version of the image patch into a vector; it's basically a D-dimensional vector, as you can see in this slide. But because we have turned the image into patches, what becomes important? What are we losing here?
[00:04:33] We're basically losing the location, the position, the 2D position within the image, right? So that's why we often add something that we call a positional embedding. There are many different ways of doing this. You can create a sequence and just use sequence numbers 1, 2, 3, and so on, or you can do a 2D version with X and Y coordinates. Adding these two together creates the new token that goes into the transformer layers, with the same self-attention, layer norm, and MLP, everything we talked about last week. And then the output layer will generate the output vectors for us, which could be used for any application. One of the major applications in computer vision has been classification. We started with image classification, right?
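As a sketch of the 2D option just mentioned, one can add a row embedding and a column embedding to each patch token. This is a toy illustration with made-up names; the random vectors here stand in for parameters that would be learned, and real ViTs often simply learn one embedding per position instead.

```python
import numpy as np

def add_positional_embedding(tokens, grid_h, grid_w):
    """Add a 2D positional signal to patch tokens.

    tokens: (grid_h * grid_w, d), in row-major patch order.
    Each token gets its row embedding plus its column embedding."""
    n, d = tokens.shape
    row_emb = np.random.randn(grid_h, d) * 0.02  # learned in practice
    col_emb = np.random.randn(grid_w, d) * 0.02  # learned in practice
    rows = np.repeat(np.arange(grid_h), grid_w)  # row index of each patch
    cols = np.tile(np.arange(grid_w), grid_h)    # column index of each patch
    return tokens + row_emb[rows] + col_emb[cols]
```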
[00:05:45] So with image classification, what becomes important is to somehow be able to encode, or generate, something as the output that is representative of the class. So what we often do is add one token, a special extra input to the transformer, which is of the same dimensionality but is a learnable parameter, and in the output space whatever that slot represents is turned into the class probability vector: a C-dimensional vector of the class probabilities. That's what we often call the class token.
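A minimal sketch of the class-token mechanism, with made-up shapes: one extra learnable D-dimensional vector is stacked in front of the patch tokens, and after the transformer its output slot is projected to the C class probabilities.

```python
import numpy as np

def prepend_class_token(tokens, cls_token):
    """Stack the learnable class token in front of the patch tokens.
    tokens: (n, d); cls_token: (d,) learnable parameter."""
    return np.vstack([cls_token[None, :], tokens])

def class_probs_from_cls(outputs, Whead):
    """Read off the class-token output slot and project it to
    a C-dimensional probability vector via a linear head + softmax."""
    logits = outputs[0] @ Whead
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

During training, the self-attention layers mix information from all the patch tokens into this slot, so supervising it supervises the whole network.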
[00:06:30] So this is one of the most basic and standard ways of using ViTs, vision transformers, for image classification. But transformers are not only used for classification; they can be used for many other tasks, and we'll be covering some of those today as well. Last week we also talked about this other variant of the transformer: again we have tokens, and from the tokens we go through the transformer layers. If you remember, last time we talked about these multiple layers of transformers; as I said, positional embeddings are added, and here, because we see the entire image all together, we don't have to do masking like we did for language, because language is really a sequence for which we shouldn't be using future information. And then ultimately the transformer gives an output vector for each of the input patches.
[00:07:43] And the other option for training a transformer, instead of having a separate class token, is to just take the outputs, run them through a pooling layer, and then turn that into a probability vector over C different classes. So I talked about two versions of transformers, right? In one of them we use a class token, and in the other we take all of the output tokens and apply pooling and a projection into a vector that represents the class probabilities. How do we supervise this? It's the exact same thing we talked about earlier: backpropagation, defining a loss function, binary cross-entropy, the softmax loss, and so on. So this is ViT in a nutshell, and over the years this type of architecture has remained the same for many different applications.
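The supervision step just mentioned is the usual softmax cross-entropy. A numerically stable sketch (my own helper, not the lecture's code):

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Softmax loss for one example: negative log-probability of the
    true class, computed in log-space for numerical stability."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]
```

For uniform logits over C classes, the loss is log(C), a useful sanity check at initialization.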
[00:08:51] Many modern architectures right now use many of these components, very similar to what we presented here. But there are some optimizations that we had in the slides last week, and I'll just spend a couple of minutes on them. I want you to understand that there are many different tweaks and optimizations for better performance and also for making transformer training a little bit more stable. One of them is actually about the residual connections. This layer norm is basically outside the residual connection, which means that whatever we get here, we normalize it, right? So this means that we can't replicate any form of identity function anymore, which is exactly what ResNets really wanted to do, right? So the solution for that is to bring in the layer normalization.
[00:09:55] We often put it before the self-attention, and the second one right before the MLP layer. So the normalization is there, but we also preserve our identity function. There are also other ways of normalizing. There is RMSNorm, root mean square normalization, which is actually a very basic type of normalization: it doesn't use the mean value of each feature for normalization, but it makes the training a little bit more stable. Again, these are all empirically shown to be better options. Although there are some justifications for why they work well, mostly the reason for adopting them is just the fact that they make training more stable.
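The two fixes just described, moving the norm inside the residual branch and swapping LayerNorm for RMSNorm, can be sketched as follows (toy code with my own naming). Note that with the norm inside the branch, zeroing out the branches leaves the input completely untouched, which is exactly the identity property being discussed.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square only; unlike LayerNorm
    it does not subtract the per-feature mean."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gain * x / rms

def prenorm_block(X, attn_fn, mlp_fn, g1, g2):
    """Pre-norm block: the normalization sits inside each residual
    branch, so the skip path is a clean identity."""
    X = X + attn_fn(rms_norm(X, g1))
    X = X + mlp_fn(rms_norm(X, g2))
    return X
```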
The other option is is to instead of using a [00:10:56] option is is to instead of using a simple MLP, we use a uh sugloo MLP where [00:11:03] simple MLP, we use a uh sugloo MLP where we actually do some sort of this is what [00:11:05] we actually do some sort of this is what we call gated non nonlinearity. Instead [00:11:08] we call gated non nonlinearity. Instead of having two vectors of U weight [00:11:11] of having two vectors of U weight matrices of W1 and W2, we add a third [00:11:14] matrices of W1 and W2, we add a third one 1 2 and three. But here we create [00:11:18] one 1 2 and three. But here we create some sort of gated non nonlinearity [00:11:21] some sort of gated non nonlinearity which basically what what it uh does is [00:11:27] which basically what what it uh does is um [00:11:30] is is um [00:11:32] is is um getting more um trainable uh parameters [00:11:37] getting more um trainable uh parameters and not just necessarily trainable [00:11:39] and not just necessarily trainable parameters but creating a better [00:11:41] parameters but creating a better nonlinearity for a small architecture. [00:11:44] nonlinearity for a small architecture. Even if we select the hidden layer um [00:11:47] Even if we select the hidden layer um value equal to 8 di divided by 3, it [00:11:51] value equal to 8 di divided by 3, it keeps the the same size of the network [00:11:54] keeps the the same size of the network in terms of the number of parameters but [00:11:56] in terms of the number of parameters but it does um learn higher dimensional [00:12:00] it does um learn higher dimensional nonlinearities um in [00:12:04] nonlinearities um in uh in in that layer. The last piece is [00:12:08] uh in in that layer. The last piece is mixture of extra experts that is often [00:12:10] mixture of extra experts that is often used in even the very modern [00:12:12] used in even the very modern architectures these days. Instead of [00:12:15] architectures these days. 
[00:12:18] Instead of having one set of MLP layers, you can have multiple sets of MLP layers. Each of those will be an expert, and through a router the tokens will be routed to a few of those experts, so at any time we only have a few active experts. Again, what this does is increase the number of parameters, and it helps learn more robust models without increasing the compute too much. These are all parallel MLPs, so we can have multiple experts in parallel. As I said, they are used in all LLMs these days, large language models; all of the modern LLMs, up to the level that we know about, are using these types of tweaks. And this is the summary of all of the tweaks I just mentioned.
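A toy top-k routing sketch of the mixture-of-experts idea just described. The names and the routing rule are my own simplification; production MoE layers batch this per expert and add load-balancing terms.

```python
import numpy as np

def moe_layer(tokens, Wrouter, experts, k=2):
    """Toy mixture-of-experts: a linear router scores every expert for
    each token, only the top-k experts run on that token, and their
    outputs are mixed with renormalized router weights."""
    scores = tokens @ Wrouter                 # (n, num_experts)
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        top = np.argsort(scores[i])[-k:]      # indices of the k active experts
        w = np.exp(scores[i][top])
        w /= w.sum()                          # softmax over the chosen experts
        for weight, e in zip(w, top):
            out[i] += weight * experts[e](tok)
    return out
```

So the parameter count grows with the number of experts, while each token only pays the compute of its k active experts.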
[00:13:30] Is this similar to a bias? No, this is completely a trainable parameter by itself. You then train either a feed-forward network or just a linear projection to turn that output into the probability vector. And it's not just that; remember that you have so many self-attention layers here, right? Those self-attention layers are basically fusing the information, creating attention between all of the tokens and this class token. So when you supervise it from here, where the loss function comes in, this output will represent the class probabilities vector. The question is whether there are nice intuitions about what the different experts are doing. That's a great question, because they are trained in parallel and they are initialized differently.
[00:14:32] They often try to learn one aspect, or a related, sometimes very closely related, aspect; but it's just adding more compute and more parameters, giving the network room to learn different things if it does have to learn multiple concepts. For example, if you have to cover multiple probability distributions, then with these ops you often have the power to separate those modes of the data. [00:15:02] The question is whether the number of experts is a hyperparameter or not. Yes, definitely it's a hyperparameter. From what I know, it's often predefined; people don't necessarily over-fine-tune it, but yes, these are all hyperparameters. [00:15:22] And why does moving the layer norm help us learn an identity transformation?
[00:15:37] Look at this architecture: will you be able to create any form of identity? Right after that residual connection, the feature values are changed, because you have a normalization. You will never have the identity in the features, right? Because right after it you see the layer norm. That's why what we do is bring the norm inside the residual branch. [00:16:02] We have quite a few different tasks in computer vision, and these were the core, the most important, tasks for computer vision applications over the years.
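The identity argument can be checked numerically. A hedged sketch, assuming a plain layer norm with no learned scale or shift: with post-LN, `LN(x + f(x))` cannot return `x` even when the sublayer `f` outputs zero, while pre-LN, `x + f(LN(x))`, reduces exactly to the identity when `f` outputs zero.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain layer norm (no learned scale/shift, for illustration only)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

x = np.array([[1.0, 2.0, 3.0]])
f_out = np.zeros_like(x)         # a residual branch that has learned to do nothing

post_ln = layer_norm(x + f_out)  # post-LN: the norm still rescales the features
pre_ln = x + f_out               # pre-LN: x + f(LN(x)) with f = 0 is exactly x
```

`post_ln` differs from `x` (the norm recenters and rescales it), while `pre_ln` equals `x`, which is why moving the norm makes the identity easy to represent.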
[00:16:16] Although these days we're solving much harder tasks, and nobody cares about object detection anymore because now we can just do it with one line of code, over the past 10 to 15 years there have been a lot of advances, and I really want to cover some of those today, just so that if you have to design something new yourself, you know where to look and how to design your models. And then ultimately there is the topic of visualization and understanding, which is very important in many applications. For example, if you're working with medical data, often the visualization and understanding is more important than the classification itself, or the detection of a tumor, for example. You want to know where, why, and so on. [00:17:05] The way we started the class, and this slide, is probably very familiar to everybody.
[00:17:18] We talked about different tasks, and for the task of object classification we talked about this: we spent quite a lot of time over the first few lectures on how we can create a classifier that maps images from pixels into labels. But another similarly important task is semantic segmentation. In semantic segmentation, what we care about is assigning a label to every single pixel inside the image: turning each of the pixels into the label for that object, or anything else in the scene. [00:18:12] So basically, when we train a model that does this, at test time we want to take an image and generate that same map as the output. How do we do that? There are many different options.
[00:18:31] So let's say what I can do is just look at every single pixel and say what the label for that pixel should be. In the very basic form, as you can see here, that's actually close to impossible: it's hard to say what object a specific pixel represents, because there is no context if you only look at the pixel itself. That's why context is important: we look at the surrounding areas. [00:19:10] If I take these patches, the pixel in the center plus the surrounding area, now I can train a convolutional neural network, or any network, that generates the output label for us, right?
[00:19:26] It's the same architecture that we've talked about over the quarter, and you can select any of those we used for image classification, because now you're classifying the entire patch as an image. It could be a CNN, a ResNet, a ViT, whatever. [00:19:40] This is really time-consuming, though, because if you want to run one full network for every single pixel in an image, it will take forever to turn this into a segmentation map. The other option we can use is this: instead of running one network for every single pixel, what if we train a neural network that takes the image as input and outputs the entire pixel map, the segmentation map: not just one single label, but a matrix of labels, right? In that case, we will have our segmentation task solved.
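The naive patch-at-a-time approach can be sketched as below. `classify_patch` is a hypothetical placeholder for any full classification network (CNN, ResNet, ViT); the point is the H*W separate forward passes, which is exactly why this is too slow in practice.

```python
import numpy as np

def sliding_window_segmentation(image, classify_patch, patch=5):
    """Naive segmentation: run a full classifier on the patch around every pixel.
    That is one forward pass per pixel -- O(H*W) passes for an (H, W) image."""
    pad = patch // 2
    padded = np.pad(image, pad, mode="edge")   # replicate borders so every pixel has a patch
    H, W = image.shape
    labels = np.zeros((H, W), dtype=int)
    for i in range(H):
        for j in range(W):
            labels[i, j] = classify_patch(padded[i:i + patch, j:j + patch])
    return labels
```

A fully convolutional network replaces this whole double loop with a single forward pass over the image, which is the next idea in the lecture.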
[00:20:32] In order to do that, we need to have a layer at the input that is the same size as the image, and at the output you also need some sort of inflated layer. You can't go to fully connected layers and so on, because now we are generating an image, and because of that we need to keep the network inflated. That's what we often call fully convolutional networks, or FCNs. [00:21:10] This is definitely a great idea, but there is a caveat, a problem: these images are large, so these layers will become very large, and there will be very many parameters to optimize. Especially in the early years, when we didn't have powerful GPUs, this was a bottleneck, a challenge, for training these algorithms. And that's why the algorithms evolved,
[00:21:45] starting from full-size images and going down in resolution, making the spatial resolution of the convolutions smaller and smaller through downsampling operations. Somewhere in the middle we have a low spatial resolution but something thick in terms of the number of channels, and from there we go back up to the same size as the image to create the output pixels. [00:22:14] In order to do that, we know how to do the downsampling, right? Downsampling was easy; we've talked about it: the pooling operation, strided convolution, and several other operations that could be used here. But on the upsampling side, we don't really know how to do it yet, right?
[00:22:50] Because we don't have a reverse of pooling, or reverse convolutions, right? Because of that, we had to invent some new operations that reverse downsampling. But before I get to defining what upsampling is, maybe I can ask you a question: how do you think this network is trained? Now we have a network that starts from an image and ends with an image, and the tool we have for training a network is a loss function, right? [00:23:37] What do you think is the best way to define a loss function for this network? We talked about softmax loss, right? We also talked a little bit about some regression losses and the SVM loss. But assuming we want to use the softmax loss function, how could we train this network? What would the objective be?
[00:24:06] So you said the mean classification loss over each of the pixels, and that's correct. You can add up the loss for every single pixel, because every single pixel is doing a classification, right? So you will have a sum over all pixels of the image, the loss at each pixel is just a simple softmax loss, and then you can backprop. That's the entire loss function you need. [00:24:36] The question is whether we need what we call ground truth for training. That is indeed the ground truth for segmentation, and yes, for these types of algorithms, because they are fully supervised, we do need the ground-truth label maps. In the early years there was a lot of work sitting down and manually labeling the pixels to be able to train these algorithms. These days we don't need that, because we have tools.
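The loss just described, softmax cross-entropy summed (here averaged) over every pixel, might look like this in numpy; the `(H, W, C)` logit layout is an assumption for illustration.

```python
import numpy as np

def per_pixel_softmax_loss(logits, labels):
    """Mean softmax cross-entropy over all pixels.
    logits: (H, W, C) raw class scores per pixel; labels: (H, W) integer class ids."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=-1, keepdims=True)
    H, W, _ = logits.shape
    # pick out the predicted probability of the correct class at each pixel
    p_correct = probs[np.arange(H)[:, None], np.arange(W)[None, :], labels]
    return -np.log(p_correct).mean()
```

Each pixel contributes an ordinary classification loss, so backprop through this sum trains the whole image-to-map network at once.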
[00:25:06] But early on, in order to train these algorithms, we needed the ground truth. Okay, very briefly, let me tell you what we do with upsampling. Upsampling is actually not that hard. We can use an unpooling operation, and there are different ways of doing it. One is nearest neighbor: if I want to go from a 2x2 matrix, as in the example here, to 4x4, I just copy the data, taking for each output the nearest neighbor in the lower-resolution map. Another is bed of nails: in the upsampled version you select only one position, say the one in the corner, to copy the data into, replace everything else with zero, and through the following layers of convolution those values will start appearing.
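Both of these parameter-free unpooling operations are tiny array manipulations. A sketch, assuming a single-channel 2D map:

```python
import numpy as np

def nearest_neighbor_unpool(x, factor=2):
    """Repeat each value of an (H, W) map `factor` times along both axes."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def bed_of_nails_unpool(x, factor=2):
    """Place each value at the top-left corner of its block; zeros everywhere else."""
    out = np.zeros((x.shape[0] * factor, x.shape[1] * factor), dtype=x.dtype)
    out[::factor, ::factor] = x
    return out
```

For a 2x2 input `[[1, 2], [3, 4]]`, nearest neighbor fills each 2x2 block with one value, while bed of nails leaves a single nonzero "nail" per block.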
[00:26:10] If we used max pooling on the encoding side of our network, what we can do is save the locations of the max, the entries that were selected, and then, in the max-unpooling stage, copy the data right back to where the max was. So basically we save the locations in the encoding part, and in the decoding part, in the upsampling step, we reuse those saved coordinates. [00:26:47] The other option is to do a learned upsampling. In all of the operations I just showed, there is no parameter to be learned; it's just a fixed operation. But learned upsampling is also possible. Very simply, let's revisit the convolution. In the convolution layer, what we did was apply a convolution filter at a pixel, generate the output, and repeat this for all of the pixels, right?
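Max unpooling with saved switch locations, as just described, might be sketched like this, assuming a single-channel map and non-overlapping 2x2 windows:

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """size x size max pool on an (H, W) map; also return each window's argmax."""
    H, W = x.shape
    out = np.zeros((H // size, W // size), dtype=x.dtype)
    idx = np.zeros((H // size, W // size), dtype=int)
    for i in range(H // size):
        for j in range(W // size):
            win = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            k = win.argmax()                # flat position of the max inside the window
            out[i, j] = win.flat[k]
            idx[i, j] = k
    return out, idx

def max_unpool(x, idx, size=2):
    """Write each value back to its saved argmax position; zeros elsewhere."""
    H, W = x.shape
    out = np.zeros((H * size, W * size), dtype=x.dtype)
    for i in range(H):
        for j in range(W):
            di, dj = divmod(idx[i, j], size)  # recover (row, col) inside the window
            out[i * size + di, j * size + dj] = x[i, j]
    return out
```

The encoder calls `max_pool_with_indices` and passes `idx` across to the decoder, which reuses those coordinates in `max_unpool`.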
[00:27:21] And when we wanted to downsample, what we did was strided convolution, where instead of taking steps of one we take steps of two and generate the outputs step by step. If you don't remember this part, go back to the lecture where we talked about it, the third lecture I think. We can replicate the same idea for the upsampling process. So this value will represent this area in the upsampled image, and we define some weights here to map it to the output map. For the next one, same story, but there will be overlaps, and for the overlaps you often sum over the output values. [00:28:10] Let me give you an example with a simple 1D function. If the input is just two values, A and B, we learn a filter that maps them to the higher-resolution output, right?
[00:28:30] And for doing that, we just apply the filter to each of the values and write the outputs here; for the parts where there's an overlap, it's a summation of what comes from each of the two locations. [00:28:52] So we did talk about these fully convolutional networks and how they are used; they are some of the most basic and most widely used algorithms for segmentation. I also want to very quickly highlight one widely used network, U-Net; as you can see, it has the shape of a U. It's actually the same architecture as I showed here, just drawn as a U shape.
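The 1D example with inputs A and B can be written out directly: each input value stamps a scaled copy of the learned filter into the output, stepping by the stride, and overlapping contributions are summed. A minimal sketch, with stride 2 assumed:

```python
import numpy as np

def transposed_conv1d(x, filt, stride=2):
    """Learned 1D upsampling: input x[i] contributes x[i] * filt at offset i*stride;
    where stamped copies overlap, their values are added."""
    out = np.zeros(stride * (len(x) - 1) + len(filt))
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(filt)] += v * filt
    return out
```

For `x = [1, 2]` and filter `[1, 1, 1]` the output is `[1, 1, 3, 2, 2]`: the middle entry is the overlap, where the copies stamped by the two inputs are summed.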
[00:29:35] The reason I'm highlighting this is that still today, some of the medical applications that work on segmentation use it; U-Net or its variants generate the state-of-the-art results if you don't want to use a foundation model. What it does is exactly what we explained: a downsampling phase that increases the field of view and loses some spatial information, and then an upsampling phase that goes back to the image resolution. [00:30:15] The only difference in U-Net, because it's used for segmentation, is the understanding that we need to keep the spatial information on the decoder side, because when we downsample we lose resolution.
[00:30:37] Then in upsampling, if you don't have that information, it's going to be a little bit hard, and sometimes the boundaries come out faded. In order to avoid that, the feature maps on the encoder side are copied over as inputs to the decoder layers. That way you keep the structural information of the image and generate outputs that are much sharper. So this was the idea behind U-Net, and as I said, it's actually used quite often. [00:31:14] Summary of semantic segmentation: we talked today about fully convolutional networks, where you have the same kind of filter as we had for downsampling here. Actually, to save time I removed some of the slides from this part; I have them in the backup slides, and you should check them out.
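The skip connection described here is literally a copy-and-concatenate of the saved encoder map with the matching decoder map. A sketch, assuming channels-last `(H, W, C)` feature maps of equal spatial size:

```python
import numpy as np

def unet_skip_merge(decoder_feat, encoder_feat):
    """U-Net style skip: concatenate the saved encoder feature map onto the
    upsampled decoder feature map along the channel axis."""
    assert decoder_feat.shape[:2] == encoder_feat.shape[:2], "spatial sizes must match"
    return np.concatenate([encoder_feat, decoder_feat], axis=-1)
```

The next decoder convolution then sees both the coarse, upsampled features and the fine spatial detail from the encoder, which is what keeps the boundaries sharp.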
[00:31:54] This is a transposed convolution. We do have a 3x3 filter here, but instead of applying the regular convolution to the input data, we apply the transposed version of the convolution operation, and it actually generates a larger output. So it's the transposed convolution, the reverse of the regular convolution. But why "transposed"? I would refer you to the additional slides. [00:32:27] The question is whether that filter is trained. Yes, it's very much like other convolution layers; all of the filters are trained. [00:32:39] Okay, great. This was the topic of semantic segmentation. As we talked about, we only get labels for the pixels. But if there are two instances of the same object, we have no idea which one is which, right? Because this is just outputting the pixel labels.
[00:33:09] And this brings us to the topic of instance segmentation, where now we not only care about the pixel classes, but I also want to know that these pixels belong to one instance of the dog and this next one is actually a different dog, right? To do that, what we need is an understanding of multiple objects in the image, which brings us to the topic of object detection. [00:33:44] Object detection has been, after image classification, one of the core computer vision problems and tasks, and for many years many different algorithms were proposed just for the task of object detection.
[00:34:13] We are going to fly over some of them and highlight a couple of important ones, but again, there are so many works in the literature, even in the deep learning literature, that I'm not covering here. So, over the past 10 to 15 years, how can we solve the problem of object detection? If it's just a single object, it means that we need to do the classification, generating class scores, as well as getting the coordinates of a bounding box. So you need the coordinates of the box, x, y, h, and w, as the output, as well as what class it is, right? So this is exactly the task of object detection. How can we solve this? It's very simple, right? We can define a softmax loss function for the class scores, and we can define an L2 loss function, which is a simple distance metric, a regression loss, for the box coordinates.
[00:35:24] And having these two defined, we have a multitask loss. We are solving two tasks at the same time. And for doing that, we again add the loss values and generate a compound loss function, as you can see here. So this is simple; it's doable. If we have one single object, we can for sure solve this problem using this architecture that I talked about. But this is not that easy if we have multiple objects in the scene. For three objects, we have to generate 12 output numbers, and if there are more, it's going to be too many numbers to generate. So this algorithm is not really scalable. It's just extending classification into some sort of object detection, which is fine, but it's not really scalable.
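The compound loss above can be sketched directly: softmax cross-entropy for the class scores plus a weighted L2 term for the four box coordinates. The function name and the weight `lam` are my assumptions, the weighting is a common design choice rather than something fixed by the lecture.

```python
import numpy as np

def multitask_loss(class_scores, true_class, pred_box, true_box, lam=1.0):
    """Compound detection loss: softmax cross-entropy + L2 box regression.

    `lam` balances the two tasks (an assumed hyperparameter).
    """
    # Softmax cross-entropy on the class scores (numerically stabilized).
    z = class_scores - class_scores.max()
    log_probs = z - np.log(np.exp(z).sum())
    ce = -log_probs[true_class]
    # L2 (squared-distance) regression loss on the (x, y, w, h) coordinates.
    l2 = np.sum((pred_box - true_box) ** 2)
    return ce + lam * l2

scores = np.array([2.0, 0.5, -1.0])          # e.g. cat / dog / background
loss = multitask_loss(scores, true_class=0,
                      pred_box=np.array([10., 10., 50., 40.]),
                      true_box=np.array([12., 9., 50., 42.]))
print(loss)
```

Both terms are scalars, so the gradients from classification and regression simply add during backpropagation.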
[00:36:25] So when there are multiple objects, one solution is, instead of getting the entire image as the input, why not look at bounding boxes? For each bounding box, we can say we only have one label: whether it's a cat or a dog or the background, right? And if I have this way of classifying each of the bounding boxes, I can do a sliding window. I can slide bounding boxes over the image, from coordinate (0, 0) through all combinations of x, y, h, and w, and see if we can detect the object. So, step by step, I can find the bounding boxes that have the maximum probability of each of the objects. But there is a huge problem here, right? Again, there are so many different combinations of bounding boxes that we can use, and again this algorithm is not scalable.
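A quick count shows why exhaustive sliding windows blow up: a box is one choice of top, bottom, left, and right edge, so the number of candidate windows grows quartically with image size.

```python
# Every axis-aligned box in an H x W image is a pair of horizontal edges
# times a pair of vertical edges.
def num_boxes(H, W):
    return (H * (H + 1) // 2) * (W * (W + 1) // 2)

print(num_boxes(2, 2))      # 9 boxes even in a tiny 2x2 image
print(num_boxes(224, 224))  # ~635 million candidate windows at 224x224
```

Running a full CNN on hundreds of millions of windows per image is clearly infeasible, which motivates the region-proposal methods that follow.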
[00:37:39] What we had been doing in the literature in the early years, if you look at the years these articles were published, 2014 and before, there was a lot of research around finding regions that have a high probability of containing an object: region proposals. And if I have a way to find region proposals, it's actually going to be a relatively easy problem. I can do the same thing as I explained earlier, right? For an image, if I have region proposals, I can just take that patch out, run a CNN, a convolutional neural network, on that patch, and then classify it. And I can even refine the bounding boxes. So: classify, and then refine the bounding boxes, to have the object detected.
[00:38:51] So we can classify the boxes and also refine the bounding boxes, if I have to change the coordinates a little bit. And this is what is called the R-CNN algorithm. And although it works, and again this is one of the early algorithms, CVPR 2014, these are very slow, because for each of these boxes we are running a full convolutional neural network. But there is one catch. What we can do, instead of running a convolutional neural network on each of these boxes: because convolution operations preserve the spatial information, right, they either downsample or upsample, we always have a way to track where in the pixel space they are.
[00:39:55] So in that case, what we do is, instead of running the convolutional neural network on the patches, let's say we run one big convolution on the entire image, and now we have the regions in that feature map corresponding to the entire image. Let's look at those regions and now run a smaller CNN on top of those and generate the outputs, for each of the two outputs that I want: first the box offset, like should I move the bounding box a little bit, and second what the object category is. So this is the fast version of R-CNN. These are some basic algorithms where you can use convolutional neural networks for detecting objects, their bounding boxes, and so on. The question is whether the number of proposed regions is predefined. The short answer to that is yes. I will talk very briefly about the region proposal networks too.
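The key trick, sharing one backbone pass, can be sketched as projecting an image-space box onto the shared feature map and cropping it there. The function name and the stride of 16 are assumptions (16 is the total downsampling of a typical VGG-style backbone, not a number from the lecture).

```python
import numpy as np

def roi_crop(feature_map, box, stride=16):
    """Project an image-space box onto a shared feature map and crop it.

    `stride` is the backbone's total downsampling factor (an assumed,
    typical value). box = (x1, y1, x2, y2) in image pixels.
    """
    x1, y1, x2, y2 = box
    # Snap image coordinates to feature-map cells (floor start, ceil end).
    fx1, fy1 = x1 // stride, y1 // stride
    fx2, fy2 = -(-x2 // stride), -(-y2 // stride)
    return feature_map[fy1:fy2, fx1:fx2]

fmap = np.random.rand(14, 14, 512)   # e.g. a 224x224 image at stride 16
patch = roi_crop(fmap, (32, 48, 96, 112))
print(patch.shape)                   # small region to feed the detection head
```

The expensive convolutions run once per image; each proposal only costs one cheap crop plus the small head on top, which is the speedup behind Fast R-CNN.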
[00:41:00] So, easy algorithms, right? One puts the bounding boxes of the proposed regions on the images; the other puts them on the feature maps of the convnet; and both of those generate the output class label as well as an offset for improving the location of the detected object. But this requires us to do that region proposal first, and to have a region proposal network that tells us where in the image we should look. And there has been research on building region proposal networks, RPNs. Here, what we do is we start with a CNN, we start randomly in different locations in the image, and through layers of convolution we refine those regions toward where they have a higher probability of having an object in them, because we have the object labels and locations.
So we can optim we can uh supervise this [00:42:17] we can optim we can uh supervise this and then each of those also refine the [00:42:20] and then each of those also refine the box coordinates. So basically a neural u [00:42:24] box coordinates. So basically a neural u a region proposal network what it does [00:42:27] a region proposal network what it does is it [00:42:29] is it refineses the boxes each of those boxes [00:42:32] refineses the boxes each of those boxes that have a probability a high [00:42:34] that have a probability a high probability of an object in them and the [00:42:38] probability of an object in them and the box [00:42:40] box uh the output boxes the box corrections [00:42:43] uh the output boxes the box corrections again I'm I'm leaving all of the details [00:42:45] again I'm I'm leaving all of the details about the coordinate uh coordinates and [00:42:47] about the coordinate uh coordinates and all of these u dimensionalities [00:42:50] all of these u dimensionalities for you to pick uh afterwards because [00:42:52] for you to pick uh afterwards because you it will take too much time and we [00:42:54] you it will take too much time and we don't want to spend too much time on [00:42:55] don't want to spend too much time on this uh algorithm. But what's important [00:42:58] this uh algorithm. But what's important here is back to your question, we often [00:43:02] here is back to your question, we often take the top k the ones that have the [00:43:07] take the top k the ones that have the highest probability of having an object [00:43:09] highest probability of having an object in them as the proposals for this image. [00:43:12] in them as the proposals for this image. This is an simple image. So then then [00:43:16] This is an simple image. So then then most of the and only has one object. So [00:43:18] most of the and only has one object. 
[00:43:22] So most of the regions are centered around that single object, but in general that's not the case. In many setups, we can have region proposals used in different ways, and we can get different objects with higher probabilities. And after talking a little bit about R-CNN and Mask R-CNN, which again, it's important for you to go through the details, and if you can spend some time doing the calculations yourself, that would be very good, those types of algorithms, R-CNN and Mask R-CNN, are not being used much anymore these days, because they are computationally very heavy. Although it's important to understand how we got to this point, that's true for many reasons. One of those reasons is that we need two separate networks: one region proposal network, and then one classification and box refinement network.
[00:44:28] So it's at least two passes for detecting objects, for each image, right? And that's why there have been advances after these, using single-stage object detectors, and one of the most popular ones is called YOLO. If you work with any computer vision problem, you've probably heard about YOLO, even to date, although it's a convolution-heavy network, at least in its earlier versions. In many industrial applications, YOLO is being used as the base for object detection, because it's a fast object detector, and it's very good in terms of detecting objects. What YOLO does, I want to very briefly tell you a little bit about. It's basically "you only look once": with one single pass over the image, you generate all of the bounding boxes.
[00:45:46] How it does it: it divides the image into an S by S grid, and in this example it's 7 by 7. What happens is that for each single cell in that grid, it creates a fully convolutional network that outputs the probability of an object being in that location, plus refinements of the bounding boxes. So it generates B bounding boxes, with B as a new hyperparameter, that are refinements of the object present in that cell, and it also generates object class probabilities. In this case, for example, if B equals two, it generates just two bounding boxes with different probabilities. It does this for all of the cells at the same time.
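The size of that grid output follows directly from S, B, and the number of classes C: each cell predicts B boxes of (x, y, w, h, confidence) plus C class probabilities. A shape sketch, using the 7x7 grid from the slide with B = 2 and C = 20 (the Pascal VOC setting of the original YOLO paper):

```python
def yolo_output_shape(S=7, B=2, C=20):
    """YOLO's output grid: S*S cells, each predicting
    B boxes x (x, y, w, h, confidence) plus C class probabilities."""
    return (S, S, B * 5 + C)

print(yolo_output_shape())            # (7, 7, 30), as in the original paper
print(yolo_output_shape(13, 5, 80))   # a hypothetical larger configuration
```

One forward pass fills this whole tensor at once, which is exactly why the detector is single-stage.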
So basically it's the it's the same network that is [00:46:46] it's the it's the same network that is being uh generating something as the [00:46:49] being uh generating something as the output for each of these [00:46:51] output for each of these bounding boxes and it it it does [00:46:54] bounding boxes and it it it does generate a number of different options [00:46:57] generate a number of different options for the for the for the object and as I [00:47:01] for the for the for the object and as I said each of those boxes are associated [00:47:04] said each of those boxes are associated with a probability and pro in this [00:47:06] with a probability and pro in this example the probability is shown with [00:47:08] example the probability is shown with the weight of the edges [00:47:11] the weight of the edges uh in each of those boxes. [00:47:14] uh in each of those boxes. And um for these many different bounding [00:47:16] And um for these many different bounding boxes and object probabilities now we [00:47:19] boxes and object probabilities now we can do thresholding [00:47:21] can do thresholding and also [00:47:25] um [00:47:27] um there is an algorithm that they used in [00:47:28] there is an algorithm that they used in the paper again I don't want to go into [00:47:30] the paper again I don't want to go into the details uh non uh maximal [00:47:33] the details uh non uh maximal suppression and um some some algorithms [00:47:37] suppression and um some some algorithms uh with thresholding involved that that [00:47:40] uh with thresholding involved that that identifies the ones that have the [00:47:42] identifies the ones that have the highest [00:47:43] highest probabilities. So this is this is a [00:47:45] probabilities. So this is this is a simple implementation or or use of the [00:47:50] simple implementation or or use of the uh [00:47:54] object detection. Again this is this is [00:47:57] object detection. 
[00:47:59] Again, this is something very useful: if you have time, spend time with the repositories of YOLO. There are so many different newer versions of YOLO that are being used for many applications in medicine, robotics, and also in many industrial applications. So the question is how we get this second image, and what the intuition behind it is, right? As I said, for each of the grid cells we generate B bounding boxes; for this one we generated two, and for all the others we also generate two. Each of these boxes is associated with the probability of an object existing in it, and if I put all of them together for all of the patches, I have so many boxes, and each of those is associated with a probability, right? Let's move on. One of the more recent approaches for object detection is DETR, a detection transformer.
[00:49:11] This is purely based on transformers, the topic that we discussed last week and that I started today. The same type of self-attention and cross-attention modules can also generate object detections and bounding boxes for us. How does this work? This is actually not a very old paper, 2020, almost five years ago, although it's now kind of deprecated, nobody uses this for real applications, but it's a very good example of how to use transformers for object detection. What we do here is basically similar to what we explained earlier. We turn the image into patches, and then those patches are passed through CNNs, creating tokens. Then we add positional encoding to the patches, the same way that I explained, and those define our input tokens, which are inputs to the transformer encoder.
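The patches-to-tokens step above can be sketched with shapes only. Everything here is illustrative: a plain linear projection stands in for the small CNN the lecture mentions, and the positional encodings are random placeholders for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def image_to_tokens(img, patch=16, d=64):
    """Split an image into non-overlapping patches, project each to a
    d-dim token, and add a positional encoding.
    (A linear projection stands in for the lecture's per-patch CNN.)"""
    H, W, C = img.shape
    n = (H // patch) * (W // patch)
    patches = (img.reshape(H // patch, patch, W // patch, patch, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(n, patch * patch * C))
    W_proj = rng.normal(size=(patch * patch * C, d))  # projection weights
    pos = rng.normal(size=(n, d))                     # positional encodings
    return patches @ W_proj + pos

tokens = image_to_tokens(rng.normal(size=(64, 64, 3)))
print(tokens.shape)  # (16, 64): 16 patch tokens, each 64-dimensional
```

The resulting (num_patches, d) matrix is exactly what the transformer encoder consumes.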
[00:50:23] A transformer encoder, again, is a bunch of self-attention, layer normalization (or any normalization), and MLP layers, which generates the output tokens after multiple encoder layers. Then, in order to generate the bounding boxes, and this is the smart part of this algorithm, it takes the encoder output tokens as input to the transformer decoder. But we also define some queries, which are trainable parameters themselves. If I add, for example, five queries as input, or four, or ten, or twenty, I'm seeking up to that many objects to be detected in that image. And then again, this goes through a combination of self-attention layers at the beginning of the transformer decoder, as well as cross-attention with the encoder output.
[00:51:39] Through those cross-attention and self-attention layers, it generates the output values for each of these queries, which are passed through an FFN, a feed-forward network, to generate either class labels and bounding boxes, very similar to what we discussed earlier, or, in some cases, simply "no object to be detected." And at the end we have the bounding boxes, and the classes associated with the bounding boxes, as the output. So the question is: are we inputting every possible box to the transformer? No, the inputs here are some general parameters, queries, representing the request that I want an object to be output in place of this input query. Right? So there is no box or anything as the input. It's part of the output: the network generates the class label and the box coordinates.
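A minimal shape sketch of that decoder side, heavily simplified and with all sizes illustrative: N learned queries cross-attend (single head, one layer, no self-attention or residuals) over the encoder tokens, then two linear heads stand in for the FFNs, producing C + 1 class logits (the extra slot for "no object") and 4 box coordinates per query.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, T, C = 64, 10, 16, 20          # token dim, queries, encoder tokens, classes

queries = rng.normal(size=(N, d))    # learned object queries (trainable params)
enc_out = rng.normal(size=(T, d))    # encoder output tokens

# Single-head cross-attention: each query attends over the encoder tokens.
att = queries @ enc_out.T / np.sqrt(d)
att = np.exp(att - att.max(axis=1, keepdims=True))
att /= att.sum(axis=1, keepdims=True)
decoded = att @ enc_out              # one decoded vector per query

# Per-query heads: class logits (C classes + "no object") and a box.
W_cls = rng.normal(size=(d, C + 1))
W_box = rng.normal(size=(d, 4))
class_logits = decoded @ W_cls
boxes = decoded @ W_box

print(class_logits.shape, boxes.shape)  # (10, 21) (10, 4): one prediction per query
```

The point of the sketch is the output shape: each of the N queries yields exactly one (class, box) prediction, regardless of the image content.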
[00:52:50] So the question is whether the queries are formed in a way that actually represents what we want to look for, and where in the image. In this case, what we are looking for is defined by the class labels, which are predefined, and they are part of the output. So our supervision is based on the class labels. We have a class probability vector, the same way we defined it for the other algorithms. Right? So that's how the algorithm knows what types of classes to look for. And then, in terms of the outputs, again, these outputs are supervised, if you remember, based on the L2 loss against the ground truth boxes, right? So we're not telling it anything in the query part about what to look for, or where to look, for any of the objects.
The training process itself is backpropagation: [00:53:49] if there are any losses, any errors, it backpropagates them through the outputs. So basically we are not determining anything at the beginning. [00:54:05] The question was whether the query means "give me up to nine objects", and yes, that's basically what this means; through the self-attention and cross-attention it will try to generate output tokens that are turned into class and box coordinates through that FFN operation. [00:54:28] Your question is whether those queries are image patches or not. No, they are not image patches. They are just queries, trainable parameters, that you put in to generate the outputs; for each of them you get a value as the output, and that value is turned into class and box coordinates.
[00:54:57] Again, the question is: what are object queries? They are trainable, learnable parameters. You initialize them, the network finds the best values for them, and that's what you get as the output. [00:55:09] The question is whether there's any intuition about which FFN gets which box, right? The short answer is no. We are not including anything that explicitly stops the network from generating multiple copies of the same box, but remember there are so many self-attention and cross-attention layers in there, and they interact with each other in a way that makes each query match a different output. So it's not generating the exact same thing at the output. [00:55:47] And we also have control over how we supervise those FFNs as well.
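The "queries are learnable parameters" idea can be sketched minimally: random initialization stands in for learned values, and the transformer decoder is stubbed out, since only the shapes of the query → FFN-heads path matter here. All sizes and weight matrices below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

num_queries, d_model, num_classes = 9, 32, 5   # toy sizes (assumption)

# Object queries: learnable parameters. Here they are only randomly
# initialized; training would update them by backpropagation.
object_queries = rng.normal(size=(num_queries, d_model))

# Stand-in for the transformer decoder: in DETR each query attends over
# the image features and comes out as one output embedding per query.
decoder_out = object_queries                   # placeholder embeddings

# Small FFN heads (single linear layers here) shared across queries:
W_cls = rng.normal(size=(d_model, num_classes + 1))  # +1 = "no object"
W_box = rng.normal(size=(d_model, 4))                # (cx, cy, w, h)

class_logits = decoder_out @ W_cls
boxes = 1.0 / (1.0 + np.exp(-(decoder_out @ W_box)))  # sigmoid -> [0, 1]
```

Every query thus yields one class distribution and one box, regardless of what is in the image; supervision decides which queries end up meaning "no object".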
[00:55:57] So your question is whether image segmentations, pixel-level segmentations, are part of the training. This algorithm does not require pixel-level segmentations; it's only supervised based on class labels and bounding boxes. [00:56:11] But if you have the pixel-level segmentations, you can always turn them into bounding boxes to train this algorithm, right? It just doesn't require that. [00:56:22] So the question is whether it's possible to generalize to unseen objects. And by unseen you mean a new class label.
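Turning a pixel-level segmentation into a bounding box, as mentioned, is just a min/max over the mask's foreground coordinates. A small sketch with a made-up 6×6 mask:

```python
import numpy as np

# Toy binary mask (assumption: 1 marks object pixels).
mask = np.zeros((6, 6), dtype=int)
mask[2:5, 1:4] = 1

# The tightest box around the foreground pixels:
ys, xs = np.nonzero(mask)
box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
# box is (x_min, y_min, x_max, y_max), usable as detector supervision.
```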
[00:56:37] For these types of fully supervised algorithms, there is often no way, because you are creating a class probability vector; there's no way of adding something at the end for a new class without previously knowing that there are other classes, right? So in fully supervised networks there's often no new object. We can have a background or "no object" label, as you can see we have the label of no object. [00:57:03] But there are many algorithms and extensions of these types of algorithms that are used for zero-shot learning. Zero-shot means understanding something new without having an example of it in the training data. But that's beyond this topic. [00:57:18] What happens if you have more objects in the scene than what you put in as queries? Right, that's a great question.
[00:57:32] It often generates the ones on which it has the highest confidence, so the bounding boxes with the highest confidence, and in those cases you often want to add more queries just so you can get more objects, right? [00:57:46] Okay, I'll be here to answer questions if you have any after the class, but we have a bunch of other topics to cover and I want to make sure we go over them, so that at least you get familiar with the topics. [00:58:02] So with object detection done, back to the question that was asked earlier: how can we use these types of algorithms for instance segmentation? That's actually not too hard.
[00:58:19] We talked about this when we were discussing our R-CNN algorithms, where we run a CNN on the image, then we have a region proposal network that gives us the bounding boxes, and those bounding boxes are turned into class labels and bounding-box refinements. That's what we've talked about so far, with R-CNN and so on. [00:58:44] Now we can turn this into a Mask R-CNN that also generates the mask. It's basically the same architecture we talked about earlier; we can add one more output, make it more multitask, and generate the mask predictions. [00:59:05] So what we used to do before was: image, region proposals, then the CNN gives us the class label and the box coordinates. Now we add another convolution layer that generates the mask for that object at the pixel level.
[00:59:28] And that mask could be the same size as the input image, on the layer itself. If we use a fully convolutional neural network, that's what we often get as the output for each of the objects. [00:59:44] When we have that box, even a tiny box, we can always get the mask for it: the chair under different settings of the box itself, if you have different boxes; the bed; and the human, the baby, in the image. This is an extension of the R-CNN algorithm which we call Mask R-CNN. [01:00:10] With Mask R-CNN the results have actually been very good at detecting different known objects that we could train the algorithms for. And there are so many APIs and open-source versions of object detectors that you can explore; there are some links and resources here.
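The "one more output" idea behind the mask branch can be sketched as an extra convolution over each ROI's features producing a per-class, per-pixel mask logit. A real Mask R-CNN head stacks several 3×3 convolutions and upsamples; here a 1×1 convolution is written as a channel matmul, and all sizes are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W, num_classes = 8, 14, 14, 3          # toy sizes (assumption)
roi_feat = rng.normal(size=(C, H, W))        # ROI-aligned feature map

# A 1x1 convolution over channels, written as an einsum: at every
# spatial location, mix the C input channels into one logit per class.
w_mask = rng.normal(size=(num_classes, C))
mask_logits = np.einsum('kc,chw->khw', w_mask, roi_feat)

# Per-pixel sigmoid gives a soft mask for each class.
mask_prob = 1.0 / (1.0 + np.exp(-mask_logits))
```

The class and box heads from before stay unchanged; this branch just adds a third, spatially dense output to the same multitask network.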
[01:00:42] But this all basically rounds up and summarizes some of the tasks that we wanted to cover, and it's actually very important for you to understand these tasks. They have been core computer vision tasks. [01:00:52] Although these days computer vision is way more advanced and not bound to these tasks, if you have industrial applications, for example quality control, separating rotten tomatoes from good tomatoes in an industrial pipeline, then with computer vision you need to be able to detect objects and then classify them as good or bad, right? That's why it's still important to understand these steps and pipelines and how to do them in real time. But now there are larger-scale models that you're all familiar with. [01:01:32] This summarizes the first part, the computer vision tasks that I wanted to talk about.
[01:01:42] And the last piece that I want to spend ten minutes on is visualization and understanding. Again, this has been a big lecture by itself, and from roughly 2015-16 until the 2020s, and even before that, around 2013-14, the topic of visualizing neural networks was very hot. [01:02:09] It helped us gain understanding into what the networks are learning, and I'm going to summarize some of the most important techniques here that you may need to use in your applications. [01:02:25] But before that, let me go back to the linear classifier that we talked about. We spent quite a lot of time on linear classifiers.
[01:02:36] With the linear classifiers, what we did at the end was say: if I look at the linear function, at what the network is learning, I can see a template for each of the classes. For example, for the car class you can always see a front-facing car as a template, right? [01:02:54] We can do the same with neural networks. Here we visualized the weights of the linear function from a visual viewpoint; I can do the same by visualizing the filters in the neural networks. [01:03:12] For each of the filters, the network is learning something basic: simple shapes and orientations, as you can see here.
[01:03:30] Although, this visualization can only be done for layers that have few channels. For example, if we have three channels I can put them in an RGB image and just visualize it. But as you remember, in CNNs that was not the case: we sometimes had quite a few channels in the middle layers, so it's not easy to visualize those as something we can see. [01:03:58] But in early layers, where we have fewer channels, we can visualize them and see that the network is actually learning some patterns. It starts learning patterns, and then at later stages it gets more holistic, bigger patterns. [01:04:19] If we run something called guided backpropagation, we can also visualize those, but not as simply as this.
[01:04:34] I want to highlight a couple of ways of evaluating, understanding, and visualizing neural networks which are actually quite important. One is the concept of saliency. [01:04:48] In many applications it's very important for you to know which pixels matter. For example, in a medical application, when you do a classification of tumor versus none, you want to see which parts of the image actually are the tumor, because if you want to automate this, nobody cares only about knowing whether there is a tumor or not; everybody cares about where in the image the tumor is, right? [01:05:14] The simplest setup is: we train a feed-forward neural network that generates the value, or the class label, "dog".
[01:05:32] Actually, before that: I showed you that in this case, in order to train this network, what we did was always take the derivative of the loss, or of the class score, with respect to the weights, in order to update the weights. [01:05:52] Now what I need is, for each pixel, to see how much changing the pixel value would affect the dog score, right? What does this mean? What I just explained is the meaning of the derivative, of the gradient. [01:06:17] So if I take the gradient of the score with respect to the pixel values now, not the network weights anymore, I can visualize those gradients, and visualizing them shows the pixels that matter in order to classify "dog" in this image.
[01:06:41] Those are the pixels that matter: if I change the values of those pixels, the dog score will change, right? Again, this is the basic meaning and definition of gradients that we've talked about. [01:06:54] So this is one way; if you run this on different objects that the network was trained on, this is what you get. [01:07:07] That's one way of understanding saliency, and it's very effective in many cases. But sometimes it's not just about the pixel values all the way at the back; you want to see, for each of the classes, how the activations work. And this brings us to class activation maps, or the CAM algorithm.
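The saliency recipe just described, the gradient of the class score with respect to the pixels, is easiest to see with a linear scorer, where that gradient is exactly the weight vector. A toy sketch with a finite-difference sanity check (all data random, sizes assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

n_pixels = 16
w = rng.normal(size=n_pixels)    # "dog" row of a linear classifier
x = rng.normal(size=n_pixels)    # flattened toy image

score = w @ x                    # class score s(x) = w . x

# d s / d x_i = w_i, so the saliency of pixel i is |w_i|.
saliency = np.abs(w)

# Finite-difference check on one pixel: nudge it and re-score.
eps = 1e-6
x_perturbed = x.copy()
x_perturbed[3] += eps
numeric_grad = (w @ x_perturbed - score) / eps   # close to w[3]
```

For a deep network the same quantity is computed by backpropagation to the input instead of to the weights, but the interpretation is identical: large-magnitude entries mark pixels whose change most moves the class score.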
[01:07:38] Class activation mapping, CAM, or Grad-CAM, which I will talk about in two minutes, are among the most widely used algorithms for understanding CNNs, and they can be used for other architectures too. But for transformers we have a much better way of making sense of things, which we actually talked about in the last lecture. [01:08:00] So what happens is that for each of the convolution layers we often do pooling, and the pooling generates feature maps. The feature maps are then turned into scores using those weight values. [01:08:20] If we expand the math, we can simply write the class scores in a weighted-sum form.
[01:08:35] And this means you can trace class predictions all the way back to the feature maps and to specific locations in space, because convolution layers are always mapped to locations in image space too, right? We do convolution, and that spatial consistency across all of the operations can help us trace back all the way to the image space. [01:08:58] So anyway, we can look at the feature maps and see how the class activations for each of these classes are actually impacting those locations in the image. [01:09:13] And with that, if I do this multiplication of the weights we've learned against the feature values, we create the class activations.
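The weighted-sum form is easy to check numerically: with global average pooling, the class score is Σₖ w_ck · mean(Fₖ), and reusing the same weights at every spatial location gives the CAM, M_c = Σₖ w_ck Fₖ. Toy sizes and random values below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, H, W = 4, 7, 7
feats = rng.normal(size=(K, H, W))   # last conv layer's feature maps
w_c = rng.normal(size=K)             # FC weights for one class c

# Class score: global average pooling, then the linear layer.
score_c = w_c @ feats.mean(axis=(1, 2))

# CAM: the same class weights applied per spatial location.
cam = np.tensordot(w_c, feats, axes=1)   # shape (H, W)
# Averaging the map recovers the score exactly, which is why the map
# shows which locations drive the classification.
```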
[01:09:30] And this means that I now have a way to go back to the image space, because as long as I'm in the convolutional space, I can go all the way back to the image and create these maps. For example, for each of the classes, palace, dome, church, altar, and monastery, we can have different class activation maps. [01:09:54] These are the pixels, or areas of the convolution layer, that have been driving the scores for these specific classes. It's the same for others, like class activation maps for one single object in different images. [01:10:16] But there's a problem with this: we can only apply it to the last convolution layer, because of the way we did the calculations here.
[01:10:34] And in order to solve that problem, there is one variant of the algorithm called Grad-CAM, gradient-weighted class activation mapping. It's basically the same algorithm, except we calculate the weights using gradients: we take one of the layers that created some activation at the class level, and we compute gradients instead of just calculating the multiplication between W and the feature. [01:11:06] We go all the way back with the gradients and create a weight based on them, an aggregate of all of the weights and gradients up to that specific layer, and then we weigh the feature maps with that. We also use a ReLU to only pass the positive values. [01:11:34] And that can also be shown all the way back in the image space.
[01:11:43] So I talked about CAM, which could only be applied to the last convolution layer. But in most CNN architectures we don't have just one convolution layer at the end, right? We always have some other operations, fully connected layers and so on. [01:11:59] So in order to carry this class activation back to a convolution layer when there is something else in the middle, we often use the gradients and weigh the maps with the gradient aggregates, and then we can actually do the visualization: that creates these heat maps for each of the objects. [01:12:25] So this was about CNNs, but we talked about transformers in the last lecture, and they actually inherently come with attention maps.
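Grad-CAM's recipe, average the gradients of the class score over space to get one weight per feature map, then ReLU the weighted sum, can be sketched with numerical gradients. The `head` function below is a stand-in for everything after the chosen conv layer; with this GAP-plus-linear head the gradient weights reduce to the CAM weights, which is the known sense in which Grad-CAM generalizes CAM. Sizes and values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

K, H, W = 3, 5, 5
feats = rng.normal(size=(K, H, W))   # feature maps at the chosen layer
w = rng.normal(size=K)

def head(F):
    # Stand-in for everything after the chosen conv layer: global
    # average pooling followed by a linear class score.
    return w @ F.mean(axis=(1, 2))

# Numerical gradient of the score w.r.t. every feature-map activation.
eps = 1e-6
grads = np.zeros_like(feats)
base = head(feats)
for idx in np.ndindex(feats.shape):
    F2 = feats.copy()
    F2[idx] += eps
    grads[idx] = (head(F2) - base) / eps

alphas = grads.mean(axis=(1, 2))     # one importance weight per map
grad_cam = np.maximum(0, np.tensordot(alphas, feats, axes=1))  # ReLU
```

Because the gradients come from backpropagation rather than from the last layer's FC weights, the same procedure works at any convolution layer, even with fully connected layers in between.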
[01:12:38] Do you remember that language attention matrix that Justin showed, where for each of the output words there is an attention weight over the input? We can do the same thing for pixels: for each of the outputs we can create these maps in pixel space and visualize the features of the ViTs there. So basically with ViTs and transformers this is much easier; you already have a way to visualize the attention weights. But with CNNs we often use Grad-CAM or these types of algorithms. That said, I finished the topics I thought I wouldn't be able to complete today, and next session we'll have the lecture on video understanding. Thank you.
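As a rough sketch of why this is "much easier" for a ViT: the attention row from the class token to the patch tokens is already a relevance weight per patch, so visualizing it is just a reshape. The sizes below (a 14×14 patch grid plus one CLS token, one head of one layer) are illustrative assumptions, not something fixed by the lecture.

```python
import numpy as np

# Hypothetical attention matrix from one head of one ViT layer:
# 1 CLS token + 14*14 patch tokens -> (197, 197), rows sum to 1.
rng = np.random.default_rng(0)
logits = rng.standard_normal((197, 197))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# The CLS-token row, restricted to the patch tokens, gives one
# weight per image patch -- reshape it into the spatial grid.
cls_to_patches = attn[0, 1:]           # (196,)
attn_map = cls_to_patches.reshape(14, 14)
print(attn_map.shape)  # (14, 14)
```

Upsampling this grid to the input resolution gives the attention heat map over the image, with no extra gradient machinery needed.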
================================================================================ LECTURE 010 ================================================================================
Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 10: Video Understanding
Source: https://www.youtube.com/watch?v=wElqklprhPE
---
Transcript
[00:00:05] I think at the beginning of the course we announced that we would have a few guest lecturers, people who previously taught the course, come and give a single guest lecture about a topic that they're very familiar with. And I'm very happy to announce we have the first one of those lectures today. So I'll introduce Dr. Ruohan Gao. He is an assistant professor in the department of computer science at the University of Maryland, College Park, and he leads the Multisensory Machine Intelligence Lab there. He was previously an instructor for CS231N from 2022 to 2023, while he completed his postdoc with Fei-Fei Li, Jiajun Wu, and Silvio Savarese.
[00:00:48] So without further ado, I'll leave it to Ruohan to give the presentation today. [00:00:53] Okay, thanks. Hello everyone. It's really exciting to be back in CS231N, and I'm Ruohan, just like Zen introduced. So as you can tell, I'm very interested in multimodal work: not only vision, but also how we can make use of other sensory modalities like audio, tactile, or others, just like humans do, to perceive, understand, and interact with this multisensory world.
[00:01:22] But of course vision is the most important modality, right? That's why we have this course, Deep Learning for Computer Vision. I'm sure up to this point you are very familiar with image classification: given a 2D image like this, how to assign a class label to see whether it's a dog, a cat, a truck, or a plane. That's 2D image classification. And from the last lecture I'm sure you have also learned some other tasks you can do on images, beyond just assigning a single label to say it's a cat.
[00:01:54] You can also do semantic segmentation, to segment the picture into different components with semantic meaning, like where is grass, where is cat, where is tree. You can also put a bounding box on top of the objects you detect in the image, to see where the dog is and where the cat is. And you can do instance segmentation: not only do you want to know the categories, but within each category, if there are two dogs, you want a separate segmentation mask for each instance. That's instance segmentation.
[00:02:22] There are a lot of classification and recognition tasks you can do based on 2D images, but that's not the only thing we can use a computer vision system for, right? Our world is not just static like this. So if we look at this image, hopefully up to this point you have learned a lot of tools: you can train models to classify that this is a living room, you have tools to put a bounding box to see that this is a dog and this is a baby, and you can even produce a segmentation mask to segment out where the objects you detect are in the image. So today we're going to focus on video understanding. More formally, what is video? Basically, video is just a 2D image plus time; there's an extra time dimension.
[00:03:14] So now we are tackling things not only as a 3 × H × W image but in 4D: we have 3 × T × H × W, where T is the temporal dimension and H and W are the spatial dimensions. Now we are considering videos as a volume of video frames. So an example task is video classification, just like image classification. We are given a video like this, where some person is running; we want to take this video as input, train some deep learning model, and classify whether this person is swimming, running, jumping, or whatever action he's doing, just based on this temporal stream of video frames. From the previous lectures I'm sure you have already learned some loss functions, like the cross-entropy loss, and trained an image classifier.
[00:04:13] Similarly, you can use the same tools to train a video classifier: you just get some features and use the same loss functions. So now the problem in video understanding is how we can get features of videos, so that you can apply the loss functions you have learned from the previous lectures, right? Another difference between image classification and video understanding is that the task you want to do might be a little bit different. For image classification you usually care more about the scenes and the objects: you want to classify what the object category is. For videos, just like the example I'm showing here, you usually want to classify actions.
[00:04:58] It's often actions: what activities the person or some animals are doing in the videos. That's what we usually care about in video understanding. So the nature of the things to recognize can be a little bit different. [00:05:12] Another problem we want to be careful about in video understanding is that videos are usually very big, right? When we talk about images, it's just 3 × H × W, a single tensor of RGB numbers. But now when we consider videos, it's a sequence of frames, and it can be 30 frames per second.
[00:05:35] In movies we can sometimes have even higher spatial and temporal resolution. And if you consider the space needed to store videos: standard-definition video takes about 1.5 gigabytes per minute, and if we consider high definition, 1,920 × 1,080, it takes about 10 gigabytes per minute. So it takes a gigantic amount of space to store this kind of video data, and there's no way for us to fit this kind of data directly onto GPUs, right? Beyond the input itself, there are other things you have to store, like the weights and the activations in your convolutional neural networks, so your model will be very huge.
[00:06:36] And what solutions can we use to make videos smaller, to make them processable? One simple solution is that we just make videos smaller, right? Although the original high-definition videos are long, we can shrink things both temporally and spatially. For example, for a 3.2-second video like this, maybe we don't need all the frames in each second; let's just take five frames per second, because there is a lot of redundancy across video frames. If we take five frames per second and also use a smaller spatial resolution like 112 × 112, now we can make the video much smaller; for example, it's about 588 KB for this short clip. But we can definitely also use a larger resolution if we have the compute.
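The storage figures above check out with back-of-the-envelope arithmetic on uncompressed RGB frames (3 bytes per pixel). The assumption here is that "standard definition" means 640 × 480 at 30 fps; the lecture doesn't pin that down, but it reproduces the quoted numbers.

```python
def raw_bytes(width, height, fps, seconds, channels=3):
    """Uncompressed size of an RGB video, 1 byte per channel."""
    return width * height * channels * fps * seconds

# Standard definition, one minute: ~1.5 GB
sd = raw_bytes(640, 480, fps=30, seconds=60)
print(sd / 1024**3)   # ~1.54 GB

# Full HD 1920x1080, one minute: ~10 GB
hd = raw_bytes(1920, 1080, fps=30, seconds=60)
print(hd / 1024**3)   # ~10.4 GB

# Downsampled clip: 112x112 at 5 fps for 3.2 s = 16 frames
clip = raw_bytes(112, 112, fps=5, seconds=3.2)
print(clip / 1024)    # 588.0 KB
```

Real codecs compress far below these raw sizes, but the raw numbers are what matter once frames are decoded into tensors for training.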
[00:07:35] Just like with images, how do we train a model on long videos? In the previous slide I showed that we are training this video classifier on a 3.2-second clip, right? But videos can be very long, minutes or even hours. So one thing people do is train on clips: we train on chunks of video frames. We train models to classify short clips at some low FPS (frames per second), using a sliding window: we sample a lot of different clips, use them as training data, and train a classifier. Then during testing, at inference time, we just run the model on different clips: we sample a few clips, maybe 10 clips, average the prediction results, and that is our prediction for the long video.
[00:08:29] So then what is the simplest video classification model we can use? As I have mentioned, a video is basically just a sequence of image frames. So one simple thing is that we just treat them as images, right? That's the simplest tool we already have: we just run a single-frame convolutional neural network. We have learned that we can train an image classifier, and if we just run our image classifier on top of those video frames, treating them as images, we can indeed get decent predictions, especially on a video like this: you can see that there are not many changes across frames. The person is running; maybe there are some different body movements, but generally it looks pretty similar, right?
[00:09:17] Maybe you just run an image action classifier on every frame; maybe all of the frames will tell you it's running, and if you average the prediction results from each video frame, then you'll predict running for this particular video. This simple image classifier is actually usually a very, very strong baseline, especially for a video like this, because there are not too many changes across frames. So if you are trying to design a video classifier, you should always run this first, because it's a simple thing to try and maybe you can already get pretty decent results. So the question is whether we run on a single frame or on a chunk of frames.
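The per-frame baseline boils down to a few lines. In this sketch, `frame_probs` stands in for the softmax outputs of any pretrained image classifier run on each sampled frame; the classifier itself is not implemented here.

```python
import numpy as np

def classify_video_per_frame(frame_probs):
    """Average per-frame class probabilities, then take the argmax.

    frame_probs: (T, C) array of softmax outputs from an image
    classifier applied independently to T sampled frames.
    """
    video_probs = frame_probs.mean(axis=0)   # (C,)
    return video_probs.argmax(), video_probs

# Toy example: 4 frames, 3 classes; most frames vote for class 1.
probs = np.array([[0.2, 0.7, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3]])
label, avg = classify_video_per_frame(probs)
print(label)  # 1
```

Clip-based inference on long videos works the same way, except each row would be the prediction for a sampled clip rather than a single frame.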
[00:10:01] For this simple single-frame setting, basically you have a video of, say, 30 frames; you sample a few frames, run an image classifier on those sampled frames, treat them as images, and directly average the results. That's basically the per-frame baseline. So I think you asked a very important question: how to sample the frames. That's a key question, because we're given a giant video, we want to sample some frames, and we want to run a CNN on them. So how do we get those frames? That is actually an active area of research. One simple way is to do random sampling. If you have a one-hour video, I don't know where the interesting or important parts are, right?
[00:10:41] We just sample, say, one frame every minute, then run the image classifier and average the results. But obviously, while this gives some reasonable results, it may not be the smartest way to do the sampling. There are other methods that propose smarter sampling strategies: maybe you sample one frame, then use that decision to decide where else to sample. I actually have some examples later in the lecture slides.
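The two simple strategies mentioned here, evenly spaced and random sampling, can be sketched as index selection over the frame count. The function name and the 60-frames-from-one-hour example are illustrative assumptions; smarter content-aware samplers would replace this logic entirely.

```python
import numpy as np

def sample_frame_indices(num_frames, k, strategy="uniform", seed=0):
    """Pick k frame indices from a video with num_frames frames.

    "uniform": evenly spaced frames (e.g. roughly one per minute);
    "random":  k distinct frames chosen uniformly at random.
    """
    if strategy == "uniform":
        return np.linspace(0, num_frames - 1, k).astype(int)
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(num_frames, size=k, replace=False))

# A 1-hour video at 30 fps, sampling 60 frames (~one per minute):
idx = sample_frame_indices(60 * 60 * 30, k=60)
print(len(idx), idx[0], idx[-1])  # 60 0 107999
```

Each selected index is then decoded into a frame and fed to the per-frame classifier.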
[00:11:05] Okay. So this is a very simple kind of video classifier: we just adopt an image classifier, a single-frame CNN. Maybe we can take one step further: instead of directly running a single-frame CNN and averaging the prediction results, maybe we can do some fusion across the features from the single-frame CNN. This is often called late fusion. Basically the idea is that we still take some 2D CNN and we have an input of, say, T frames. For each frame we use the 2D CNN to extract a feature map of D × H′ × W′, and because we have T frames, we get T feature maps. Then the simple thing is that we flatten all the feature maps into vectors and concatenate them.
[00:12:10] Then we have a giant feature vector that basically contains all the features across all the frames, right? And then we can use tools we have learned, like fully connected networks: we train an MLP that maps this vector to some lower dimension, and then we train a classifier on top of it to map it to class scores C. This is called late fusion because, as you can see, we extract the feature maps and process each frame independently, and then at a very late stage we concatenate the feature vectors and run some fully connected layers to do the classification.
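The shape bookkeeping for late fusion by concatenation can be sketched as below. The sizes (T = 8 frames, D × H′ × W′ = 64 × 7 × 7 per-frame features, C = 10 classes) are illustrative, and a random matrix stands in for the MLP; the point is how large the first fully connected layer becomes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame CNN features: T frames, each D x H' x W'.
T, D, Hp, Wp, C = 8, 64, 7, 7, 10
feats = rng.standard_normal((T, D, Hp, Wp))

# Late fusion by concatenation: flatten every frame's feature map
# and stack them into one giant vector of length T*D*H'*W'.
fused = feats.reshape(-1)                  # (25088,)

# A single linear layer standing in for the MLP classifier --
# note how many parameters even this one layer needs.
W = rng.standard_normal((C, fused.size)) * 0.01
b = np.zeros(C)
scores = W @ fused + b                     # (C,) class scores
print(fused.size, W.size)                  # 25088 250880
```

A quarter of a million parameters for one small clip at one layer is exactly the inefficiency discussed next, and it grows linearly with T.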
[00:12:49] This is useful, but one drawback that you can probably already tell from my description is that this fully connected layer is going to introduce a lot of parameters. If we flatten the features and concatenate them across time, then depending on how large T is, you can have a giant feature vector, and mapping this giant feature vector into some lower dimension requires a very large fully connected layer, which introduces a lot of parameters. So it's not very efficient. Another way to do this is that, instead of concatenating them into the giant feature vector and then having a fully connected layer map them to scores,
can actually just do simple pooling. [00:13:40] With pooling, you don't increase the length of the feature vector: if you have some feature dimension for a single frame and you pool across time, over these T frames, you are doing a pooling for temporal aggregation. So instead of a vector of size D times T, you still have a feature vector of size D after pooling, and then you have a linear layer to map D to the dimension C that matches the class scores, and you train with a cross-entropy loss on top of it. That's also late fusion, but now we are using pooling. The good side here is that you don't have to have a very large fully connected layer, but the pooling can also get rid of information that may be important.
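The pooling variant can be sketched like this (again with made-up toy sizes); note how the classifier shrinks from T·D inputs to D inputs.

```python
import torch
import torch.nn as nn

# Late fusion with temporal pooling instead of concatenation (toy sizes):
# average the T per-frame feature vectors into a single D-dim vector, so
# the classifier is a small D -> C layer instead of a huge T*D -> C one.
T, D, C = 8, 64, 10
feats = torch.randn(2, T, D)            # per-frame features from a 2D CNN

pooled = feats.mean(dim=1)              # temporal mean pooling -> (B, D)
# pooled = feats.max(dim=1).values      # max pooling is the other common choice

classifier = nn.Linear(D, C)            # D -> C, trained with cross-entropy on top
scores = classifier(pooled)
print(scores.shape)                     # torch.Size([2, 10])

# Weight-count comparison with the concatenation version:
concat_params = (T * D) * C             # weights of a T*D -> C layer
pooled_params = D * C                   # weights of a D -> C layer
print(concat_params // pooled_params)   # 8: the concat head is T times larger
```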
[00:14:34] So that's kind of the downside of this operation. The reason I'm calling it late fusion, the important part is "late", right? And when it's late, maybe some information has already been lost while using these 2D convolutional networks to process the images. For example, as shown in these red circles here, what is very important for recognizing this video is actually the motion of this man's feet: they are moving up and down, up and down, and you can maybe tell that he's running, right?
[00:15:07] So if we just use a single 2D CNN to process the frames independently, as 2D images, and extract feature maps, then maybe up to a very late stage the feature maps don't actually contain the information about this movement of the man's feet anymore. So some information, the feet going up and down as shown in these red circles, should be a useful cue, but now it's not there in the feature maps. The intuition is that if you extract features from the early layers, which are very close to the original video frames, there is a larger chance they will contain this low-level kind of information, the movement from the video frames; and then you can concatenate them or pool them across time.
[00:15:59] That will let you analyze the motion across time. But because we are processing through a lot of convolution, pooling, convolution, pooling, by a very late stage the features contain more high-level information instead of this low-level motion information. So that's why it's most likely lost there. So that's the downside of late fusion. So instead of doing late fusion, we can actually do early fusion.
[00:16:24] So to do early fusion, if we want to make use of feature vectors closer to the actual video frames, we can take the input and directly reshape it to 3T × H × W: we directly aggregate the information temporally from the very beginning. Then the first 2D convolution directly maps the channel dimension from 3T to D. Basically, we use the 2D convolution to process this temporal information in the first layer, so all the information from the frames is processed at the very beginning of the convolutional neural network. The rest of the network is then a standard 2D CNN, and the only difference is that now we destroy and collapse all the temporal information into a
single layer, and the rest is just like image classification: you do the classification using a standard cross-entropy loss. (On the pooling question: for each frame we get a feature vector of dimension D, so each single frame gives you a feature of dimension D and you have T of these feature vectors. For pooling, we pool over the features: we can do mean pooling to average the features, or max pooling to take the max over the features, and after that we still get a feature of dimension D. So it pools over the features, not over the frames.) [00:17:48] Okay, so that's early fusion. The downside of early fusion is that although we explicitly try to handle the motion from the early layers, we are being too ambitious: we're trying to capture everything in a
single layer. We just concatenate all the frames and collapse all the temporal information in a single convolution layer, and maybe that's not going to achieve what we wanted to achieve. So then another solution is that instead of doing late fusion or early fusion, maybe we should do something in between. That's kind of like slow fusion, and that's exactly what a 3D convolutional network is doing. The intuition is that we want to use the 3D versions of convolution and pooling to slowly fuse information over the course of the network. Instead of doing it at a very late stage or at a very early stage, we gradually shrink the temporal dimension and spatial dimensions to get 3D feature maps. So that's the idea of a 3D convolutional network: we just use 3D convolution and 3D pooling operations.
[00:18:57] So what are 3D convolution and 3D pooling? You have learned 2D convolution: you take an image, say 32 × 32 × 3, and for each kernel you have a filter, maybe a 5 × 5 × 3 convolution kernel, that runs in a sliding-window fashion across space. For each computation it maps that window to a single value in the final activation map, and finally you obtain an activation map of 28 × 28 × 1 in this case. You convolve over all spatial locations, and the filter goes all the way over the channel dimension, mapping the depth from
three down to one in this case. So that's 2D convolution. [00:20:03] The difference is that for 3D convolution we now have one extra dimension. Here you can think of the input as C × T × H × W; the extra thing is this T dimension, the temporal dimension. But what I'm showing here, because we can only draw things in 3D, not in 4D, there's actually one dimension that is not shown: the channel dimension C is not shown here.
[00:20:33] So you can think that, for each grid point in this feature map, there are C features at that grid point. Then for this 3D convolution, say a 6 × 6 × 6 kernel, because it has one extra dimension, instead of sliding only over the spatial dimensions H and W of the images, we are now sliding over this cube of dimensions T × H × W. So it covers both the spatial dimensions and the temporal dimension, and it also goes all the way along the channel dimension.
[00:21:19] So then, the rest works just like 2D convolution; it just has this extra dimension. You get the 3D 6 × 6 × 6 convolution, and maybe another layer of 5 × 5 × 5, and finally, after processing with these 3D convolution operations, you flatten the feature vectors and then use fully connected layers to map them to the class scores. So that is basically the idea of 3D convolution.
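The two pieces above can be sketched as follows; the layer sizes are illustrative, not a prescribed architecture. The first lines reproduce the 32 × 32 × 3 → 28 × 28 × 1 shapes from the 2D recap, and the small 3D CNN slides (t, h, w) kernels over the clip and ends with a fully connected layer.

```python
import torch
import torch.nn as nn

# 2D recap: a 5x5x3 filter over a 32x32x3 image gives a 28x28x1 map.
img = torch.randn(1, 3, 32, 32)
fmap = nn.Conv2d(3, 1, kernel_size=5)(img)
print(fmap.shape)                            # torch.Size([1, 1, 28, 28])

# A tiny 3D CNN (illustrative sizes): Conv3d slides a (t, kh, kw) kernel
# over the T x H x W cube and, like Conv2d, spans the full channel
# dimension; MaxPool3d shrinks time and space together.
num_classes = 10
net = nn.Sequential(
    nn.Conv3d(3, 12, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
    nn.ReLU(),
    nn.MaxPool3d(2),                         # halves T, H, and W
    nn.Conv3d(12, 24, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),                            # (B, 24)
    nn.Linear(24, num_classes),              # fully connected -> class scores
)

clip = torch.randn(2, 3, 20, 64, 64)         # (B, C, T, H, W)
scores = net(clip)
print(scores.shape)                          # torch.Size([2, 10])
```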
[00:21:49] So let's walk through some toy examples to better understand and compare early fusion, late fusion, and 3D convolutional networks, just to give you a flavor of how they work. In practice the networks can definitely be much larger and more complicated, but here I'm just using a toy example to walk through the sizes of the feature maps and also the receptive fields, to give you a sense of the differences between early fusion, late fusion, and 3D convolutional networks.
[00:22:21] So for late fusion, you can think that the input is, for example, 3 × 20 × 64 × 64, where 20 is the temporal dimension and 64 × 64 is the spatial dimension, and you use 2D convolutions. Because we're doing late fusion, we don't do anything over the temporal dimension initially: we just keep the temporal dimension at 20 and build up the receptive field spatially. We have a Conv2D layer to map the channel dimension from 3 to 12 but keep the temporal dimension at 20, and then maybe we use some pooling layers; we still haven't done anything with the temporal dimension, so it's still 20, but because of the pooling operation we build up the receptive field in the spatial dimensions. Then gradually we
maybe use another Conv2D layer, and now the feature map is 24 × 20 × 16 × 16; we have gradually increased the spatial receptive field but still kept the temporal dimension at 20, doing nothing over time. And finally, using a single global average pooling, we pool across the 20 × 16 × 16 feature map, over both time and the spatial dimensions, and from 20 × 16 × 16 we get a 1 × 1 × 1 feature point. So basically we collapse everything in the final single layer, and we build up the temporal receptive field in that single layer. That's late fusion. So then for early fusion, what's the difference? Instead of building slowly in space and all at once in time at the end, now we build slowly in space and all at once in time at the very beginning.
[00:24:02] So the input is still 3 × 20 × 64 × 64, but now we're just using a single Conv2D layer. We treat this 3 × 20 as the channel dimension, treat all of it as channels, and map it to 12. So basically we use a single 2D convolution layer to collapse all the temporal information from the very beginning: we build the temporal receptive field in the first layer, so the temporal receptive field jumps from 1 to 20. Then the spatial receptive field gradually builds up, and we use pooling and Conv2D to build up the spatial dimensions just as in late fusion. And finally we use a global average pooling, but now the global average pooling is only doing the averaging, the pooling, across space.
[00:24:54] So we build slowly in space, but all at once at the very beginning. That's early fusion. So then what is a 3D convolutional network? For a 3D convolutional network, we basically build slowly both in space and in time. That's why we call it slow fusion. The input can still be the same 3 × 20 × 64 × 64, but now we are using 3D convolutions. In the first layer we map the channels from 3 to 12, and in this case we also keep the temporal dimension, while building up a little bit of temporal and spatial receptive field. Then we use a pooling layer, say a 4 × 4 × 4 pooling layer, and we pool a little bit over the temporal features and also the spatial features, further building up both the spatial and temporal receptive fields.
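The two first-layer strategies can be checked on the toy 3 × 20 × 64 × 64 input; the kernel and pool sizes below are illustrative, not the lecture's exact numbers.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 20, 64, 64)            # (B, C=3, T=20, H=64, W=64)

# Early fusion: one Conv2d treats all 3*20 = 60 input planes as channels,
# so the temporal receptive field jumps from 1 to 20 in the first layer.
early = nn.Conv2d(3 * 20, 12, kernel_size=5, padding=2)
h_early = early(x.flatten(1, 2))             # time is collapsed immediately

# Slow fusion: Conv3d keeps T, then a 4x4x4 pool shrinks time and space
# together, so the temporal receptive field grows gradually instead.
conv3d = nn.Conv3d(3, 12, kernel_size=(3, 5, 5), padding=(1, 2, 2))
h_slow = nn.MaxPool3d(4)(conv3d(x))          # T: 20 -> 5, H and W: 64 -> 16

print(h_early.shape)                         # torch.Size([1, 12, 64, 64])
print(h_slow.shape)                          # torch.Size([1, 12, 5, 16, 16])
```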
[00:25:49] And we have another Conv3D layer to further build up the spatial and temporal receptive fields, and finally we use a global average pooling, but now we're pooling over this 4 × 16 × 16 feature map, further increasing the temporal and spatial receptive fields. So we are building up gradually in both space and time. So that's kind of the difference between early fusion, late fusion, and 3D convolutional networks. You can see that both early fusion and 3D convolutional networks build a receptive field over time, right? But what's the actual difference? So let's look at it more closely. Think of a feature vector for each spatial grid point. The convolution filter, if it's a 2D convolution,
for this grid point will consider everything along the temporal dimension, T = 16 here. So it is local in space but extends fully in time; that's the filter in the 2D convolutional network. But what is the problem? Think about it: if we directly go all the way through the time dimension with this 2D convolution, what problem is going to happen? The shortcoming is that there will be no temporal shift invariance, because the filter now extends fully in time. Suppose we want to learn some global transition in color that can happen at different times. It's a video, so when we recognize temporal information, there may be some change, say from blue to orange, at different time steps.
[00:27:46] Maybe there's some change happening at time step 4, and another identical change happening at time step 15; it's the same change, from blue to orange. If we go all the way through time, with the filter extending fully in time, then to learn this same transition at different times we have to have a whole separate filter, we have to learn a different kernel, for each of these transitions at each different time stamp. So there's no temporal shift invariance. So how do we recognize this kind of blue-to-orange transition anywhere in space and time? Just like when we are doing image classification, we want to have some spatial invariance.
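A toy 1D illustration of this invariance point, under the assumption that the transition detector is a simple finite-difference kernel: the same small sliding filter fires at whichever time step the change occurs, whereas a filter spanning all of T would need separate weights for each timing.

```python
import torch
import torch.nn as nn

with torch.no_grad():
    # Hand-set temporal filter that fires on a 0 -> 1 jump.
    step = nn.Conv1d(1, 1, kernel_size=2, bias=False)
    step.weight.copy_(torch.tensor([[[-1.0, 1.0]]]))  # finite difference

    a = torch.zeros(1, 1, 20); a[..., 4:] = 1.0       # transition at t=4
    b = torch.zeros(1, 1, 20); b[..., 15:] = 1.0      # same transition at t=15

    # The same shared weights detect both; output peaks track the jumps.
    print(step(a).argmax().item(), step(b).argmax().item())  # 3 14
```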
We want to okay be able to recognize the [00:28:37] We want to okay be able to recognize the cat the image contains a cat no matter [00:28:39] cat the image contains a cat no matter where the cat is on the right corner [00:28:40] where the cat is on the right corner left corner right we want to to share [00:28:43] left corner right we want to to share the you know the the the kernels to be [00:28:45] the you know the the the kernels to be able to you know to recognize things at [00:28:47] able to you know to recognize things at different at a different spatial [00:28:49] different at a different spatial location here we want to be able to [00:28:51] location here we want to be able to learn this different types of motion [00:28:53] learn this different types of motion different time this temporal patterns at [00:28:55] different time this temporal patterns at different you know temporal time steps [00:28:57] different you know temporal time steps so that's kind of similar idea so then [00:28:59] so that's kind of similar idea so then the that's exactly the benefit of 3D [00:29:01] the that's exactly the benefit of 3D convolution neuronet networks right Now [00:29:04] convolution neuronet networks right Now instead of extends fully in time right [00:29:06] instead of extends fully in time right in this that t dimension originally for [00:29:09] in this that t dimension originally for this uh early fusion t extends all the [00:29:11] this uh early fusion t extends all the way in the temporal dimension t is [00:29:13] way in the temporal dimension t is equals to 16 but now t t is equal to [00:29:16] equals to 16 but now t t is equal to three and we can slide over the temporal [00:29:18] three and we can slide over the temporal dimension right just like uh we learn [00:29:20] dimension right just like uh we learn this uh spatial invariance using filter [00:29:22] this uh spatial invariance using filter on local regions now this count filter [00:29:24] on local regions 
now this count filter only span a local window in time and [00:29:27] only span a local window in time and slide over in the time dimension. So the [00:29:30] slide over in the time dimension. So the then the the benefit is that now we can [00:29:32] then the the benefit is that now we can have some temporal shift invariance [00:29:35] have some temporal shift invariance because each filter slides over time. So [00:29:37] because each filter slides over time. So we can reuse this filter to recognize [00:29:39] we can reuse this filter to recognize different motion patterns uh across uh [00:29:42] different motion patterns uh across uh these dimensions. So the transition from [00:29:44] these dimensions. So the transition from blue to orange can now be recognized at [00:29:46] blue to orange can now be recognized at every moment in time. Right? Uh and then [00:29:50] every moment in time. Right? Uh and then the benefit of of this is that we don't [00:29:52] the benefit of of this is that we don't have to have separate filters, right? [00:29:53] have to have separate filters, right? Then now we are more efficient, more [00:29:55] Then now we are more efficient, more representation efficient. we don't need [00:29:57] representation efficient. we don't need to know separate futures anymore. So [00:30:00] to know separate futures anymore. So that's basically the main difference [00:30:02] that's basically the main difference between 2D con early fusion and the 3D [00:30:05] between 2D con early fusion and the 3D convolutional network. [00:30:08] convolutional network. 
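To make the shift-invariance point concrete, here is a minimal toy sketch (my own illustration, not from the lecture slides): a 1-D "color over time" signal and a short temporal filter slid across it, the way a 3D convolution's t = 3 kernel slides over the clip. The helper names `make_clip` and `detect` are hypothetical.

```python
import numpy as np

# Toy illustration (my own, not from the slides): "color over time" as a 1-D
# signal with 0 = blue and 1 = orange, over T = 16 time steps.
T = 16

def make_clip(step):
    """A clip whose color flips from blue (0) to orange (1) at `step`."""
    clip = np.zeros(T)
    clip[step:] = 1.0
    return clip

# A short temporal filter (like the t = 3 kernel of a 3D convolution) that
# responds to a 0 -> 1 transition. Sliding it over time is what gives
# temporal shift invariance.
transition_filter = np.array([-1.0, 0.0, 1.0])

def detect(clip):
    """Slide the filter over time; return where the transition was found."""
    responses = np.correlate(clip, transition_filter, mode="valid")
    return int(np.argmax(responses)) + 2  # offset to the transition index

print(detect(make_clip(4)), detect(make_clip(12)))  # → 4 12
```

The same small filter finds the transition at time step 4 and at time step 12; a filter spanning all 16 steps would need a separate copy per position.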
[00:30:08] Also, in the last lecture you saw some tools we can use to visualize what a 2D convolutional network has learned. Similarly, we can visualize the filters of a 3D convolutional network, this time as short video clips, because each filter now extends in both space and time. [00:30:48] Looking at the learned filters of the 3D convolutional network as video clips, some of them look just like the filters of a simple image classifier: color patterns and various edges. But you can also see filters that capture a temporal transition, from one color to another, or from one edge pattern to another. So some filters don't learn motion and focus on appearance, like the color patterns, while others learn motion in different directions. We can visualize the kernels like this to interpret them. [00:31:29] To the question about slow fusion: basically there are two differences, and one is in the convolution operation itself. Yes, indeed, this is basically 3D convolution.
[00:31:37] And 3D convolution and 2D convolution are genuinely different: you have an additional dimension of convolution, the temporal dimension. So the main difference is this temporal dimension in the convolution operation, but practically, if you use 3D convolutions, the network also gradually builds up its receptive field over both space and time. [00:32:01] So we have talked about 3D convolutional networks as architectures, but what data can we use, in the way ImageNet serves image classification, to train a video classifier? One example challenge dataset that people have been tackling is Sports-1M, introduced in 2014.
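Returning briefly to the receptive-field remark above, here is a back-of-the-envelope sketch (my own helper, assuming stride-1 layers) of how the receptive field grows as 3x3x3 convolution layers are stacked: each layer adds (kernel - 1) steps along every dimension, including time.

```python
# A back-of-the-envelope sketch (stride 1 assumed) of how the receptive
# field grows when 3x3x3 convolution layers are stacked: each layer adds
# (kernel - 1) steps along every dimension, including time.
def receptive_field(num_layers, kernel=3):
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf

for layers in (1, 2, 4, 8):
    print(layers, receptive_field(layers))
# After 8 layers, each output voxel already sees 17 time steps, enough to
# cover a whole 16-frame clip, so the network builds up its view gradually.
```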
[00:32:28] With this dataset the task is very fine-grained sport category classification. In the figure, blue shows the ground truth, below it are the top-five predictions, green marks a correct prediction, and red an incorrect one. The categories are very fine-grained: there are 487 different types of sports, for example marathon versus ultramarathon (I actually don't know the difference between them), so the dataset covers many closely related sports categories. [00:33:09] And here are some results from training the different kinds of classifiers we have talked about on Sports-1M. One very shocking result is the single-frame model, which I asked you to try first if you want to develop a video classification model: it actually performs very well. Trained purely as an image classifier, it already gives about 77.7% top-five accuracy. The early fusion we talked about actually does slightly worse, late fusion does slightly better, and a 3D convolutional network gets roughly a 2-3% boost on this dataset. [00:33:59] So the takeaway message is: definitely try the single-frame model, because it usually works pretty well. And the 3D convolutional network shown here is the one used in 2014.
[00:34:19] But over the past ten years we have seen a lot of advancements, so the numbers have also gotten much better, as I will talk about in a later slide. To the question: yes, for both training and testing it just treats videos as images and trains an image classifier; that is exactly what the single-frame model does. If I understand the question correctly, it uses an image classifier, but it is trained on many frames drawn from the videos, not a single frame per video. [00:34:43] Also, this is a huge dataset, because, as I mentioned, videos are very large. When people share video datasets, we cannot do it the way ImageNet is shared, where people just download from a database, because videos are really huge; this dataset has around one million videos.
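A hedged sketch of that single-frame baseline: classify each frame independently with an ordinary image classifier, then average the per-frame class scores into one video-level prediction. `frame_classifier` and `classify_video` are hypothetical names, and the "classifier" here is just a fixed random projection plus softmax standing in for a trained network.

```python
import numpy as np

# Sketch of the single-frame baseline: classify each frame independently,
# then average per-frame class probabilities into one video prediction.
# `frame_classifier` is a stand-in (random projection + softmax), not a
# trained model.
rng = np.random.default_rng(0)
NUM_CLASSES = 487                              # Sports-1M sport categories
W = rng.normal(size=(3 * 8 * 8, NUM_CLASSES))  # toy 8x8 RGB "frames"

def frame_classifier(frame):
    logits = frame.reshape(-1) @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()                         # per-frame class probabilities

def classify_video(frames):
    probs = np.mean([frame_classifier(f) for f in frames], axis=0)
    return int(np.argmax(probs))               # video-level prediction

video = rng.random((16, 3, 8, 8))              # 16 frames, channels-first
print(classify_video(video))
```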
[00:35:07] It is not really feasible to download and host all of them for sharing. When this dataset was originally released, it was actually shared as a list of YouTube URLs. But one thing you can expect with YouTube URLs is that people modify and delete their videos, so the original list may have had one million videos, but by now perhaps half of them are already gone. So, for this reason, the dataset is not very stable. [00:35:39] OK. So, as I mentioned, 3D convolutional networks have been improving gradually since around 2014. One early, popular version is a model called C3D, and it is actually very, very simple.
[00:36:03] Basically, it is very similar to the VGG architecture we used for 2D image classification, except everything is converted to three-dimensional convolutions. The 3D CNN uses 3x3x3 convolutions and 2x2x2 pooling (with some changes in the first layer), so the overall architecture is very much like VGG, just with the extra dimension; that is why it is called the VGG of 3D CNNs. The model was trained on the Sports-1M dataset I just mentioned. And because it was introduced in 2014, imagine training such a model then: it needs a lot of compute, and not many people had access to a lot of GPUs at that time.
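A sketch of how tensor shapes evolve through a C3D-style stack: 3x3x3 convolutions with padding 1 keep (T, H, W) unchanged, and 2x2x2 poolings halve all three. I am assuming the published C3D detail that the first pooling is 1x2x2 so time is not collapsed too early; the exact layer counts here are illustrative, not the full architecture.

```python
# Shape walk through a C3D-style stack: 3x3x3 convs (padding 1) preserve
# (T, H, W); 2x2x2 poolings halve all three. First pooling assumed 1x2x2,
# as in the published C3D, so time is not collapsed too early.
def conv3x3x3(shape):
    return shape                    # padding 1, stride 1: size preserved

def pool(shape, k):
    t, h, w = shape
    kt, kh, kw = k                  # kernel = stride = k in each dimension
    return (t // kt, h // kh, w // kw)

shape = (16, 112, 112)              # a 16-frame clip of 112x112 crops
shape = pool(conv3x3x3(shape), (1, 2, 2))   # first pool keeps time
for _ in range(4):
    shape = pool(conv3x3x3(shape), (2, 2, 2))
print(shape)  # → (1, 3, 3): time is fully collapsed by the last stage
```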
[00:37:09] This model was actually trained at Facebook, and they released the pretrained weights: they trained this 3D model on Sports-1M and released the pretrained model as a feature extractor. So many people who could not afford to train a video model themselves started to use it as a feature extractor: you take a video, extract features from it with the pretrained C3D model, and then maybe train some other classifier on top. People started to use it that way, and that is how it got popular. [00:37:45] The question, basically, is about video classification: how many frames should we take as input when extracting the features?
[00:37:52] For all the models we are talking about here, we assume we pass in a clip of predefined length, like 16 or 32 frames: you train a single model that always takes 16 (or 32) frames as input. There are other techniques, which we will talk about, for aggregating these clip-level predictions, but for now we are just doing clip-level feature extraction.
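The clip-level pipeline just described can be sketched as follows: cut the video into fixed-length 16-frame clips, map each clip to a feature vector with a frozen pretrained model, and average the clip features into one video-level descriptor for a downstream classifier. `c3d_features` here is a random stand-in for the real frozen network, present only to show the data flow.

```python
import numpy as np

# Clip-level feature extraction with a frozen pretrained model: cut the
# video into 16-frame clips, featurize each, average into one descriptor.
# `c3d_features` is a random stand-in for the real frozen network.
rng = np.random.default_rng(0)
CLIP_LEN, FEAT_DIM = 16, 4096       # C3D's fc features are 4096-d

def c3d_features(clip):
    return rng.normal(size=FEAT_DIM)   # stand-in: any clip -> 4096-d vector

def video_descriptor(video):
    n_clips = len(video) // CLIP_LEN
    clips = [video[i * CLIP_LEN:(i + 1) * CLIP_LEN] for i in range(n_clips)]
    return np.mean([c3d_features(c) for c in clips], axis=0)

video = rng.random((80, 3, 112, 112))   # 80 frames -> 5 clips of 16
print(video_descriptor(video).shape)    # → (4096,)
```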
[00:38:17] The downside of this 3D CNN is that it is very computationally expensive: we basically just converted the VGG style from 2D to 3D in a brute-force way. The GFLOPs figure measures how many billions of floating-point operations a single forward pass needs, basically whether the network is efficient. AlexNet takes 0.7 GFLOPs, VGG-16 takes about 13.6 GFLOPs, but C3D, obtained by this direct mapping from 2D to 3D, takes about 39.5 GFLOPs, which is 2.9 times VGG. So it is not very efficient; that is the downside of this kind of network. [00:39:10] And if we look at performance on Sports-1M, C3D now gets about a 4% gain in top-five accuracy. [00:39:25] But this is just one example of a 3D convolutional network; there can definitely be others. We have talked about a lot of tricks for 2D image classification, like the residual connections you have seen in ResNet, and we can certainly apply those too, improving, say, C3D by adding residual connections or other techniques from 2D convolutions. Indeed, there is a lot of work, and many papers, on improving these different types of 3D video architectures. [00:39:58] But apart from that, let's think a bit more about whether we should treat space and time separately.
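Backing up to the FLOP figures quoted a moment ago, the comparison in one line (numbers as stated in the lecture):

```python
# GFLOPs per forward pass, as quoted above.
gflops = {"AlexNet": 0.7, "VGG-16": 13.6, "C3D": 39.5}
ratio = gflops["C3D"] / gflops["VGG-16"]
print(f"C3D costs {ratio:.1f}x a VGG-16 forward pass")  # → 2.9x
```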
[00:40:09] Spatial information and temporal information are indeed very different things, so maybe we should explicitly try to model the thing that exists only temporally, which is motion. We humans actually do an incredible job of processing motion. So take a guess: what actions are the humans doing in this simple video? You can say it out loud if you want. [00:40:40] What is this? Sitting? Yes. Just from these very few points you can actually do a pretty good job of recognizing what actions this person, or maybe these two people, are doing. [00:41:02] Notice there is no appearance information at all, just a few points, just motion, and we can still get a very good understanding of the activities going on in these videos. So how we process appearance and how we process motion might be very different.
[00:41:19] Maybe we should have separate networks to process them. That is indeed the motivation for a piece of work introduced in 2014 that proposed a two-stream network to process appearance information and motion information separately. One way to explicitly measure motion is the concept called optical flow. The idea of optical flow is to measure the motion of pixels across adjacent frames: for every pixel in the first frame, how is it going to move in the second frame? It computes a velocity for points within the frames and provides an estimate of where each point will be in the next frame of the sequence. [00:42:09] For example, between frames t and t + 1, the flow field has two dimensions and tells us where each pixel will move in the next frame: F(x, y) = (dx, dy), such that I_{t+1}(x + dx, y + dy) = I_t(x, y), i.e., the pixel at (x, y) in the current frame ends up at (x + dx, y + dy) in the next frame. So this is an explicit way to measure the motion of pixels. There are many research papers on how to actually compute optical flow given a pair of frames, under different kinds of assumptions; some work, for example, assumes that brightness stays constant as things move, and proposes techniques to compute the flow under that assumption.
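A toy check of the flow relation above, under the simplest possible assumption (my own synthetic example): a constant flow field where every pixel moves by (dx, dy) = (2, 1), so the next frame is just a shifted copy of the current one and the brightness-constancy equation holds exactly.

```python
import numpy as np

# Toy check of I_{t+1}(x + dx, y + dy) = I_t(x, y) for a constant flow
# field: every pixel moves by (dx, dy) = (2, 1), so the next frame is a
# shifted copy of the current one.
rng = np.random.default_rng(0)
I_t = rng.random((32, 32))                          # current frame
dx, dy = 2, 1
I_t1 = np.roll(I_t, shift=(dy, dx), axis=(0, 1))    # rows are y, columns x

x, y = 10, 20
assert np.isclose(I_t1[y + dy, x + dx], I_t[y, x])  # brightness constancy
print("brightness constancy holds for this synthetic pair")
```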
But once you get it, it basically captures the motion information for two [00:43:04] captures the motion information for two adjacent frames. And also you can [00:43:06] adjacent frames. And also you can because there are two dimensions, right? [00:43:08] because there are two dimensions, right? Because it's trying to uh capture how [00:43:10] Because it's trying to uh capture how pixels move horizontally and vertically. [00:43:13] pixels move horizontally and vertically. So you can actually also visualize it [00:43:15] So you can actually also visualize it separately. You can visualize it the [00:43:16] separately. You can visualize it the horizontal motion horizontal flow dx and [00:43:19] horizontal motion horizontal flow dx and also you can visualize the vertical flow [00:43:21] also you can visualize the vertical flow dy. You can see that there capture some [00:43:23] dy. You can see that there capture some you know horizontal motion and the [00:43:24] you know horizontal motion and the vertical motion. Uh so we capture this [00:43:26] vertical motion. Uh so we capture this kind of low-level motion cues. So once [00:43:29] kind of low-level motion cues. So once you have a way to capture this kind of [00:43:30] you have a way to capture this kind of motion cues as optical flow uh then [00:43:32] motion cues as optical flow uh then people trying to you know propose a [00:43:34] people trying to you know propose a two-stream networks to se uh to train a [00:43:37] two-stream networks to se uh to train a motion classifier and appearance [00:43:38] motion classifier and appearance classifier. So this is a famous [00:43:40] classifier. So this is a famous twostream network for action [00:43:41] twostream network for action recognition. So basically it has a one [00:43:43] recognition. 
[00:43:46] Basically it has a single-frame model doing appearance classification to tell what the action is, and then a separate temporal stream that takes multi-frame optical flow: for every two adjacent frames it computes the optical flow map, treats the horizontal flow and the vertical flow separately and stacks them together, processes them with a temporal-stream convolutional neural network to make a prediction, and then aggregates the prediction results from both the motion stream and the appearance stream into a final prediction. So that's the idea of the two-stream network, and it actually works pretty well. This is on another dataset called UCF-101; there are 101 action categories in this dataset.
[00:44:35] One surprising thing you can see is that using only motion actually works surprisingly well. If you compare the performance of the 3D ConvNet, the spatial-only model (just the appearance stream), and the temporal-only model (the motion stream), you can see that the motion stream works much better than the spatial-only stream. My hypothesis is that it's less easy to overfit: in the appearance there is a lot of background information which may not be important for action classification, but the motion stream contains the cue, the key information, the movements, which is harder to overfit to, so you can get better results on this dataset.
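The final aggregation step of the two streams can be sketched as simple late fusion: each stream outputs class scores and their softmax probabilities are averaged. The class count and logit values below are made-up stand-ins, not numbers from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical per-stream logits over 5 action classes, standing in for the
# outputs of the spatial (appearance) and temporal (flow) ConvNets.
spatial_logits  = np.array([2.0, 0.5, 0.1, -1.0, 0.0])
temporal_logits = np.array([1.5, 2.5, 0.0, -0.5, 0.2])

# Late fusion: average the class probabilities of the two streams.
fused = 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))
print(int(np.argmax(fused)))   # index of the fused prediction
```

The original work also reports training an SVM on stacked stream scores; averaging is the simplest variant.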
[00:45:26] So far we have been talking about short-term structure in videos. Earlier folks were asking how many frames we should actually use for classification, and it's definitely very important to model long-term temporal structure to recognize things more distant in time. We already have the tools to handle sequences: recurrent networks, which we used to process sequences of words for things like captioning and prediction tasks. So we can use similar tools, recurrent neural networks, on top of a convolutional network, no matter whether it's a single-frame convolutional
[00:46:25] network giving a 2D feature vector or a 3D convolutional network giving a feature vector per clip. If you have a much longer video, we can extract these feature vectors and then just use the RNNs or LSTMs we have talked about to model the long-term temporal structure, right? We process the local features with a recurrent network and maybe make a final prediction at the last time step. If we want a single video-level classification, we just do a many-to-one mapping:
[00:46:58] one output at the end of the video. Or we can do a one-to-one mapping like we talked about: for each frame we make a prediction, since there may be some predictions we want for every video frame, and we get an output from the LSTM or recurrent network at each time step. This kind of idea was actually already explored in 2011, which was way ahead of its time, since AlexNet was only introduced in 2012, but it was popularized by a 2015 paper. If you want to train this kind of recurrent architecture for modeling long-term temporal structure, you can often backpropagate only through the RNN layers: you can freeze the CNNs, pretrain them on some clips or on image classification, because otherwise you
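A minimal sketch of the many-to-one pipeline just described (all sizes are mine, and the per-frame features here are random stand-ins for what a frozen, pretrained ConvNet would produce):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend a frozen ConvNet has already turned each of T frames into a
# D-dimensional feature vector; a vanilla RNN then models long-term
# temporal structure and a classifier reads only the final hidden state.
T, D, H, C = 8, 16, 32, 10          # frames, feature dim, hidden dim, classes
feats = rng.normal(size=(T, D))     # stand-in for per-frame CNN features

Wxh = rng.normal(size=(D, H)) * 0.1
Whh = rng.normal(size=(H, H)) * 0.1
Why = rng.normal(size=(H, C)) * 0.1

h = np.zeros(H)
for t in range(T):                  # many-to-one: keep only the last state
    h = np.tanh(feats[t] @ Wxh + h @ Whh)
logits = h @ Why                    # one video-level prediction
print(logits.shape)                 # (10,)
```

For the one-to-one variant you would instead read out `h @ Why` at every time step to get a per-frame prediction.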
[00:47:56] have a huge network, a recurrent part plus a convolutional part, and it's very hard to train end to end. So you can just use the CNN or C3D as a feature extractor and train the recurrent network on top. So we have already seen two approaches to model the temporal structure. How about we combine these two approaches, convolutional networks and recurrent networks? Both of them have some advantages, so maybe we can combine them in a single architecture to process video data, right?
[00:48:30] Indeed, we can take some inspiration from the multi-layer recurrent networks we have talked about: each time step takes the hidden state from the previous time step in the same layer, and also the output from the same time step in the previous layer. That's basically the idea of a multi-layer RNN. Similarly, we can do it for videos: we can use recurrent convolutional neural networks. It's very similar, except now we build a grid of feature maps, where each one is a three-dimensional tensor, with two spatial dimensions and one channel dimension. So each feature map is of dimension C × H × W.
[00:49:14] Each feature map depends on two inputs: the feature map from the same layer at the previous time step, and the feature map from the previous layer at the same time step, right? Recall that in a 2D convolutional network we just map an input feature map to an output feature map; here, for the recurrent convolutional network, we take as input these two 3D tensors, one from the same layer at the previous time step and one from the previous layer at the same time step. And recall that a recurrent network has this form: it has some hidden feature map h_{t-1}, and it takes the input at the current time step.
[00:50:06] It applies some function with parameters W and produces the new hidden state h_t. That's basically the key of an RNN. Now, instead of this vector form of the RNN, we just replace all the matrix multiplications in the recurrent network with 2D convolutions, and we get a recurrent convolutional network. You take the hidden feature map and do a 2D convolution instead of a matrix multiplication to get another feature map; you do the same for the features from the previous layer at the same time step; then you add them together, apply a tanh, and you get the feature map for the current hidden layer. That's basically the idea of the recurrent convolutional network: we combine convolution operations and recurrent
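The update just described, with the RNN's matrix multiplies swapped for 2D convolutions, can be sketched like this (single-channel maps and a hand-rolled 3×3 "same" convolution, purely for illustration):

```python
import numpy as np

def conv2d(x, w):
    """'Same' 2D convolution (cross-correlation), single channel, 3x3 kernel,
    zero padding -- just enough machinery to show the recurrence."""
    H, W = x.shape
    xp = np.pad(x, 1)
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

def rec_conv_step(x_t, h_prev, Wx, Wh):
    """One recurrent-ConvNet step: h_t = tanh(conv(x_t) + conv(h_{t-1})),
    i.e. the vanilla-RNN update with matmuls replaced by 2D convolutions."""
    return np.tanh(conv2d(x_t, Wx) + conv2d(h_prev, Wh))

rng = np.random.default_rng(1)
Wx = rng.normal(size=(3, 3)) * 0.1
Wh = rng.normal(size=(3, 3)) * 0.1
h = np.zeros((8, 8))
for t in range(4):                       # unroll over 4 "frames"
    h = rec_conv_step(rng.normal(size=(8, 8)), h, Wx, Wh)
print(h.shape)                           # (8, 8)
```

A real version would use multi-channel C × H × W maps and learned kernels; the structure of the update is the same.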
[00:50:59] operations. We can actually do this for any recurrent network variant, like the GRUs and LSTMs you may have learned about in a previous class. So now we can successfully combine the benefits of the two: we have both spatial and temporal fusion inside this recurrent convolutional network. But this model was not used too much, because there is one large downside of recurrent neural networks, which you have already learned: RNN units are very slow at processing long sequences, and videos are usually very, very long. You want to process them in parallel, but RNNs are very hard to parallelize. But there is another important model you have learned about, I think in the previous lectures: we can also use operations
[00:51:53] like self-attention to process videos, right? For self-attention you have queries, keys, and values, and you can use self-attention as a standalone operation to process images; we can also do it for videos. One very large advantage of self-attention is that it is highly parallelizable: the attention scores for all the inputs can be computed completely in parallel.
[00:52:23] So indeed people have tried to use self-attention in videos too: they just apply self-attention directly in 3D. Maybe you have some 3D convolutional network and you get a feature map of shape C × T × H × W. Then, to get the query feature map, you can use 1×1×1 3D convolutions to change the channel dimension and map it to a query feature map of shape C′ × T × H × W; similarly you get a feature map for the keys and one for the values. Then you want to compute attention weights, right? Basically you transpose the query feature map and take pairwise dot products, so you get an attention score for each query-key pair; you get this attention map and then use it to weight the values,
And you can uh them to you can [00:53:15] right? And you can uh them to you can get another value kind of feature [00:53:17] get another value kind of feature feature map and then you can map them [00:53:20] feature map and then you can map them you do another one times one time one [00:53:21] you do another one times one time one convolution to map them back to you know [00:53:23] convolution to map them back to you know the same dimension C so that you can be [00:53:25] the same dimension C so that you can be concatenated with the original feature [00:53:27] concatenated with the original feature input. So that is a resid connection. So [00:53:30] input. So that is a resid connection. So in total you can see that it's very [00:53:32] in total you can see that it's very similar to the self attention uh uh [00:53:35] similar to the self attention uh uh operations but now we move things to 3D [00:53:37] operations but now we move things to 3D and this is some one block that is very [00:53:40] and this is some one block that is very you know independent it can stand on its [00:53:42] you know independent it can stand on its own right you can so that's so in this [00:53:44] own right you can so that's so in this uh paper it's called looo neuronet [00:53:46] uh paper it's called looo neuronet network it introduces kind of block and [00:53:48] network it introduces kind of block and call local block you can use it as uh a [00:53:51] call local block you can use it as uh a kind of building block for uh processing [00:53:53] kind of building block for uh processing videos to do video understanding for [00:53:55] videos to do video understanding for example you can just add this unknown [00:53:57] example you can just add this unknown local blocks uh into existing 3D [00:54:00] local blocks uh into existing 3D convolutional network architectures and [00:54:02] convolutional network architectures and uh to you know to have some 3DC and have [00:54:05] uh to you know to have 
[00:54:07] some 3D conv layers with a non-local block, then another block of 3D convs with another non-local block, and so on; each non-local block is very powerful at fusing across both space and time, and finally you do the classification. But one thing we haven't talked about is what this 3D convolutional network should be, what we should use here. Another very interesting idea people have explored in the past is: can we reuse the many successful 2D convolutional network architectures we have talked about, directly in 3D? We can just do an inflation of the 2D network, and then we get a 3D convolutional network. This work is called the I3D architecture. The idea is that they take a 2D architecture and replace each 2D
[00:55:03] conv or pooling layer, a layer that originally has kernel size k_h × k_w, with a 3D version of size k_t × k_h × k_w; they just inflate it, basically, and they do this on top of the Inception architecture. After doing this inflation you have an architecture for processing videos directly, just reusing the existing architectures; now we can transfer an architecture that works pretty well in 2D to also work in 3D. But taking one step further, people have also been trying things where not only can we transfer the architectures, we can actually also transfer the weights, because we have already pretrained a lot of models on image datasets; maybe we can actually use the weights we have learned there, since they carry some good prior information. So
[00:56:07] one thing you can do is initialize the inflated CNN with weights trained on images. Originally you have a 2D conv kernel; you just copy the kernel k_t times along the time dimension and divide it by k_t. Originally the network takes a single image as input; now it takes a video of shape 3 × k_t × H × W as input. Because we have divided by k_t when copying the weights k_t times, you get the same output whether you input a single frame or a video of constant frames. So now we have a way to recycle these existing 2D image-based architectures and the weights from 2D image understanding.
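The inflation trick can be checked numerically: copy a 2D kernel k_t times, divide by k_t, and the response on a temporally constant video matches the 2D response on a single frame. This is a self-contained sketch of that identity, not the actual I3D code:

```python
import numpy as np

rng = np.random.default_rng(3)
kt, kh, kw = 3, 3, 3
w2d = rng.normal(size=(kh, kw))           # a pretrained 2D conv kernel

# I3D-style inflation: copy the 2D kernel kt times along time, divide by kt.
w3d = np.repeat(w2d[None, :, :], kt, axis=0) / kt

# A "video" patch whose kt frames are all the same image patch.
patch = rng.normal(size=(kh, kw))
video_patch = np.repeat(patch[None, :, :], kt, axis=0)

# Single-location conv responses: the inflated 3D filter reproduces the
# 2D filter's response exactly on the temporally constant input.
resp2d = np.sum(patch * w2d)
resp3d = np.sum(video_patch * w3d)
print(np.isclose(resp2d, resp3d))         # True
```

This is why the copied-and-rescaled weights give a sensible starting point before fine-tuning on video data.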
[00:57:05] And it actually works pretty well. If you look at the performance, the inflated network has better performance than the two-stream convolutional network, and you can inflate not only the appearance stream but also the motion stream, which gets you some further improvements. Basically this is just a technique you can apply, independent of the particular 3D convolutional network.
[00:57:29] You can also build in these non-local blocks. But what I'm trying to say in this part is that we have a lot of 2D convolutional networks whose weights people have shown to be very successful, and if we want to reuse them, people have shown that you can actually copy the weights and reuse them directly to operate on videos. So that's basically the highlight idea. And after doing this initialization you can still fine-tune on the video data, but you have the pretrained weights from images, which gives you a good initialization for training the video models. So that is the idea of the I3D network: basically copy the weights and do the inflation. Okay.
[00:58:23] So this is just one example of a video understanding model, and many other video transformer models have been proposed for video understanding. For example, the work on space-time attention does more factorized attention, decomposing it across space and time; other methods try to make the transformer architecture more efficient; and there are masked-autoencoder approaches, which you have heard about, for more efficient, scalable video-level pre-training for video understanding.
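As a rough illustration of what factorizing attention across space and time means (in the spirit of divided space-time attention; this NumPy sketch omits learned projections, heads, and residual connections, and all names are mine):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    # Plain self-attention with identity projections: (..., L, D) -> (..., L, D)
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def divided_spacetime_attention(x):
    """x: (T, N, D) tokens for T frames of N patches each. Attend over time
    for each spatial location, then over space within each frame; cost is
    O(N*T^2 + T*N^2) instead of O((T*N)^2) for joint space-time attention."""
    x = attend(np.swapaxes(x, 0, 1))   # (N, T, D): temporal attention per patch
    x = attend(np.swapaxes(x, 0, 1))   # (T, N, D): spatial attention per frame
    return x

tokens = np.random.default_rng(1).standard_normal((8, 16, 32))
out = divided_spacetime_attention(tokens)
```

The efficiency argument is in the attention-matrix sizes: joint attention builds a (T*N) x (T*N) matrix, while the factorized version only ever builds T x T and N x N matrices.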
[00:58:57] I'm not going to talk about them here in class, but if you are interested you can check out the papers, because a lot of progress has been made toward better video understanding models. If you look at the progression of performance, we start from a single-frame model at 62.2 on Kinetics-400, which is another large video dataset, and the video masked autoencoder already gets to about 90% accuracy, with other new transformer models proposed since. So we are doing very well at classifying videos. [00:59:39] And similar to image classification in the last class, we can use similar tricks for visualizing video models.
[00:59:46] We can take the two-stream network as an example: we randomly initialize the appearance image and the flow image, do a forward pass, compute the score, and then back-propagate with respect to the score of a particular class, using gradient ascent to maximize the classification score, just like we did when visualizing the image-based models. In this way we can visualize and interpret what has been learned. On the left is the optimized image for the appearance stream; maybe it's hard to guess what is happening there. On the right is the optimized image for the flow stream, which has some temporal constraints on the
temporal stream so that it does not change too fast; [01:00:41] one version captures slow motion and the other captures faster motion. So you can guess what the action is; maybe in this case it's pretty clear. So what action is this? [01:00:52] It's weight lifting. You can see that the middle one is doing some bar shaking, right? And the right one is doing some overhead pushing motion. So indeed you can see that these video action models are learning something about motion. [01:01:13] Okay.
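The visualization-by-gradient-ascent recipe just described can be condensed into a toy sketch. This is NumPy with a stand-in linear "classifier" so the gradient is available in closed form; in the lecture's setting, the score and its gradient come from the trained two-stream network instead:

```python
import numpy as np

def class_visualization(grad_fn, x0, steps=200, lr=0.1, reg=1e-2):
    """Gradient ascent on the *input* to maximize a class score, with a
    small L2 penalty, the same trick used for image-model visualizations."""
    x = x0.copy()
    for _ in range(steps):
        x += lr * (grad_fn(x) - reg * x)
    return x

# Toy stand-in model: linear score s(x) = w . x, so ds/dx = w everywhere.
rng = np.random.default_rng(0)
w = rng.standard_normal(64)
x_opt = class_visualization(lambda x: w, np.zeros(64))
score = w @ x_opt
```

For a real video model, `grad_fn` is one backward pass through the network, and the optimized "input" is the appearance image or the flow stack being visualized.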
[01:01:18] So far I have been talking about how we can classify short clips: swimming, running, and so on. But another very important task is called temporal action localization: not only do we want to do clip-level classification, sometimes we want to localize, just as in object detection, where in the video the action is happening; maybe sometimes the person is running and sometimes jumping. For this you can use ideas similar to Faster R-CNN: generate temporal proposals and then do the classification. [01:02:03] And you can also do both at once, spatio-temporal detection, where you
want to localize the action not only in space but also in time: where it is happening spatially and when it is happening temporally. [01:02:16] So this is another task, called spatio-temporal detection. [01:02:20] Okay. So far I have been talking about the temporal stream and the architectures we can use: 3D CNNs, two-stream neural networks, spatio-temporal self-attention. We have already talked about some tools to do that. In the final 10 minutes, and I hope to finish in time, let's revisit the example we started with today. I showed you a video, right? But that's still maybe not the full picture. [01:03:10] Looking at a video, doing video understanding, there is another very important dimension that we have never covered until now. That is this.
[01:03:18] There's sound, there's audio; there are other modalities in videos. If we miss that ingredient, we lose a lot of the fun: there are emotions you can perceive and other interactions you can handle if you combine the visual and the audio. So with this audio in mind, alongside the vision stream, people have proposed many other interesting tasks, and we have explored other tasks, for video understanding.
[01:03:42] Here's another example: in videos there may be multiple objects and multiple speakers, and one example task, which I have personally explored in the past, is visually guided audio source separation. Trying to process things both visually and acoustically, you can use the visual information to guide source separation: originally there is a mixture, and you want to use the visual information to separate it into its sound components. This is called visually guided source separation. To give you an example of this task: here is a speech mixture. Maybe we want to hear the sound of each person individually, right?
[01:04:21] Then we can process their visual and audio information together to separate their sounds. Here is what we can do: we can separate the voice of each speaker. And not only can we do this for people, for speech, where we process the audio and the visual streams; we can also do this for other types of sound, like musical instruments. Here's another example: we can even do musical instrument separation, by analyzing the motion and the object-centric information together with the audio stream to do the separation. Yeah.
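The core signal-processing step behind this kind of separation is predicting per-source time-frequency masks and applying them to the mixture spectrogram; in the visually guided version, a network conditioned on video features (lip motion, playing gestures) predicts the masks. Here is a minimal NumPy sketch that uses oracle "ideal ratio" masks in place of that learned predictor; all names are illustrative:

```python
import numpy as np

def ideal_ratio_masks(source_specs, eps=1e-8):
    """source_specs: (S, F, T) magnitude spectrograms of the S sources.
    Returns per-source masks in [0, 1] that sum to ~1 at each bin."""
    return source_specs / (source_specs.sum(axis=0, keepdims=True) + eps)

def apply_masks(mixture_spec, masks):
    """Estimate each source by pointwise-multiplying the mixture by its
    mask. A visually guided system would predict `masks` from video
    features rather than from the oracle above."""
    return masks * mixture_spec[None]

rng = np.random.default_rng(0)
sources = rng.random((2, 64, 100))     # two sources' magnitude spectrograms
mixture = sources.sum(axis=0)
estimates = apply_masks(mixture, ideal_ratio_masks(sources))
```

With oracle masks the estimates recover the sources almost exactly; the learning problem is entirely in predicting good masks from the audio-visual input.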
[01:04:50] So this is another example of this task. And once we introduce this new modality of audio, even when we just want to do video understanding and classification, audio can be a useful cue. Indeed, there is other audio-visual video understanding work, proposed from transformer, attention-based models: not only do we map images to patches, we also map the audio spectrograms to patches and use transformer architectures to do the classification. Or we can even take a masked-autoencoder style approach, predicting the patches for both the images and the spectrograms, to do video understanding. [01:05:29] Another aspect people have been exploring is how to do efficient video understanding, and I will just quickly give some examples.
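The shared "map it to patches" step for images and spectrograms is essentially a reshape; here is an illustrative NumPy version (the function name is mine):

```python
import numpy as np

def to_patches(x, p):
    """Split a 2D array of shape (H, W), an image channel or an audio
    spectrogram (frequency by time), into non-overlapping p x p patches,
    each flattened into a token of length p*p, as a ViT-style tokenizer
    does before the transformer consumes them."""
    h, w = x.shape
    assert h % p == 0 and w % p == 0
    x = x.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3)
    return x.reshape(-1, p * p)

spec = np.arange(64.0).reshape(8, 8)   # toy 8x8 "spectrogram"
tokens = to_patches(spec, p=4)         # 4 tokens, each of length 16
```

Because the same operation works on either modality, image patches and spectrogram patches can be embedded into a common token sequence for a joint transformer.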
[01:05:39] Throughout this class we have mostly focused on clip-level classification: given a clip, how to classify it. After we classify a lot of clips, we aggregate the information to get a video-level prediction; that's action recognition in long videos. Why do we want efficient video understanding? Because videos are very long; we cannot afford to process every clip one by one. So people try to increase the efficiency for a single clip, for example X3D tries to build better 3D convolutional networks, and there are also salient-clip samplers that try to predict which clips are the most salient and useful, so you can run your clip classifier only on those important clips and combine their predictions.
[01:06:30] People have also tried policy learning, predicting which modality to use for the action classification: we can select whether to use video, how many video clips, or whether to use audio or other sensory data. Here's one example: we can use audio as a preview mechanism to predict where the important moments are, and then use that as a guiding cue for which clips to process, averaging the results. So that's efficient video understanding, which is also one area of research.
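The preview idea, using a cheap signal to decide which clips deserve the expensive classifier, can be sketched as follows. This is a hypothetical toy, not any specific paper's method, and the clip logits are precomputed here purely for illustration:

```python
import numpy as np

def preview_then_classify(saliency, clip_logits, k):
    """Rank clips by a cheap per-clip saliency score (e.g. from the audio
    track), keep only the top-k clips, and average their classifier
    outputs for the video-level prediction."""
    top = np.argsort(saliency)[-k:]     # indices of the k most salient clips
    return clip_logits[top].mean(axis=0)

saliency = np.array([0.1, 0.9, 0.3, 0.8])
clip_logits = np.array([[0.0, 1.0],
                        [2.0, 0.0],
                        [1.0, 1.0],
                        [4.0, 0.0]])
video_pred = preview_then_classify(saliency, clip_logits, k=2)
```

In a real system the saving comes from never running the heavy video model on the low-saliency clips at all.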
[01:07:12] And nowadays people are moving to VR and AR, to smart glasses, and I'm guessing that in the future there will be a lot of egocentric video streams; that's another aspect of video understanding. Not only do you have egocentric videos, you also have multi-microphone arrays and multi-channel audio. So how to do better video understanding from these multimodal egocentric video streams is also a hot topic. We have explored this: we can process the video streams, the multi-channel audio, and the visual information to predict who is speaking to whom and who is listening to whom. Imagine that in the future you wear these smart glasses and want to use them to help you understand these different types of
social interactions. [01:07:58] So that's egocentric video understanding. And my final slide: with LLMs right now, there is also a lot of ongoing work trying to build video-level foundation models. How do we connect video understanding to LLMs? Indeed, there are works that tokenize the videos and map them into the LLM embedding space, so you can prompt the video foundation model, asking where the person is or what the person is doing in the video, and it outputs text describing the video. So there are many works trying to connect video understanding and LLMs; that's also a hot topic right now. ================================================================================ LECTURE 011 ================================================================================ Stanford CS231N | Spring 2025 | Lecture 11: Large Scale Distributed Training Source: https://www.youtube.com/watch?v=9MvD-XsowsE --- Transcript [00:00:05] Welcome back to CS231N, lecture 11.
[00:00:08] Today we're going to talk about large-scale distributed training. This is a pretty exciting topic, because it is basically how all neural networks get trained in practice today. When you look at large models from startups, from industry, even from academia, large scale is the new norm in deep learning nowadays. That's something that has changed quite a lot in the 10 years since we started this class. Ten years ago it was actually pretty common to train models basically on one GPU, one device, and it was fairly uncommon to train on multiple devices. But as we'll see, nowadays the new norm is to train models on tens, hundreds, thousands, even tens of thousands of devices concurrently, so we need to develop new algorithms and new ways of thinking in order to do that.
[00:00:50] As a bit of a running example through today's lecture, we're going to be talking a lot about Llama 3 405B. Not because it is the best model or the most interesting model, but because it is a fairly close-to-state-of-the-art model that actually shares a lot of the implementation details of how it was trained, the model architecture, everything like that. There are a lot of really amazing, powerful models that have been trained in the last couple of years, from Google, from OpenAI, from Anthropic, from others, but basically they don't share any details whatsoever about their models anymore. There's a very famous quote that marked a sea change in the industry to me, in the GPT-4 paper back in 2023.
[00:01:24] When they released the GPT-4 model, they said: given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report (meaning the paper they wrote about GPT-4) contains no further details about the architecture, including model size, hardware, training compute, dataset construction, training method, or similar. And that has basically been the state of the art for large-scale models in the three years since GPT-4: they don't tell you anything about anything. They'll tell you nothing about the model; you'll be lucky if they tell you it's a transformer. They might tell you that much. [00:01:57] So Llama 3 is notable not because it's the best model out there, but because it's one of the most open models out there.
[00:02:03] This is a large language model that was trained by Meta and released open source about a year ago, in April 2024. And unlike OpenAI, the paper actually does share a lot of the details about the model training: not too much about the dataset, but a lot about the system infrastructure that was used to train it. So this gives us a peek into how large-scale LLMs are actually trained these days. And by the way, there is a new Llama 4 model that just came out from Meta last month, in April 2025, so there are slightly better models out there in open source already. But there's no paper on Llama 4 yet, so I'm excited to read that one, hopefully when it comes out in a couple of months, and see what we can learn from the new generation of Llama training.
[00:02:44] As a running example through today's lecture, we'll be pointing at a lot of examples from the Llama 3 405B model, for this reason. [00:02:51] Okay. So there are basically two things that I want to talk about today. One is a bit about GPU hardware, and the other is how to train on lots of GPUs. I want to give you a sense both of what the hardware that these things execute on actually is, and of the algorithms that we need to use in order to train on a lot of them. So first we're going to talk a little bit about GPU hardware. A GPU, for those of you who don't know, is a graphics processing unit. These were specialized co-processors originally developed for computer graphics, and they turned out to become very useful, generalizable parallel processors.
[00:03:23] It's actually very fitting to be giving this lecture in this room, because this is the Huang auditorium. Jensen Huang is the CEO and founder of Nvidia, which is sort of the biggest company right now, and has been for the last couple of decades, in producing GPUs both for gaming and for ML. These things started off basically for graphics, because if you think about it, when you're doing computer graphics you need to generate a lot of pixels on the screen, and you need to process lots of little pieces of primitive geometry to produce those pixels. So it's very natural to do a lot of computation in parallel when you're doing computer graphics. People quickly figured out that this hardware, which had been built and intended for use in computer graphics, could actually be used for much more general pieces of parallel computation as well.
[00:04:05] So in the early days, in the early 2000s, researchers figured out how they could contort these graphics cards into doing generalizable parallel programming. And then moving toward the end of the 2000s and into the 2010s, Nvidia really picked this up: they developed these things, marketed them, and built them with the intention of their being generalized parallel processors. They didn't quite know at the time what exactly they were going to be used for; I think they had this general idea that parallel processing was going to be important, and they really capitalized on deep learning when it started to take off in the early 2010s.
[00:04:38] Much to Nvidia's credit, I think they realized the potential of this research area very early, even in the early 2010s, and started putting a ton of resources into making sure that their hardware was really useful for deep learning training. And it's basically been the main way that people train large-scale deep learning models for more than a decade now. That's starting to change, as we'll see a little bit, but their chips are kind of the main ones that people use. I always like looking inside these things and seeing what's in them. So this is a picture of the Nvidia H100, which is sort of the mainstay of deep learning training right now. There's a next generation that just came out, but it's not really accessible yet; I haven't trained anything on it yet.
[00:05:22] So this is kind of the state of the art right now. Inside this H100 GPU, in the middle here are the compute cores, and surrounding that are 80 GB of HBM memory, high-bandwidth memory. You can see the memory is separated from the compute cores; they need to talk to each other over this bus to move data back and forth from the GPU memory into the cores. And it can do that at a speed of about 3 terabytes per second, which is a lot of bits moving around. Now, if we dive deeper inside the GPU cores, we see that in the middle, in that compute core part, we've got a smaller piece of memory, about 50 megabytes of L2 cache. That is much, much smaller than the 80 GB of HBM memory, but it's very, very close to the actual computing elements, so it can be accessed much more quickly from the compute cores.
[00:06:05] And then the real heart of the thing are these 132 streaming multiprocessors, or SMs. These are kind of like independent parallel cores. They're a little bit more powerful in some ways than a typical CPU core, because they can do a lot more parallelism, but they're also a lot weaker than a typical CPU core in a lot of ways, because they tend to have slower clock speeds and they can't do as much instruction prediction, as much branch prediction. So it's really hard to make exact apples-to-apples comparisons between these GPU cores and CPU cores, but I usually think of these streaming multiprocessors as roughly akin to a CPU core. Also, I know someone's going to go back home and actually count all the little boxes on this screen, and you'll see that there are actually 144 of them, when I've said there are only 132. Why is that?
[00:06:48] It's because all GPU hardware uses a process called binning. These chips have so many transistors, so many little computing elements, that no matter how much money they pour into the process, they just don't come out perfectly; some of them always end up a little bit messed up. So they plan for that in the development of their products, and they say: we're going to try to make a chip that in theory has 144 SMs, and none of the chips will be perfect, but we know we'll get a reasonable number of them that have at least 132 functioning. By using this process of binning, they can actually sell a much larger proportion of the chips they try to produce, by only promising that 132 of the SMs will be turned on.
[00:07:29] Then we can dive even deeper, inside one of those streaming multiprocessors, and see even more of what's going on inside these GPUs. So this is just one of the 132 active streaming multiprocessors inside an H100, and there are a couple of interesting elements in here to look at. First, we see we have 256 kilobytes of L1 cache and register files. This continues the trend of the memory hierarchy in the GPU. You thought you were learning deep learning; you're actually learning computer architecture. Sorry, it's a surprise. It turns out that that memory hierarchy is really important for deep learning and for all kinds of high-performance computing. And the general trend is that you have larger bits of memory that are farther away from the compute cores.
[00:08:10] The closer you get to the compute cores, you have smaller bits of memory that are much, much faster. If you're writing the low-level algorithms that run on these things, it's very important to be aware of this memory hierarchy and to be very diligent in passing data between its different levels; if you're writing performant GPU kernels, you spend a lot of time trying to optimize that. So just to give you a flavor of that, you can see the three levels of memory hierarchy in the H100: 256 kilobytes of L1 cache, 50 megabytes of L2 cache, and then 80 GB of HBM memory. Those are the three primary levels of memory hierarchy in the H100. Then we've also got 128 of these FP32 cores.
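Before getting into the FP32 cores, the memory-hierarchy numbers just recapped can be put side by side. A minimal sketch in Python, using only the figures quoted in the lecture (the L1/L2/HBM sizes and the roughly 3 TB/s HBM bandwidth; everything else about the device is out of scope here):

```python
# The three levels of the H100 memory hierarchy, using the lecture's figures.
KB, MB, GB = 2**10, 2**20, 2**30

hierarchy = {
    "L1 cache + registers": 256 * KB,  # per streaming multiprocessor
    "L2 cache":             50 * MB,   # shared across the SMs
    "HBM":                  80 * GB,   # off-die high-bandwidth memory
}

# HBM talks to the compute cores at roughly 3 TB/s, so even just streaming
# all 80 GB through the cores once takes on the order of tens of milliseconds.
hbm_bandwidth = 3e12                   # bytes per second, approximate
stream_time = hierarchy["HBM"] / hbm_bandwidth
print(f"{stream_time * 1e3:.1f} ms")   # ~28.6 ms
```

The jumps in capacity between levels (roughly 200x from L1 to L2, and roughly 1,600x from L2 to HBM) are exactly what kernel authors have to plan data movement around.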
[00:08:54] These are little arithmetic units that can do generalized floating-point operations. In particular, each one of these 128 FP32 cores can compute ax + b, where a, x, and b are all scalars, and it can perform that bit of computation in one clock cycle. So if you add this all up: ax + b is basically one multiply and one addition, and you've got 128 of these cores, so this whole SM can do 256 floating-point operations per SM per clock cycle of the device. Then we also see, in red, where the real magic happens. In addition to these FP32 cores, there are also these four tensor cores. I think the name is a little bit of a misnomer; these are actually matrix cores.
[00:09:38] What each of these little tensor cores does: they are special circuits that are designed to do only one thing, matrix multiply. Each one of these little tensor cores can do a single chunk of matrix multiply. In particular, I believe on the H100 the first input matrix is 16x4, the second input matrix is 4x8, and then there's a bias matrix of size 16x8. So it basically does ax + b again, where a, x, and b are now little matrix chunks of this fixed size, and it can do that one little chunk of matrix multiply once per tensor core per clock cycle. So then if you multiply all these numbers out, you see that that little matrix multiply of that particular size is 1,024 floating-point operations, where we're counting each multiply and each add as a single floating-point operation.
[00:10:29] We multiply that by the four tensor cores in the SM, and we see that the entire SM, if it's going through the tensor cores, can do 4,096 floating-point operations per SM per clock cycle. And we need to compare this with the 256 that we can get from the FP32 cores. Here we see that the tensor cores are where all the magic happens; this is where the main throughput of the device comes from. If you're writing code that wants to run on these GPUs and make maximum usage of them, you need to make maximum usage of these tensor cores. Another interesting thing about these tensor cores is that they actually operate in mixed precision. Rather than traditional floating-point numbers, which are normally 32-bit, the tensor cores tend to use a mixed-precision procedure where the inputs are usually 16-bit.
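The flop accounting above can be checked mechanically. A small sketch (the names A, B, and C for the two inputs and the bias are just labels for this illustration; the shapes and core counts are the lecture's figures):

```python
# One tensor core computes A @ B + C with A 16x4, B 4x8, C 16x8.
M, K, N = 16, 4, 8

mults = M * N * K          # one multiply per (row, col, k) triple -> 512
adds = M * N * K           # (K-1) dot-product adds + 1 bias add per output -> 512
flops_per_tensor_core = mults + adds            # 1,024 per tensor core per clock
tensor_core_flops = 4 * flops_per_tensor_core   # 4 tensor cores -> 4,096 per SM

# The FP32 path: 128 cores, each doing one a*x + b (a multiply and an add).
fp32_flops = 128 * 2                            # 256 per SM per clock

print(tensor_core_flops, fp32_flops)        # 4096 256
print(tensor_core_flops // fp32_flops)      # 16x more throughput via tensor cores
```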
[00:11:12] And there are a couple of different interesting 16-bit formats that they can use, which we can't get into today. They'll do the multiplications in the lower-precision 16-bit and then do the additions, the accumulations, in the higher-precision 32-bit. So these tensor cores take a low-precision 16-bit input, do some of the intermediate computation, and produce the outputs in a higher-precision 32-bit. And this is important, because at the PyTorch layer, if you forget to cast your model into 16-bit, it will run on the FP32 cores instead and will be 20 times slower than you expect. So this seems like a little bit of minutiae, but it becomes very tangible when you mess up those data types in your PyTorch code.
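To make the "accumulate in 32-bit" point concrete, here is a small numerical sketch using NumPy's float16 on the CPU as a stand-in for the hardware behavior (an illustration of the rounding effect, not actual tensor-core code). Above 2,048, consecutive FP16 values are 2 apart, so adding 1.0 to an FP16 accumulator sitting at 2,048 rounds straight back down:

```python
import numpy as np

# Sum 4,096 ones two ways: accumulating in float16 vs. in float32.
ones = np.ones(4096, dtype=np.float16)

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for x in ones:
    acc16 = np.float16(acc16 + x)  # pure low-precision accumulation
    acc32 += np.float32(x)         # fp16 inputs, fp32 accumulator

print(acc16)  # 2048.0  (stuck: 2048 + 1 rounds back to 2048 in fp16)
print(acc32)  # 4096.0  (exact)
```

This is exactly the failure mode the mixed-precision design avoids: the multiplies can tolerate 16-bit, but long sums cannot.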
[00:11:56] So GPUs are really fast, and it's really crazy just how much faster they've gotten over the past decade or 15 years or so. When I first started my PhD and was working on deep learning, the state-of-the-art GPU that we were all using was this K40 GPU, which was released back in 2013. And this thing could do just 5 teraflops of FP32 compute for the whole device. All right, so I should explain the graph. The x-axis is time, ranging from about 2013 up to the present day, and the y-axis is the peak throughput of each of these devices, measured in teraflops per second per device. And you can see the graph goes up a lot.
[00:12:37] But there's something salient to notice here: going from the K40 to the P100, something really amazing happened with the V100, which came out toward the end of my PhD, around 2016 or 2017. The V100 was the first device that introduced these tensor cores. And since then, more recent devices have gotten more tensor cores, bigger tensor cores, more of the device area allocated to tensor cores, and this has resulted in a gigantic increase in the throughput of these devices over the past 10 or 15 years. The most recent device is this B200, which was formally announced and is slowly rolling out now. This one in theory has about 83.3 teraflops per second of FP32 compute, and 5,000 teraflops per second, in theory, of mixed-precision compute on the tensor cores.
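Taking the two endpoints at face value (about 5 TFLOP/s for the K40 in 2013, about 5,000 TFLOP/s of tensor-core mixed-precision compute for the B200, and noting that this is not an apples-to-apples precision comparison), a quick back-of-the-envelope gives the implied growth rate:

```python
# Roughly 1,000x in ~12 years, per the lecture's endpoints.
k40_tflops, b200_tflops, years = 5.0, 5000.0, 12

ratio = b200_tflops / k40_tflops           # 1000x overall
annual_growth = ratio ** (1 / years) - 1   # compound annual rate
print(ratio, round(annual_growth, 2))      # 1000.0 0.78 -> ~78% per year
```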
[00:13:30] So if you step back, we've literally been living through a 1,000-fold increase in computation over the past 12 years, and that's just at the per-device level. So if you want one explanation of why AI has gotten so good in the last 10 years, what has happened, this is the answer: there's now a source of computation that we're taking advantage of, and it's gone up by 1,000x in the last decade. Anytime anything in the world changes by 1,000x, you should step up and pay attention, because that's going to cause major changes in our technological capabilities. And this 1,000x improvement, I think, is the major driver of improvement in deep learning over the past decade. [In response to a question:] It does not have 5,000 tensor cores; that's 5,000 teraflops of compute on the tensor cores. Okay. Yeah.
[00:14:14] So we always try to distinguish between the compute on the tensor cores versus the compute on the FP32 cores. Right, so this is already crazy, right? It's already crazy that there's been a 1,000x increase in a device that you can hold in your hands. I've held a K40 in my hands; I've not had the opportunity to hold a B100, but they feel like the same physical object. It's about the same size, about the same weight, it kind of looks the same, but the one from today is 1,000 times faster than the one from 12 years ago. That's insane. But it gets even crazier, because we don't train on one GPU, right? I said that when the K40 first came out in 2013, it actually was common to train a lot of models on just one GPU. But today, we're training not just on one GPU.
[00:14:56] We're training on thousands, tens of thousands, sometimes hundreds of thousands of GPUs, all working together to train one model. So stack that on top of this 1,000-fold increase in per-device throughput, and something truly insane has happened in the past decade. So now we've looked inside the GPU. From here I want to zoom out and put that GPU in context, not looking at individual devices but thinking about the modern GPU clusters that we build that stitch a lot of these things together. We've already seen a single H100 GPU, and here we can think of it as another level of memory hierarchy.
[00:15:32] We already saw that inside the H100 there were three layers of memory hierarchy, and as you got farther away from the compute elements, the memory bandwidth, the ability of the device to move bits around between different parts of the system, got slower. This trend actually continues once you escape the bounds of a single device and imagine these in the context of a full data center. Here we saw that a single H100 GPU gets about 3 terabytes per second of memory bandwidth; that's the GPU talking from its own HBM memory to its own compute elements, moving bits around at 3 terabytes per second. But these things typically live inside a GPU server. Almost all GPU servers have eight devices in one big box, and those GPUs can talk to each other.
Um and they typically talk to each other at [00:16:14] and they typically talk to each other at a rate of about 900 GB per second from [00:16:16] a rate of about 900 GB per second from any one GPU in the server to any other [00:16:19] any one GPU in the server to any other GPU in the server. So you can see that's [00:16:21] GPU in the server. So you can see that's like a 3x less memory communication [00:16:23] like a 3x less memory communication bandwidth compared to the GPU talking [00:16:25] bandwidth compared to the GPU talking from in inside one device. Um and here [00:16:29] from in inside one device. Um and here things here we again turn to llama 3. Um [00:16:32] things here we again turn to llama 3. Um a lot of major players don't publish a [00:16:34] a lot of major players don't publish a lot of details on their training [00:16:36] lot of details on their training clusters but the llama 3 technical [00:16:37] clusters but the llama 3 technical report did actually give a lot of [00:16:38] report did actually give a lot of details around their training clusters. [00:16:40] details around their training clusters. So from here some of the specifics [00:16:42] So from here some of the specifics probably vary a little bit from cluster [00:16:43] probably vary a little bit from cluster to cluster. Um but these are now numbers [00:16:45] to cluster. Um but these are now numbers from the llama 3 cluster that was used [00:16:47] from the llama 3 cluster that was used to train their their their models. Um, [00:16:49] to train their their their models. Um, so they given one GPU box, they stack [00:16:53] so they given one GPU box, they stack two of those box into one server rack. [00:16:55] two of those box into one server rack. 
Um, and a server rack, if you haven't [00:16:56] Um, and a server rack, if you haven't seen it, they're, you know, about 6 ft [00:16:58] seen it, they're, you know, about 6 ft tall, like about the size of a person to [00:17:00] tall, like about the size of a person to just kind of get a mental picture of one [00:17:02] just kind of get a mental picture of one of those things. So one server rack has [00:17:04] of those things. So one server rack has two servers inside of it. Total of 16 [00:17:06] two servers inside of it. Total of 16 GPUs. [00:17:07] GPUs. Then we connect a lot of server racks [00:17:09] Then we connect a lot of server racks together into a GPU pod. Um, the Llama 3 [00:17:12] together into a GPU pod. Um, the Llama 3 cluster has GPU pods that are composed [00:17:14] cluster has GPU pods that are composed of 192 racks. um and which is a total of [00:17:17] of 192 racks. um and which is a total of 3,72 GPUs. And these things have really [00:17:20] 3,72 GPUs. And these things have really high bandwidth connectors between all [00:17:22] high bandwidth connectors between all the different racks. Um and as a result [00:17:24] the different racks. Um and as a result um any pair of GPUs inside that pod can [00:17:28] um any pair of GPUs inside that pod can talk to each other at a rate of about 50 [00:17:29] talk to each other at a rate of about 50 GB per second. And now you see this is [00:17:32] GB per second. And now you see this is another sort of 20x decrease in memory [00:17:34] another sort of 20x decrease in memory traffic between what an individual um [00:17:36] traffic between what an individual um server can talk and then what any GPU [00:17:38] server can talk and then what any GPU across an entire rack can talk to each [00:17:40] across an entire rack can talk to each other. So 3072 GPUs seems like a lot of [00:17:43] other. 
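The hierarchy just described can be tabulated in a few lines. The bandwidths below are the approximate figures quoted in the lecture, not exact hardware specs:

```python
# Approximate communication bandwidth at each level of the hierarchy,
# in GB/s, as quoted in the lecture; real hardware figures vary.
hierarchy = [
    ("HBM -> compute, inside one H100",      3000),  # ~3 TB/s
    ("GPU <-> GPU, inside one 8-GPU server",  900),  # NVLink-class links
    ("GPU <-> GPU, across a 3,072-GPU pod",    50),
]

# Each step outward costs a large constant factor in bandwidth.
for (fast_name, fast_bw), (slow_name, slow_bw) in zip(hierarchy, hierarchy[1:]):
    print(f"{fast_name} -> {slow_name}: ~{fast_bw / slow_bw:.0f}x slower")
```

The ratios come out to roughly 3x and roughly 20x, matching the decreases described above.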
[00:17:43] So 3,072 GPUs seems like a lot of compute, but it's nowhere near enough. So we stack those GPU pods together into a full GPU cluster. This is the full GPU cluster that Meta built to train their Llama 3 models: it combines eight GPU pods for a total of 24,576 GPUs. I could not find exact numbers on the memory traffic between pods, but it's definitely less than 50 GB per second. And by the way, this is not the largest GPU cluster in the world by a long shot; it's just the biggest one that I could quickly find precise numbers on. There are definitely GPU clusters out there that are 50,000 GPUs, 100,000 GPUs; they exist and people train models on them. And the way this works is it scales out naturally.
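The cluster arithmetic above is easy to check; a minimal sketch using the counts from the Llama 3 report:

```python
# GPU counts at each level of the Llama 3 training cluster, per the lecture.
gpus_per_server  = 8
servers_per_rack = 2
racks_per_pod    = 192
pods_per_cluster = 8

gpus_per_rack    = gpus_per_server * servers_per_rack   # 16
gpus_per_pod     = gpus_per_rack * racks_per_pod        # 3,072
gpus_per_cluster = gpus_per_pod * pods_per_cluster      # 24,576

print(gpus_per_rack, gpus_per_pod, gpus_per_cluster)    # 16 3072 24576
```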
[00:18:27] You would just cluster more pods together to create a bigger cluster, or you might have another level of hierarchy, where a super pod connects to other super pods to get you another level up.

[00:18:38] How long do they train with a GPU cluster like that? I don't remember offhand for the Llama 3 models, but a rule of thumb for the past decade is that the longest training runs people do are usually on the order of months. And I think that has less to do with technology and more to do with people: when it comes to making plans, making progress, and having people work on things, it's very difficult to have training runs that are very, very long. So the longest training runs, the biggest state-of-the-art models, are typically measured in months. I would not be surprised if the very largest models, the GPT-4.5s, the GPT-5s, are pushing closer to a year at this point. But it's pretty common to see training runs on the order of a couple of months on these really big clusters.

[00:19:22] The question is, why do you organize servers into a rack rather than directly into a pod? You've got to put them somewhere, right? There are physical constraints on these things. Server racks have been a standard unit in data centers for decades at this point. When GPUs came onto the scene, they gave you a different kind of server: physically much bigger, drawing a lot more power. But you can't redesign the whole data center from scratch overnight. So the server rack has remained a standard unit, with standard hardware sizes and everything, that data centers are typically built around.

[00:19:53] How much physical space does a cluster typically take? Oh, that's a great question. Think of a single server rack as being around 6 to 8 feet tall, maybe about the size of this podium and about as tall as me. Then you've got 192 racks in a pod, so imagine about 200 of these podiums; how big would that be? Then multiply that by eight. And that's actually a bit of an underestimate, because you typically organize these things in rows so people can actually walk between them.
[00:20:29] And there's more hardware you need to pack into the cluster than just the compute racks. In addition to the compute racks that hold the physical GPU servers, there will be dedicated racks that only hold networking hardware, because a lot of bits need to fly around between all these devices. There will also be dedicated racks that only hold storage hardware, because you need to store the training data somewhere and get it into your devices. So these things can take up quite a lot of space.

[00:20:54] Another question: when you go to these big clusters, do the smaller units of compute maintain their higher throughput? Yes, they do, and that's part of the secret and the challenge of designing for these systems: you ideally want to take advantage of the fast communication when you can get it, but also fall back gracefully to the slower communication across the larger units as you scale up.

[00:21:11] How hot does it get? Pretty hot. If any of you is a gamer and has a 4090 or 5090 GPU in your desktop at home, a single 4090, if you're playing games, will heat up your room, make you want to open the window; it will make the room physically warmer. So imagine, if that's what a single gaming GPU does to an average-sized room, what happens when you stack tens of thousands of them in a big data center: there are some serious cooling requirements for these things. The cooling gets crazy. A gaming desktop will typically be air cooled, sometimes water cooled, and you can design different cooling systems and go nuts on the hardware here to try to optimize all this stuff.

[00:21:54] All right, I think this stuff is super cool: imagining that these GPUs are not just mythical creatures floating around in the cloud. These are actual physical atoms that someone built and stacked up in a room somewhere, and it's really interesting to imagine what they look like.
[00:22:10] So one kind of mindset shift when we move to these big GPU clusters is to think not so much about the individual devices or the individual servers; I basically try to think of the entire data center as one big computer. And this big computer, in this case, has 24,576 GPUs, 1.8 petabytes of HBM memory on the GPUs, 415 million FP32 cores, and 13 million tensor cores, and the whole thing can do 24 exaflops of compute per second. That's 24 × 10^18 floating-point operations per second. That's a lot of flops. But I guarantee you that five years from today it will not feel like a lot of flops, which is the even crazier part. And our goal here is to think of this entire block of 24,576 GPUs as one giant supercomputer.
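As a rough sanity check, those aggregate totals follow from per-device specs. The per-GPU numbers below are my assumptions from NVIDIA's public H100 SXM datasheet, not figures given in the lecture:

```python
# Build up the "one big computer" totals from assumed per-H100 specs.
n_gpus = 24_576
fp32_cores_per_gpu = 16_896    # CUDA cores per H100 (datasheet value)
tensor_cores_per_gpu = 528     # tensor cores per H100
hbm_gb_per_gpu = 80            # HBM capacity per H100, in GB
tflops_per_gpu = 989           # ~dense BF16 tensor-core TFLOP/s per H100

total_fp32_cores = n_gpus * fp32_cores_per_gpu        # ~415 million
total_tensor_cores = n_gpus * tensor_cores_per_gpu    # ~13 million
total_hbm_pb = n_gpus * hbm_gb_per_gpu / 1e6          # ~2.0 PB (lecture quotes ~1.8)
total_exaflops = n_gpus * tflops_per_gpu / 1e6        # ~24 EFLOP/s

print(f"{total_fp32_cores / 1e6:.0f}M FP32 cores, "
      f"{total_tensor_cores / 1e6:.1f}M tensor cores, "
      f"{total_hbm_pb:.2f} PB HBM, {total_exaflops:.0f} exaflops")
```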
[00:22:56] And then the question is: how can we train one neural network for months at a time on this one giant supercomputer, a really gigantic neural network that's really powerful and can soak up tons and tons of data? That's basically the question and the paradigm that we've moved to in deep learning.

[00:23:10] And by the way, I keep saying GPU, I keep saying Nvidia, because they are the most dominant training architecture and hardware today, but there are some others that have sprung up. The biggest competitor to Nvidia training hardware right now, I think, is Google. Google has their own hardware called tensor processing units, TPUs, and these are really good; they've gone through six generations already. These are the stats of the v5p TPU, which you can rent on Google Cloud today, and it's roughly the same order of magnitude, kind of similar specs, as the H100 that we just talked about. There are some interesting design decisions in the TPU that are quite different from the GPUs, which I find fascinating, but we just don't have time to get into them today.

[00:23:52] And someone was asking how big these things are. This is an actual picture. Just like GPUs, TPUs are arranged into pods, and v5p TPUs can be arranged in pods of up to 8,960 chips. This is a picture of a v2 TPU pod, which has only 256 chips, so that gives you a sense of scale: you can see four racks here, each maybe a little taller than me, side by side, for 256 TPU chips. Now imagine how much bigger this gets in the more recent pods with up to almost 9,000 chips.

[00:24:27] Yes, Google's Gemini models are almost certainly trained on TPUs. Of course they don't tell you, but I would be astounded, absolutely astounded, if they were not. And like I said, the TPUs are actually very good; I assume that most large-scale Google models are trained on these things, and those are very competitive models. So this is really good training hardware. The difference from Nvidia is that you can't buy it: the only way you can access TPUs is either by working at Google or by renting them on Google Cloud. It is very good hardware and a lot of people are making use of it, but I think it's still a little bit less popular today than Nvidia GPUs.
[00:25:04] And of course other companies obviously know that this is a very important thing, so there are a lot of other companies trying to build competitive training hardware. But my honest assessment right now is that Nvidia GPUs and Google TPUs are the two big ones; they're way ahead of everyone else today in terms of usability, performance, and market share, though there are a lot of others trying to catch up. Two notable ones: AMD, which has been sort of the second major GPU manufacturer for many decades, has a training accelerator called the MI325X. On paper it actually has really good stats, pretty comparable to an H100, but it just has not had the same impact as the H100 right now. AWS also has their own training chip that they've developed, called Trainium. I don't know too much about this one; I've never tried to use it myself, but I know that Anthropic uses it for some of their training. I don't know to what extent their training is entirely Trainium versus GPUs. So we should expect to see more, but today I think Nvidia GPUs are probably the most dominant, and Google TPUs are right there, really good as well, but probably not quite as widely used.

[00:26:10] Okay, so that's basically part one: what GPUs are and how we arrange them into clusters, just to give you a sense of the physicality of these machines we're building and training on. Then the second question is: how do we actually write algorithms that can make use of this giant GPU cluster with tens of thousands of GPUs?
[00:26:27] It's going to require us to develop new algorithms, new ways of thinking about our compute, and new ways of parallelizing and splitting up our neural networks. So the basic strategy here is going to be to split up your computation. These clusters are giant parallel devices: we saw they have a lot of GPUs, a lot of CPU cores, a lot of GPU cores that can all operate independently, and they can't talk to each other too much. If you think about what a computer really does at a high level, it basically does two things. It does computation, which is taking input bits and computing new output bits from them. And it does communication, which is taking bits and moving them from memory in one place to memory in some other place.

[00:27:04] And the whole trick is: how do we make use of all these multiple scales of memory hierarchy across the entire cluster to overlap the communication with the computation, and also split up and parallelize the computation, so that in the process of training a giant neural network we have useful work for all of those tens of thousands of individual GPUs, all of those millions of individual compute elements, to be doing in parallel, and then get them to communicate their work to each other in a way that adds up to training one giant neural network on this giant cluster?

[00:27:38] To that end, one way I like to think about it is that there are basically five degrees of parallelism that people exploit when training large-scale neural networks today. A lot of this is specific to transformers, because those are the dominant architecture that people are using for large-scale training. A transformer is basically a stack of L layers, and each of those L layers operates on a three-dimensional tensor: one dimension is the mini-batch dimension (we've got a bunch of sequences all operating in a mini-batch), one is the sequence dimension (we're operating on sequences or sets of tokens), and one is the dim dimension (each of those tokens is itself a vector with some dimension). So our transformers operate on these three-dimensional tensors through a stack of layers, which gives us four axes to parallelize on. We can parallelize on the layers axis, which is pipeline parallelism.
[00:28:35] We can parallelize on the batch dimension, which is data parallelism. We can split on the sequence dimension, which is called context parallelism. And we can split on that dim dimension, which is called tensor parallelism. So all of these have kind of funny names, but if you think about it in this way, they're basically all different ways of splitting up your computation across these four axes of compute inside your transformer. And then we're going to step through each one of these in more detail, because there are a lot of interesting nuances with all of these different mechanisms of distributed training. So the first one is data parallelism, or DP. And the basic idea here is kind of simple. Remember, when we're training neural networks, we're always operating on mini-batches of samples, right? Like we're always taking a mini-batch of elements.
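The four tensor axes named above can be made concrete with a small sketch. This is illustrative only (the sizes and variable names are made up, not from the lecture): each parallelism strategy is just a choice of which axis of the computation to split across m devices.

```python
import numpy as np

# Illustrative sizes (made up): batch B, sequence length S, model dim D,
# and L layers in the transformer stack; m is the number of devices.
B, S, D, L = 8, 16, 32, 4
m = 4
x = np.zeros((B, S, D))   # the 3-D activation tensor each layer operates on

# Data parallelism: split the mini-batch axis across devices.
dp = np.split(x, m, axis=0)          # m shards of shape (B/m, S, D)
# Context parallelism: split the sequence axis.
cp = np.split(x, m, axis=1)          # m shards of shape (B, S/m, D)
# Tensor parallelism: split the model-dim axis.
tp = np.split(x, m, axis=2)          # m shards of shape (B, S, D/m)
# Pipeline parallelism: split the stack of layers itself, not the tensor.
pp = np.array_split(np.arange(L), m) # each device owns a contiguous stage
```

The funny names all reduce to "pick an axis": only pipeline parallelism splits the layer stack rather than the activation tensor.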
[00:29:14] We're computing a loss for every entry in our mini-batch, depending on whatever our training task is. Then we compute a gradient, where the gradient is typically an average of the gradients of the losses for the individual elements in the mini-batch. So in most neural network architectures, computing the loss and then computing the gradient is independent for each of the elements in the mini-batch. This is something that seems trivially parallelizable. So the basic idea is: if you can fit a mini-batch of n examples on a single GPU, and you have access to m GPUs, then we're going to train our model with a giant mini-batch of m*n examples, where we split up that giant mini-batch into smaller mini-batches of n samples that go on each GPU.
[00:29:59] And if you think about why this makes sense mathematically, it's because gradients are linear. So in practice, you're computing a single scalar loss L, which is going to be the average of some individual losses computed on each of the x_i, where these x_i are all the entries across your entire macro-batch, I guess we'll call it, and the W are the weight matrices of the entire network. Typically the loss that you're computing at the end of the forward pass is an average of the losses on each of the individual mini-batch elements. And then if you take the gradient of the loss with respect to the weights of the network, which is the thing we need to compute in order to make a weight update, that is actually going to split.
[00:30:38] Because gradients are linear, you get to choose in what order to do the sum, the gradient, and the averaging. In particular, it becomes convenient to arrange the gradient in this particular formulation, where there's an inner term that we've highlighted in blue, which is basically a normal forward-backward pass on n elements, and these can be computed in parallel on different GPUs; and then there's an outer sum, where we need to take an average of the gradients across the m different devices that we're operating on. So that's what's happening from a mathematical perspective, and we see that this is perfectly mathematically sound. It is basically exactly the same, mathematically, as training on a single device.
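The linearity argument can be checked numerically. A minimal sketch (toy linear model and made-up sizes, not the lecture's notation): the gradient of the mean loss over the whole macro-batch equals the average of the per-shard gradients, because differentiation commutes with averaging.

```python
import numpy as np

# Toy setup: m "GPUs" each holding a mini-batch of n examples of a linear
# model with squared-error loss.  All names and sizes here are illustrative.
rng = np.random.default_rng(0)
m, n, d = 3, 4, 5
X = rng.normal(size=(m * n, d))   # the whole macro-batch of m*n examples
t = rng.normal(size=m * n)        # targets
w = rng.normal(size=d)            # shared model weights

def grad(Xb, tb, w):
    """Gradient of 0.5 * mean((Xb @ w - tb)**2) with respect to w."""
    return Xb.T @ (Xb @ w - tb) / len(tb)

# Single-device: one gradient over the full macro-batch.
g_full = grad(X, t, w)

# Data-parallel: each "GPU" computes the gradient of its local shard...
local = [grad(X[k*n:(k+1)*n], t[k*n:(k+1)*n], w) for k in range(m)]
# ...and the all-reduce averages them.  Linearity of the gradient makes the
# average of the shard gradients equal the full-batch gradient exactly.
g_dp = np.mean(local, axis=0)
```

Note this only works as an exact identity because the shards are equal-sized and the loss is an average; that is the algebraic reordering the lecture describes.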
[00:31:20] We've just been clever with our algebra and changed the order of doing our averages and our summations. But this is not an approximation. This is exactly the same computation as we would have done on a single larger GPU. So here's what this looks like from the GPU perspective: we have m GPUs. Here I'm showing m = 3, because that's all that can sensibly fit on the slide, but think of this as much larger than three in practice. Each one of those GPUs maintains its own separate copy of the neural network weights, of the optimizer state, and of the gradients. Then each GPU will load, in parallel, a different mini-batch of data. Here we're showing each GPU loading a mini-batch of three elements. And crucially, the different GPUs need to load different mini-batches of data.
[00:32:05] I've had bugs in my code, and in students' code, where they accidentally load the same mini-batch on all the GPUs. That's not going to help you; that's not going to be good. Don't make that mistake. It's crucially important that your different GPUs actually load different mini-batches of data. Then each GPU will independently do its own forward pass on its own mini-batch of data to compute its own local loss on its own local mini-batch. These can all operate totally independently; it does not require any communication between GPUs. Then each network will do its own backward pass to compute the gradient of its own local loss with respect to all the weights of the model. And again, this can happen totally independently, because each GPU, remember, has its own independent copy of the model weights.
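One way to guarantee the "different mini-batches" property is strided index assignment per rank. This pure-Python sketch (illustrative names; the same basic idea as PyTorch's DistributedSampler) makes each rank's shard disjoint by construction, which rules out the same-batch-everywhere bug.

```python
# Strided per-rank sharding: rank r takes samples r, r + world_size, ...
# Shards are disjoint by construction, so no two ranks can ever load the
# same example in the same step.
def shard_indices(num_samples, world_size, rank):
    """Indices of the samples that `rank` is responsible for."""
    return list(range(rank, num_samples, world_size))

world_size = 3
shards = [shard_indices(12, world_size, r) for r in range(world_size)]
# rank 0 gets [0, 3, 6, 9], rank 1 gets [1, 4, 7, 10], rank 2 gets [2, 5, 8, 11]
```

In real training you would also shuffle with a per-epoch seed shared by all ranks before striding, so every rank permutes the dataset identically and the shards stay disjoint.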
[00:32:45] It can do its own forward and backward pass completely independently. But now, after the backward pass is done, this is where things get tricky. Remember, we said we needed to compute an average of those gradients across all the devices that are participating in our training. So we need communication. This is where we do an all-reduce operation: every GPU needs to send its gradients to all the other GPUs. There are sort of two things happening simultaneously: one, each GPU needs to broadcast its gradients to all the GPUs, and two, each GPU needs to collect the gradients from all the GPUs that are participating in the training. So this is an all-reduce operation, and it typically happens in sort of logarithmic time in the number of GPUs.
[00:33:31] At the end of this all-reduce operation, each GPU now has an average of all the gradients across all the devices. So at this point the communication has happened: each GPU now has an identical copy of the gradients that have been all-reduced across all the devices. Now, at the beginning of the training iteration we assumed that each GPU had its own independent copy of the model weights. At this point each GPU has its own independent but identical copy of the gradients across the entire macro-batch of data. So now each GPU can make a weight update on its own local copy of the weights. And because they started with the same weights and they applied the same gradient, they're going to have the same weights after the local weight update, assuming the arithmetic was deterministic.
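Functionally, an all-reduce with an averaging op leaves every rank holding the same mean of all ranks' inputs. A toy single-process sketch of that contract (real backends such as NCCL's ring all-reduce produce the same result with bandwidth-efficient communication; this only models the outcome):

```python
import numpy as np

def all_reduce_mean(rank_tensors):
    """Toy all-reduce: every rank ends up with the mean over all ranks.
    This models only the result, not the communication pattern."""
    avg = np.mean(rank_tensors, axis=0)
    return [avg.copy() for _ in rank_tensors]   # one identical copy per rank

rng = np.random.default_rng(1)
local_grads = [rng.normal(size=4) for _ in range(3)]  # 3 ranks' local grads
reduced = all_reduce_mean(local_grads)
# After the all-reduce, every rank holds the identical averaged gradient,
# so every rank's subsequent weight update produces identical weights.
```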
[00:34:18] And also, by the way, and this is really important, steps four and five can actually happen in parallel. There are two things here that can happen in parallel: one is the backward pass, where each GPU computes its own gradients, and the other is the communication of the gradients across the GPUs, and in practice these will typically happen simultaneously. So that means each model will start off doing the backward pass over the last layer in the network and compute its own local gradient. Then the model will move its compute on to the backward pass for the second-to-last layer of the model.
[00:34:52] And while the compute elements are busy computing the backward pass on the second-to-last layer, the GPUs will simultaneously be doing an all-reduce of the gradients of the last layer. So these things kind of chunk along, communication for layer L+1 and backward pass for layer L, and they can just proceed in parallel, so that hopefully, by the time we've gotten to the end of the network and the backward pass is done, the gradients have already been all-reduced across all the devices, and we can make our weight update all at once without waiting. This is really important because, like we said, the communication is relatively slow. So the whole trick in these things is figuring out ways to hide the communication costs and do them at the same time as the compute. The question is: is step four or five going to be the bottleneck?
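A back-of-envelope cost model (all numbers invented for illustration) shows why the overlap matters: done sequentially you pay compute plus communication for every layer, while overlapped you pay roughly the larger of the two streams plus one trailing communication chunk.

```python
# Toy cost model (numbers invented): backward compute for one layer takes
# c time units and all-reducing that layer's gradient takes r units, over
# L layers of the network.
L, c, r = 8, 2.0, 1.5

# Sequential: finish the whole backward pass, then do all the communication.
sequential = L * c + L * r

# Overlapped: a layer's all-reduce starts as soon as its backward finishes
# and runs concurrently with the remaining backward compute (assuming one
# transfer in flight at a time).
comm_done = 0.0
for layer in range(L):
    backward_done = (layer + 1) * c          # when this layer's grad is ready
    comm_done = max(comm_done, backward_done) + r
overlapped = max(L * c, comm_done)           # weight update waits for both
```

With these made-up numbers the overlapped schedule finishes in 17.5 units versus 28 sequentially; the communication is almost fully hidden behind compute, except for the final layer's all-reduce, which nothing can overlap.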
[00:35:32] And the answer is: it depends. It depends entirely on how fast your device is, how big your model is, how big your mini-batch is, how fast the interconnect between the devices is. When you get to this low level of distributed training, the answer is always that it depends on your situation, and you need to benchmark for your situation. Ah, why not take m different gradient steps, one on each of them? That's actually a really cool idea. There actually was a popular set of algorithms that people used a while back, called asynchronous SGD, where they would basically do that: have a bunch of different model replicas all take a bunch of independent model steps, and then try to average them every once in a while. And those were popular; Google actually used to do this before they developed the TPU pods.
[00:36:13] Some of their earlier networks in the early 2010s were trained in this way. But one, it tends to just be a lot more unstable; and two, it's very hard to debug and reproduce. So it just tends to work a little bit worse. It does feel like a more scalable approach, but in practice, if you can do everything synchronously, then your algorithms are easier to debug, easier to understand, easier to reason about. If you can get away with synchronous gradient updates, it's probably going to work better. But actually, I would personally not be too surprised if we see a resurgence of async SGD methods at some point in the next couple of years, because I think they are a lot more friendly to distributed training. There's no one computer that can orchestrate all this stuff.
[00:36:51] All these things are independent devices with their own independent stuff. There's no driver that can take a god's-eye view and take those steps. All that computation has to happen somewhere. Ah, great question: as you're overlapping communication and compute, do you need to write code for this, or does the hardware do it automatically? You definitely have to write code for this. The hardware is not smart enough to understand what you want to do. The hardware, like we said, sort of understands these little matrix-multiply chunks; it understands pretty low-level stuff. Anything that you want to do to schedule that communication, you need to take care of in software. But thankfully, for a lot of these common use cases, PyTorch ships with it for you.
[00:37:28] So, for example, in this case there's a PyTorch class called DistributedDataParallel that will do this for you, and make this happen relatively transparently on top of otherwise straightforward PyTorch code that you've written. Although, actually, it is really interesting to contrast that with the individual devices, because if you're programming an individual GPU in CUDA, which is Nvidia's language for programming GPUs, then the hardware actually does take care of a lot of this async transfer for you automatically. But at the cluster level it typically doesn't; there you typically need to do it in software.
[00:37:57] So there actually is a little bit of interesting asymmetry here between parallelism at the individual-device level, where a lot of that does happen automatically in hardware, versus at the cluster level, where it needs to be orchestrated in software. Yeah. So typically these are heterogeneous systems, where different parts of the system are written in different programming languages. There are going to be low-level device kernels, which are the code that actually executes inside the GPU, and those are typically written in CUDA, which is a C-like language that is Nvidia's language for programming their own GPUs. But then those individual GPU kernels get wrapped up, and you can call those GPU kernels from Python. And this is basically how PyTorch works.
[00:38:34] PyTorch is sort of a collection of a lot of GPU kernels that can do lots of interesting stuff on the GPU, plus a lot of C++ and Python code that wraps around those GPU kernels and makes it more user-friendly to program. So in this picture, each GPU is computing its own gradients, in black, by itself, and then the gradients in red get computed via an all-reduce across all the GPUs in parallel. Oh, the backward pass at the lower layer is dependent on the gradients from the previous layer. But crucially, each GPU is only doing the backward pass locally on its own mini-batch. So then there are basically two different variants of the gradient at each layer that you need to think about.
[00:39:09] There are the local gradients, the gradient of the local loss of my mini-batch with respect to my network weights, and there's the global gradient, which is the derivative of the total loss of the macro-batch with respect to the network weights. In order to compute a backward pass, each GPU only needs the local version of its upstream gradient, but computing the global version of the upstream gradient requires communication. So this is data parallelism, and there's actually a bit of a problem here. This is a great way to parallelize GPU computation, and it was the first way that people started parallelizing GPU computation in neural network training, but we quickly hit a bottleneck on the model size.
[00:39:47] Remember that each GPU is keeping its own independent copy of the model parameters, and this becomes a bottleneck when you want to have really big models. In particular, for each weight in your neural network you basically need to keep track of four numbers: the weight itself, the gradient of that weight, and the optimizer state. If you're using Adam, that's typically two moment estimates (the beta-1 and beta-2 accumulators) per parameter in the network. And sometimes you'll also have an exponential moving average of the model parameters as well. So typically you'll have four to five scalars that you need to keep track of for every weight in your network.
[00:40:21] And if you're training with 16-bit precision, which is pretty common these days (some of these you'll sometimes keep in higher precision, but let's take 16-bit as a lower bound), then you need two bytes for each number. Four numbers at two bytes each means eight bytes per scalar in the network to keep track of, which means that 1 billion model parameters is going to take about 8 GB of GPU memory to store all that stuff. And we said the whole GPU only has 80 GB of memory for an H100. So the biggest model you could ever hope to train in this scenario is something like 10 billion parameters, and that's not big enough. We want really big models. We don't want to be constrained by the tyranny of our GPU memory size telling us how big a model we're allowed to train.
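The arithmetic here is simple enough to sketch directly. The scalar counts and the 80 GB H100 figure come from the lecture; everything else is just unit conversion, and the four-scalars-per-parameter count is a lower bound rather than an exact rule.

```python
# Back-of-envelope training-memory accounting: ~4 scalars per parameter
# (weight, gradient, two Adam moments), 2 bytes each at 16-bit precision.
# Mixed-precision recipes often keep some of these in 32-bit, so treat
# this as a lower bound.

BYTES_PER_SCALAR = 2      # 16-bit precision
SCALARS_PER_PARAM = 4     # weight + gradient + Adam first/second moments

BYTES_PER_PARAM = BYTES_PER_SCALAR * SCALARS_PER_PARAM  # 8 bytes

def training_memory_gb(num_params: float) -> float:
    """GB needed to hold weights, gradients, and optimizer state."""
    return num_params * BYTES_PER_PARAM / 1e9

def max_params(gpu_memory_gb: float) -> float:
    """Largest model (in parameters) that fits in one GPU's memory."""
    return gpu_memory_gb * 1e9 / BYTES_PER_PARAM

print(training_memory_gb(1e9))   # 1B parameters -> 8.0 GB
print(max_params(80) / 1e9)      # H100 (80 GB) -> 10 billion parameters
```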
[00:41:01] So we need to fix this somehow, and the fix is actually relatively easy: we need to split the model weights across the different GPUs. So in addition to splitting the batch of data across GPUs, we're also going to split our model weights across the GPUs. This leads to a variant of data parallelism called fully sharded data parallelism, or FSDP. Conceptually it's relatively simple: each model weight in the network, each weight W_i, we are going to assign to an owner GPU. So each weight will be owned by a unique GPU among the GPUs that we're training on, and the GPU that owns each weight will also be responsible for managing the global gradients for that weight and the optimizer state for that weight.
[00:41:47] Typically you would split this up by layer; you're not managing individual scalars. This W you should think of as the weight matrix for an entire layer of the neural network. So now the picture on the right changes a little bit. Here we're only showing two GPUs because, spoiler, there are going to be a lot more arrows flying around in just a moment. So here we're showing a four-layer network that we're distributing across two different GPUs. We've assigned the weights for the first two network layers, W1 and W2; those are owned by GPU 1. The weights W3 and W4 are owned by GPU 2. So at the start of each batch, the network weights are split up across the GPUs in this way. But it's still data parallelism.
[00:42:29] It's still the same basic idea: each GPU is going to load its own independent batch of elements, do a full forward-backward pass on that batch to compute its own local gradients, then we'll reduce the gradients and take a gradient step. Same basic algorithm, but it gets tricky now because the model weights are split up. So here we need to introduce extra communication. When you're doing fully sharded data parallelism, at the beginning of the forward pass, before you start doing the forward pass of the first layer, whoever owns the weight for the first layer needs to broadcast that weight matrix to all the other GPUs that you're training on. So in this case GPU 1 owns W1, so it broadcasts that to GPU 2. GPU 2 now has a copy of W1.
[00:43:07] Now that all the GPUs have a copy of W1, they can run a forward pass through the first layer of the network and compute the activations at the first layer. After you run the forward pass, each GPU that does not own W1 is going to delete its local copy of the W1 weight matrix to save memory. So after we've run the forward pass for the first layer, we're back in the state where the model weights are split up across the GPUs, but now all the GPUs also have activations in GPU memory that are the result of running the first layer of the network. Now it's time to do the second layer, and we do the exact same thing. The GPU that owns the weight matrix for layer 2 is going to broadcast it to all the GPUs that we're training on, and now they all have their own local copy of W2.
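The broadcast / compute / free cycle just described can be simulated in a few lines. This is a single-process toy, not real FSDP: the layer-to-owner assignment follows the two-GPU, four-layer picture from the talk (0-indexed here), and the per-layer "computation" is a stub.

```python
# Toy simulation of the FSDP forward pass: weights are sharded by layer,
# the owner "broadcasts" a layer's weight before it runs, and non-owners
# free their temporary copy afterwards. Illustrative only.

NUM_GPUS = 2
NUM_LAYERS = 4
OWNER = {0: 0, 1: 0, 2: 1, 3: 1}   # first two layers on GPU 0, rest on GPU 1

# resident[g] = set of layer weights currently held in GPU g's memory
resident = {g: {l for l, o in OWNER.items() if o == g} for g in range(NUM_GPUS)}

def fsdp_forward(x: float) -> float:
    for layer in range(NUM_LAYERS):
        # 1) owner broadcasts this layer's weights to every GPU
        for g in range(NUM_GPUS):
            resident[g].add(layer)
        # 2) every GPU runs the layer on its own local mini-batch (stub op)
        x = x + 1.0
        # 3) non-owners delete their temporary copy to save memory
        for g in range(NUM_GPUS):
            if g != OWNER[layer]:
                resident[g].discard(layer)
    return x

out = fsdp_forward(0.0)
# After forward, sharding is restored: each GPU again holds only its own layers.
```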
[00:43:50] Then they can go forward. And by the way, we also have an opportunity to interleave computation and communication here as well, so that while we are computing the forward pass for layer i, we can be prefetching the weights for the next layer. In practice this will happen in parallel during the forward pass of an FSDP run. So we'll be computing layer 2 at the same time we are fetching the weights for layer 3. And once we get to layer 3, note that GPU 1 owns layer 3, so GPU 1 will be broadcasting those weights to all the GPUs that we're training on. This will repeat until we've gotten to the end of the network.
[00:44:24] And at the end of the network, each GPU has computed a full forward pass, has its local loss on its own local mini-batch, and has all the activations for all the layers in memory, ready for backward. Now we need to do the same thing in reverse to compute the backward pass. At the beginning of the backward pass for the last layer, whoever owns the last-layer weights will broadcast them to all the devices. Once the devices have those weights, they can perform the backward pass, and we'll do a similar kind of procedure throughout the backward pass. Now there is a little bit of an optimization we can do on the very last layer in the network, which is: don't delete the weights; have all the GPUs keep the weights for the last layer in memory. This is something that you'll usually do in practice.
[00:45:04] Because at the end, all the GPUs already have a copy of the last layer's weights from the forward pass; they'll just keep it in memory, because they know they're about to reuse it for the backward pass anyway. So we just won't delete the weights from the very last layer. Now there are basically three things that need to happen during the backward pass. One is that once the GPUs have a copy of the weights and have computed the backward pass for the last layer of the network, each GPU has its own local gradients of its local loss with respect to the last-layer weights.
[00:45:40] Then we need to communicate those gradients back, and we said that the GPU that owns a weight matrix is also going to be responsible for managing the gradients for that weight matrix. So rather than all-reducing the gradients as we did in the data parallelism case, just the one GPU that owns the last-layer weights is going to gather and take a sum across the local gradients from all our devices. In this case, GPU 1 is going to send its last-layer local gradient to GPU 2, which will then have the full gradient dL/dW4 of the entire macro-batch with respect to the last-layer weights. What happens during the downtime? You've got to get all this stuff happening in parallel.
[00:46:25] So there are basically three things that need to happen during backward. One, we need to communicate the weights: whatever GPU owns the weights for a layer has to broadcast them. Two, all the GPUs, once they get those weights, need to compute the backward pass for that layer. And three, after each GPU computes its backward pass, it needs to send the resulting gradients with respect to that layer's weights back to the GPU that owns them. And then, once the owner of the weights has the full gradient, only the owner of the weight matrix makes the gradient update on that one weight matrix.
[00:47:05] But at this point we actually do not need to communicate the updated weight matrix, because it will get re-communicated to all the GPUs on the next forward pass; that's a little bit different from the DP case. And basically all of these things can happen in parallel as well. This will repeat for every layer of the network, and in the steady state of a very deep network all three of these things will be happening simultaneously: while we are computing the backward pass for layer L, we will be aggregating the gradients and performing a weight update on layer L+1, and we will be prefetching the weights for layer L-1. So I said there are three things that need to happen.
[00:47:48] We need to get the weights, run the backward pass, and then aggregate the gradients and update the weights, and these things can all happen in parallel. So in general we'll be operating on three consecutive layers and doing all three of these things in parallel over the course of the backward pass. Then, as we chunk backwards over the network, hopefully, if you were properly able to overlap all that communication and computation, by the time you finish your backward pass all the gradients have already been communicated and all the GPUs have already finished their updates on all the weights, and we're ready. And also, hopefully, your data loader that's loading data is also running asynchronously, usually on the CPU cores of our servers.
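The steady-state overlap described here (backward for layer L, reduce-and-update for layer L+1, weight prefetch for layer L-1) can be written out as a schedule. A toy enumeration for a four-layer network; the operation names are illustrative, not an FSDP API.

```python
# Enumerate the three overlapped operations at each backward step.
NUM_LAYERS = 4
schedule = []
for L in range(NUM_LAYERS - 1, -1, -1):        # backward: last layer first
    step = {"backward": L}
    if L + 1 < NUM_LAYERS:
        step["reduce_and_update"] = L + 1      # previous layer's grads land
    if L - 1 >= 0:
        step["prefetch_weights"] = L - 1       # next layer to run backward
    schedule.append(step)

# In the steady state (middle of the network) all three happen at once:
# e.g. backward(2) overlaps reduce_and_update(3) and prefetch_weights(1).
```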
[00:48:29] So the CPU is ready with a fresh batch of data to go forward again. These things are basically parallelization machines: we have a lot of stuff that needs to happen, both within a GPU and across GPUs, and we need to overlap all of it as much as possible, so we can always feed the GPUs and keep those tensor cores running as densely as possible. So then we're basically ready to do our next batch. This is great; this is fully sharded data parallelism, and it can get you a long way. But there's actually a slightly fancier variant of data parallelism that people sometimes use, called hybrid sharded data parallelism, or HSDP. In this case we're going to imagine conceptually dividing our GPUs into a two-dimensional grid.
[00:49:14] In the previous examples we had N GPUs, and the way we parallelized our computation was kind of the same for all of them; we had one axis of parallelization in the previous variants of data parallelism. Once we get to hybrid sharded data parallelism, we now have two separate axes of parallelism that we use at the same time. Along the first axis we do the typical fully sharded data parallelism that we just talked about. So we'll have groups of K GPUs, and each group of K GPUs will be doing FSDP: within each group of K GPUs, the model weights will be split across those K GPUs, and they will be interleaving sending weights and gradients back and forth to each other during the forward and backward passes.
[00:50:00] But we will now have M copies of those K-GPU groups operating in parallel. In this case we have two groups of four GPUs. Each group of four GPUs, you see, has the weights split across its four GPUs, but we have the entire setup duplicated a second time on a second group of four GPUs. And then they do typical data parallelism across the groups. So within a group we do forward-backward, and at the end of the backward pass each group will have computed its own local gradients. Then the groups need to all-reduce the gradients across groups, so that we have the full macro-macro-batch gradients across the two groups, and then each group can make a gradient update independently once it has received the full gradients for the macro-macro-batch.
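One way to picture the two-dimensional grid: with M replica groups of K GPUs each, a global GPU rank maps to a (replica group, shard position) pair. A hypothetical mapping matching the two-groups-of-four example; the function name and layout are illustrative, not a standard API.

```python
def hsdp_coords(rank: int, k: int) -> tuple:
    """Map a global GPU rank to (replica_group, shard_position).

    GPUs in the same replica group do FSDP together; GPUs at the same
    shard position across groups all-reduce gradients with each other.
    """
    return (rank // k, rank % k)

K = 4                                        # FSDP shard-group size
coords = [hsdp_coords(r, K) for r in range(8)]
# ranks 0-3 form replica group 0; ranks 4-7 form replica group 1
```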
[00:50:49] This is called multi-dimensional parallelism, because now there are basically two different axes, two different strategies, that we're using to parallelize our computation simultaneously. And the reason this might be useful is that there are different amounts of communication required for these two kinds of parallelism. Think about fully sharded data parallelism: what do we actually need to communicate during FSDP? During the forward pass, remember, we were copying the weights all over, so over the forward pass we end up communicating one full copy of the network weights. Then during the backward pass we need to re-communicate the network weights, and we also need to communicate the gradients.
[00:51:28] So basically, when you use fully sharded data parallelism, during a single forward-backward pass you need to communicate three times the network weights across everything participating in an FSDP group. But when you do normal data parallelism, where each group keeps its own independent copy of the weights, you only need to all-reduce the gradients. That means that across multiple data parallelism groups, you only need to communicate the network weights once over a forward-backward pass. And this plays into the idea of multiple levels of hierarchy inside our GPU clusters. What you might do, for example, is take a GPU server with eight GPUs and a higher-bandwidth interconnect inside the single machine and make those an FSDP group, because more communication is required inside an FSDP group.
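The communication counts here can be made explicit, in multiples of the total model size |W| per forward-backward step. These are the rough totals the lecture quotes; constant factors from how the collectives are implemented (e.g. ring all-reduce) are ignored.

```python
# Per-step communication volume, in multiples of the total model size |W|.

def fsdp_comm_multiple() -> int:
    broadcast_fwd = 1   # weights broadcast layer-by-layer during forward
    broadcast_bwd = 1   # weights re-broadcast during backward
    reduce_grads = 1    # local gradients reduced to each weight's owner
    return broadcast_fwd + broadcast_bwd + reduce_grads

def dp_comm_multiple() -> int:
    return 1            # only the gradient all-reduce across replicas

# FSDP moves 3x|W| inside a shard group; plain DP moves 1x|W| across groups,
# which is why the chattier FSDP axis goes on the faster intra-server links.
```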
[00:52:11] But then you could have multiple servers on the other axis. So you have one server with a full copy of the model weights, then another server with another full copy of the model weights, and remember that communication across servers is going to be slower than communication inside a server. So this is our first example of designing algorithms to take advantage of the network topology that we know our devices are connected into. [Student question.] The question is, would you rather have... these things are basically impossible to tune; it's very, very hard to say. But once you have data parallelism, FSDP, and HSDP, this is actually a recipe that can take you a long way.
[00:52:50] For example, a model with 100 billion parameters would take 800 GB of memory to store, and if you split that over 80 GPUs, it only takes 10 GB of memory per GPU. So you can have a pretty big model once you have FSDP. [00:53:03] But there's another problem: the model activations themselves now start to fill up memory. If we go back to Llama 3 405B, it's a transformer with 126 layers, a model dimension of 16,000, and sequence length 4096. If you imagine how much GPU memory it takes just to store the hidden states during a forward pass, that's going to be a lot. [00:53:28] So that's going to quickly cause your GPU to run out of memory once your models and sequences get really big. That leads to another trick called activation checkpointing, which means that we're actually not going to store all the activations in memory.
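To put numbers on that, here is a back-of-the-envelope sketch for a Llama-3-405B-like shape, assuming bf16 (2-byte) activations and counting only the per-layer hidden states; attention scores and MLP intermediates make the real total several times larger:

```python
def hidden_state_gb(layers, seq_len, d_model, bytes_per_el=2, batch=1):
    """Memory to keep one hidden-state tensor per layer during the forward
    pass. Real activation memory is several times larger, since attention
    and MLP intermediates also get stored for the backward pass."""
    return layers * seq_len * d_model * bytes_per_el * batch / 1e9

# 126 layers, d_model 16384 (the lecture rounds this to 16,000), seq 4096:
print(hidden_state_gb(126, 4096, 16384))  # ~16.9 GB per sequence
```

Nearly 17 GB per single sequence, before intermediates and before any batching, which is why activations become the bottleneck even after FSDP shards the weights.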
[00:53:36] We're going to recompute them during the backward pass. To see how this works, it's useful to think of your neural network in a different way, where each layer in the network does two things. It does a forward pass that computes activations for the next layer. Then it has a backward pass that computes gradients, which takes both the upstream gradients and the activations. [00:53:56] So normally, how much compute and memory does this all take? If we assume each layer's cost is constant, then a forward-backward pass will step forward 1, 2, 3, 4, remembering the activations during the forward pass, then step backward 1, 2, 3, 4. So a normal forward-backward pass takes O(n) compute and O(n) memory for an n-layer network.
[00:54:21] But as we just said, this is going to run out of memory. So instead, what we can do is recompute the activations during the backward pass. What that looks like is something like this. We'll start with the input, run the forward pass for the first layer, then immediately throw away those activations, and do this four times. [00:54:44] Now we've gone through the network once and have the activations at the last layer. At this point we can compute our backward pass for the last layer. But now we're out of luck: we don't have the activations from A3 to compute the next backward pass. But we can recompute them. So we recompute them, then we can do the backward pass. Now recompute some more, now do the backward pass.
[00:54:59] Now recompute, now do another backward pass. If you add this all up, it ends up being O(n^2) compute and constant memory for a network with n layers, because it's the sum (n-1) + (n-2) + (n-3) + ... + 1. That's quadratic time. [00:55:18] And n^2 compute is pretty bad for deep networks. So instead, let's not recompute everything; let's instead imagine taking a checkpoint of activations every few layers, so we only recompute within smaller blocks of the network. [00:55:30] In that case, if you keep c checkpoints, remembering your activations c times over the course of your network, then it's going to take O(n^2/c) compute and O(c) memory.
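Those counts are easy to sanity-check with a toy counter. In this sketch, "compute" counts layer-forward executions and "memory" counts stored checkpoint activations; every backward step recomputes from the nearest checkpoint, as described above:

```python
def costs(n, c=None):
    """Count layer-forward executions ('compute') and stored checkpoint
    activations ('memory') for a backward pass over an n-layer network.
    c=None: full recomputation from the input for every backward step.
    c=k:    keep a checkpoint every n//k layers and recompute from the
            nearest one (the block scheme described in the lecture)."""
    step = n if c is None else n // c
    checkpoints = list(range(0, n, step))   # activation indices we keep
    compute = n                             # one full forward pass
    memory = len(checkpoints)
    for i in range(n - 1, -1, -1):          # backward through the layers
        start = max(cp for cp in checkpoints if cp <= i)
        compute += i - start                # re-run forwards from checkpoint
    return compute, memory

print(costs(16))        # (136, 1): quadratic compute, constant memory
print(costs(16, c=4))   # (40, 4):  ~n^2/c compute, O(c) memory
print(costs(16, c=16))  # (16, 16): store everything, no recompute
```

With c = 4 = sqrt(16) you can see the middle ground the lecture recommends: far less compute than full recomputation, far less memory than storing everything.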
[00:55:41] A pretty common thing to do is to set c equal to √n, in which case this becomes O(n√n) compute and O(√n) memory. So this is a pretty common way to trade off computation and memory to train even bigger models. [00:55:54] Okay, so at this point, once we have FSDP, activation checkpointing, and HSDP, we can do a lot of damage. We can start to train some really big models. And the recipe for that is basically the following. [00:56:08] The scaling recipe that will take you quite a long way from here: first, use plain data parallelism, roughly up to 128 GPUs and roughly up to models of around a billion parameters. You can just do normal data parallelism for models of this size; it tends to work pretty well. [00:56:24] Another thing: you almost always want to set the local batch size per GPU to max out the GPU memory.
[00:56:28] That's almost always the right thing to do. Then, once your model starts to get big, the model itself will take up a lot of memory inside your GPU, and that will start to give you problems. It depends on how much memory your GPU has and how fast your interconnects are, but in general, once your model is more than a billion parameters, that's when you want to start thinking about switching from data parallelism to fully sharded data parallelism. [00:56:53] At that point you can scale up quite a bit, but then you'll run into the memory bottleneck for your activations, and that's when you turn on activation checkpointing. Activation checkpointing kind of sucks, because it makes everything a lot slower, but it does let you train much bigger models.
[00:57:07] This will scale up to several hundred GPUs, and then at some point, usually depending on your cluster topology, maybe around 256 GPUs, maybe around 512 GPUs, once you get to the order of multiple hundreds of devices, FSDP becomes too expensive and you need to start switching to HSDP. [00:57:26] This is basically going to let you get up to models of roughly tens of billions of parameters training on maybe a thousand GPUs, and that's at pretty long sequence lengths. So that's pretty good. [00:57:38] But if you have more than a thousand GPUs, more than 50-billion-parameter models, or sequence lengths of more than 10,000 or so, this is when you need to turn to the more advanced strategies: context parallelism, pipeline parallelism, or tensor parallelism.
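Collapsed into code, the rules of thumb above might look like this. The thresholds are the lecture's ballpark numbers, not hard limits; real choices depend on GPU memory, interconnect speed, and cluster topology:

```python
def pick_strategy(n_gpus, params_b, seq_len=4096):
    """Rough parallelism recipe from the lecture, as a decision sketch.
    params_b is the model size in billions of parameters."""
    if n_gpus > 1000 or params_b > 50 or seq_len > 10_000:
        return "add context/pipeline/tensor parallelism"
    if n_gpus >= 256:          # FSDP all-gathers get too expensive at this scale
        return "HSDP (+ activation checkpointing)"
    if params_b > 1:           # weights no longer fit comfortably per GPU
        return "FSDP (+ activation checkpointing as activations grow)"
    return "plain data parallelism, max out per-GPU batch size"

print(pick_strategy(8, 0.3))                        # plain data parallelism
print(pick_strategy(64, 7))                         # FSDP
print(pick_strategy(512, 30))                       # HSDP
print(pick_strategy(4096, 405, seq_len=131_072))    # advanced strategies
```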
[00:57:53] And then there's a big question: oh my god, there are a lot of knobs to tune here. How am I supposed to optimize this? I need to set the global batch size, the local batch size, the HSDP dimension, the FSDP dimension. How much do I recompute? I'm lost here; what do I do? There are so many knobs. [00:58:06] The answer is to optimize a very important metric called model FLOPs utilization, MFU. Whenever you get lost in the sea of GPU parallelism, model FLOPs utilization is your guiding light. Follow it, and it will tell you what to do to optimize your training stack. [00:58:22] But before we get to model FLOPs utilization, we need to talk about hardware FLOPs utilization. Remember, we said that in theory an H100 can do 989.4 teraFLOP/s of compute on the tensor cores. But that's theoretical. How much can you actually get?
[00:58:38] The question is how much you can actually achieve in practice, and that's the metric of hardware FLOPs utilization: you're running some compute on the device, and you ask how much of that theoretical maximum you actually realize. [00:58:48] And this is not hard to measure. You can write a couple of lines of PyTorch code and just benchmark it. So this is a benchmark that I wrote and ran on an H100 yesterday. What it does is basically run a dense matrix multiply in a loop and time how long the matrix multiply takes; we can compute how many FLOPs the matrix multiply requires. [00:59:09] Then on the x-axis, we're plotting the size of our matrix, going from 512 up to 32,000.
[00:59:15] And the y-axis is the hardware FLOPs utilization, which is basically the fraction of the theoretical maximum throughput of the device that we actually realize from these matrix multiplies. You can see that with this pretty straightforward PyTorch loop, we're getting about 80% HFU on an H100 once we get to large matrix multiplies of around 8,000 by 8,000. So that's pretty good. [00:59:33] But the problem is that HFU does not account for all the other stuff that your model needs to do, right? We're maybe doing activation recomputation, maybe running some other models on the side, maybe doing data loading and data augmentation. There's a lot of other stuff your GPU is doing other than just forward-backward on your raw model. And that's where we move from hardware FLOPs utilization to model FLOPs utilization.
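The HFU number behind a benchmark like that is just achieved throughput over peak throughput. For a timed square matrix multiply it can be computed as follows (a sketch: a dense n-by-n-by-n matmul costs about 2n^3 FLOPs, the 989.4 TFLOP/s figure is the H100 tensor-core peak quoted earlier, and the example timing is hypothetical):

```python
def hfu(n, seconds, peak_tflops=989.4):
    """Hardware FLOPs utilization for an n x n x n matmul that took
    `seconds` to run: achieved throughput over theoretical peak."""
    flops = 2 * n ** 3                    # multiplies + adds in a dense matmul
    achieved_tflops = flops / seconds / 1e12
    return achieved_tflops / peak_tflops

# Hypothetical timing: an 8192^3 matmul finishing in 1.39 ms
print(f"{hfu(8192, 1.39e-3):.0%}")  # 80%
```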
[00:59:55] Model FLOPs utilization is basically asking what fraction of the GPU's theoretical peak FLOPs is being used for the forward-backward pass of my model. And this is the thing you always want to optimize for. [01:00:06] To make this more concrete: based on your model architecture, the number of layers, and the size of the layers, you compute how many FLOPs it takes to do a full forward-backward pass of your architecture on your minibatch of data. Then you look up the peak theoretical throughput of the device you're running on. [01:00:25] You divide those two, and that tells you how long a full forward-backward pass should take if you were achieving the theoretical maximum throughput of the device. That's the theoretical fastest you could ever do a forward-backward pass on your model.
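That bookkeeping can be sketched directly. One common shortcut (an assumption here, not something stated in the lecture) is that a transformer forward-backward pass costs roughly 6 FLOPs per parameter per token; the step time comes from actually timing your training loop:

```python
def mfu(params, tokens_per_step, step_seconds, peak_tflops=989.4, n_gpus=1):
    """Model FLOPs utilization: the model's forward-backward FLOPs divided
    by what the GPUs could theoretically deliver in the measured step time.
    Uses the common ~6 * params * tokens approximation for fwd+bwd FLOPs."""
    model_flops = 6 * params * tokens_per_step
    deliverable_flops = peak_tflops * 1e12 * n_gpus * step_seconds
    return model_flops / deliverable_flops

# Hypothetical run: a 7B model, ~1M tokens per step, 8 H100s, 13 s per step
print(f"{mfu(7e9, 1_048_576, 13.0, n_gpus=8):.0%}")  # 43%
```

Everything the training loop does beyond the model's own math (data loading, communication, recomputation) shows up as a lower MFU, which is exactly why it is the metric to tune against.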
[01:00:38] Then you actually time a forward-backward pass of your model. Your training loop is doing all this other stuff: data loading, augmentation, communication, maybe activation checkpointing, so it's doing recomputation during the backward pass. Your training loop is doing a lot of stuff. Just time how long it actually takes, and then divide those two numbers. [01:00:55] That gives you a number between zero and one, which is the fraction of that theoretical maximum you're actually achieving in your training loop, and that's your MFU, your model FLOPs utilization. [01:01:07] Again, we can benchmark this with some relatively simple PyTorch code. Here's an example running forward-backward on a short multi-layer perceptron with a ReLU nonlinearity and really big, really wide
[01:01:19] MLP layers and a gigantic batch size on a single H100; this gets around 50% MFU. [01:01:28] In general, whenever you're trying to tune knobs for distributed training, you always want to tune whatever knobs you can to maximize MFU, because that's the one metric we typically care about when trying to optimize training throughput. [01:01:39] And in general, an MFU above 30% these days is pretty good. If you're way under 30%, you've probably got some gigantic bottleneck somewhere and something is going wrong. Above 40% is pretty excellent, and that's basically state of the art. [01:01:55] Here are some numbers that we can pull from a couple of papers. In particular, this is that Llama 3 405B paper that we talked about.
[01:02:03] In their final training setup, they have a couple of different variants of their training phases, where they train on between 8,000 and 16,000 GPUs simultaneously. Across those, they're getting MFUs roughly in the high 30s to low 40s, and that's pretty good; you're never going to get much higher than that on an H100. [01:02:22] Actually, paradoxically, more recent devices sometimes get worse MFUs. On the previous generation of devices, the A100s, you could sometimes get MFUs above 50%. The reason is that GPUs are getting faster at compute faster than they are getting faster at communicating. [01:02:38] When we moved from the A100 to the H100, we got roughly a 3x improvement in theoretical compute throughput, but only a 2x improvement in theoretical memory bandwidth.
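You can see the consequence of that imbalance with a little roofline arithmetic. The spec numbers below are rough public figures, used here as assumptions rather than taken from the lecture; the point is that the matrix size needed to stay compute-bound roughly doubles across the generation:

```python
def min_compute_bound_matmul(peak_tflops, mem_bw_tb_s, bytes_per_el=2):
    """Smallest square matmul size n whose arithmetic intensity
    (2n^3 FLOPs over ~3n^2 * bytes of memory traffic) reaches the
    device's FLOPs-per-byte balance point."""
    balance = peak_tflops / mem_bw_tb_s        # FLOPs per byte at the roofline
    # intensity(n) = 2n^3 / (3 n^2 bytes) = 2n / (3 bytes) >= balance
    return int(balance * 3 * bytes_per_el / 2)

# Rough public specs (assumptions): A100 ~312 TFLOP/s, ~2.0 TB/s HBM;
# H100 ~989 TFLOP/s, ~3.35 TB/s HBM.
print(min_compute_bound_matmul(312, 2.0))    # 468
print(min_compute_bound_matmul(989, 3.35))   # 885
```

Compute grew faster than bandwidth, so kernels need nearly twice the arithmetic intensity to stay compute-bound, and utilization numbers drift down.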
[01:02:47] So there's this growing gap: GPUs are getting faster really fast, but it's harder to scale the communication between them, and as a result we sometimes tend to get worse MFUs on more recent generations of devices. [01:03:02] And I intentionally wanted to spend most of the time on those points, because those are the ones you're probably going to use in practice. I don't think anyone in this room likely has access to a 10,000-GPU cluster. If you do, come talk to me after class; I would love to be your friend. [01:03:19] So those are the ones you're likely to encounter in practice, up to many hundreds of GPUs.
[01:03:24] But there are these other strategies, and there are slides here that I think are pretty nice, but it's okay if we don't go through the full details; you can check them offline. [01:03:31] So, we said context parallelism is basically splitting on the sequence dimension. Transformers operate on sequences, and the idea is that you've got a long sequence, so you make different GPUs handle different parts of it. [01:03:47] If you recall your transformer block, this is actually easy for large parts of the transformer, because the layer norm, the FFN/MLP, and the residual connections all operate independently across the sequence anyway. So it's relatively straightforward to chunk up that computation across the sequence dimension.
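That "easy" part can be seen in a few lines: applying a position-wise layer to sequence shards and concatenating gives exactly the same result as applying it to the whole sequence. This is a single-process sketch with a toy element-wise "MLP" standing in for the position-wise parts of the block:

```python
def pointwise_layer(x):
    """Stand-in for layernorm / MLP / residual: acts on each sequence
    position independently, so shards don't need to talk to each other."""
    return [v * 2.0 + 1.0 for v in x]

def shard(seq, n_workers):
    """Split a sequence into n_workers contiguous chunks."""
    k = (len(seq) + n_workers - 1) // n_workers
    return [seq[i * k:(i + 1) * k] for i in range(n_workers)]

seq = [float(i) for i in range(16)]
full = pointwise_layer(seq)
# "Context parallelism" for position-wise ops: each worker gets one shard.
sharded = [y for chunk in shard(seq, 4) for y in pointwise_layer(chunk)]
assert sharded == full  # identical result, no communication needed
```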
[01:04:05] It does get a little bit hairy inside the MLP, because there are weights in there, so you have to do some reduce of the gradients, like we did in the data parallelism case. [01:04:14] The attention is where things get hairy for sequence parallelism, because, if we remember attention, we need to compute the all-pairs interaction between every pair of elements in the sequence. The QKV projection is easy, because that's trivially parallelizable over the sequence, but that core attention matrix gets pretty tricky to parallelize. [01:04:33] The first version of this that people developed is called ring attention, where you basically take that full attention matrix, chunk it up into blocks, and then have your GPUs work on those blocks independently in parallel, in the right order to make sure everything works out.
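The trick that makes this block-wise processing possible is computing a softmax-weighted sum incrementally, one key/value block at a time, carrying a running max and normalizer. Here is a single-process sketch of that accumulation for one query; ring attention distributes these blocks across GPUs, while this just shows the math works out:

```python
import math

def blockwise_attention(scores, values, block=2):
    """Attention output for one query, softmax(scores) . values, computed
    block by block with a running max `m` and normalizer `l`, so the full
    score row is never materialized at once."""
    m, l, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)          # rescale previous partial sums
        exps = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(exps)
        acc = acc * scale + sum(e * v for e, v in zip(exps, v_blk))
        m = m_new
    return acc / l

scores = [0.1, 2.0, -1.0, 0.5]
values = [1.0, 2.0, 3.0, 4.0]
# Matches the direct (all-at-once) computation:
w = [math.exp(s - max(scores)) for s in scores]
direct = sum(wi * vi for wi, vi in zip(w, values)) / sum(w)
assert abs(blockwise_attention(scores, values) - direct) < 1e-12
```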
[01:04:47] There are a lot of details in there; you can check out the paper for more details. The second one, which is a little bit conceptually easier, is called Ulysses attention, where you do parallelism over the heads. Remember, in a transformer you're almost always doing multi-head attention, where you're computing attention over multiple attention matrices all in parallel. So in Ulysses attention, we're going to parallelize the computation of that core attention operator over the heads, and then everything else, all the other parts of the transformer, is parallelized over the sequence dimension. And as an example, this context parallelism becomes important once you scale up your sequence length to be quite large.
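A toy sketch of the head-partitioned idea (hypothetical helper names; a serial loop stands in for the all-to-all between GPUs): each head's attention is fully independent, so disjoint subsets of heads can live on different devices.

```python
import math

def one_head(Q, K, V):
    # Plain single-head attention (no masking, no scaling, for brevity).
    out = []
    for q in Q:
        w = [math.exp(sum(a * b for a, b in zip(q, k))) for k in K]
        z = sum(w)
        out.append([sum(wi * v[d] for wi, v in zip(w, V)) / z
                    for d in range(len(V[0]))])
    return out

def multi_head(heads):
    # heads: list of (Q, K, V) per head; concatenate head outputs per position.
    outs = [one_head(*h) for h in heads]
    return [sum((o[i] for o in outs), []) for i in range(len(outs[0]))]

def multi_head_sharded(heads, n_devices):
    # Each "device" owns a disjoint subset of heads, needing no other head's data.
    per = (len(heads) + n_devices - 1) // n_devices
    shard_outs = []
    for d in range(n_devices):
        shard_outs.extend(one_head(*h) for h in heads[d * per:(d + 1) * per])
    return [sum((o[i] for o in shard_outs), []) for i in range(len(shard_outs[0]))]

headA = ([[1.0, 0.0], [0.0, 1.0]],) * 3   # (Q, K, V), all equal for brevity
headB = ([[0.5, 0.5], [1.0, -1.0]],) * 3
heads = [headA, headB]
assert multi_head(heads) == multi_head_sharded(heads, 2)
```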
[01:05:26] So if we go back to this example of Llama 3 pre-training, they actually train the model in two stages. In the first stage, they use a sequence length of 8,000 with no context parallelism whatsoever. Then they have a second stage of training where they crank the sequence length up to 131,000, and at that point they do 16-way context parallelism. That means each of those 131,000-token sequences has 16 GPUs operating on it in parallel. And that's kind of like saying the batch size is 1/16, because now each GPU is working on less than one element. So that's context parallelism. Pipeline parallelism: here we're going to split across the layers dimension. Intuitively, you have a network with a bunch of layers, and we're just going to divide the layers across the GPUs.
[01:06:13] That's actually a very intuitive thing to do. The problem is that there are sequential dependencies, right? Each GPU needs the activations from the previous GPU to continue running the forward pass, and during the backward pass it needs the gradients from the upstream layers in order to compute the backward pass. So we can draw a diagram like this, where the vertical axis is GPUs 1 to 4 and the horizontal axis is what happens over the course of time. Then you can see that GPU 1 runs forward, then passes the activations to GPU 2, which passes activations to GPU 3, which passes activations to GPU 4. GPU 4 is lucky: it can do forward and backward all at once, then pass gradients back to GPU 3, back to GPU 2, back to GPU 1.
[01:06:53] From this graph, that's obviously really, really bad, because the GPUs are mostly sitting idle. In fact, if you have n GPUs, you're only getting useful work out of them 1/n of the time. That means that if we had 8-way pipeline parallelism, your maximum possible MFU at that point is about 12%, which is terrible. And by the way, there's a cute name for this: it's sometimes called the bubble, that chunk of time where GPUs are waiting for work, waiting around for the communication. So the trick in pipeline parallelism is to shrink the bubble. You want to have less bubble, and the way we do that is by running multiple micro-batches simultaneously.
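The bubble math can be sketched with a standard back-of-the-envelope formula (an idealized approximation, not the lecture's exact accounting): with p pipeline stages and m micro-batches, each stage does useful work for m ticks out of roughly m + p - 1, so best-case utilization is m / (m + p - 1).

```python
def pipeline_utilization(stages, microbatches):
    # Fraction of time a stage does useful work in the idealized schedule.
    return microbatches / (microbatches + stages - 1)

# One batch through 8 stages: 1/8 = 12.5%, the "terrible" ~12% case.
assert round(pipeline_utilization(8, 1) * 100, 1) == 12.5
# Four micro-batches through 4 stages: 4/7, about the 57% from the slides.
assert round(pipeline_utilization(4, 4) * 100, 1) == 57.1
```

As m grows, utilization approaches 100%, which is exactly the motivation for running many micro-batches, at the cost of holding more activations in memory.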
[01:07:31] So now, rather than running one batch of data through all the GPUs forward and backward, we're going to have multiple batches of data in play simultaneously and shuttle them across the GPUs in parallel. There are a lot of different interesting patterns you can try to design for this, but here's a relatively simple one, where we have four-way pipeline parallelism: four GPUs all working in parallel, and four batches of data all active at the same time. The batches are color-coded. We see that GPU 1 runs forward on the blue batch, then forward on the yellow batch, then forward on the green batch, then forward on the red batch.
[01:08:08] And while GPU 1 is going forward on the yellow batch, we've passed the activations of the blue batch to GPU 2, and GPU 2 can now do forward on the blue batch. So these things can all cascade down and happen in parallel, and the same kind of pattern repeats during the backward pass. We can interleave these different micro-batches as we pipeline them through the different GPUs. In this case, with four-way pipeline parallelism and four micro-batches, the max theoretical MFU is just the fraction of this graph which is not white, and that increases now to 57%, which is pretty good.
[01:08:42] So with pipeline parallelism, in theory, if you go to lots and lots of micro-batches, your MFU is going to be good, because you can do a lot of work in parallel. But the more micro-batches you have, the more activations you need to store in memory, so now we need to do activation checkpointing. And then you think, oh crap, how do I tune these things? Should I have more pipeline parallelism? Should I have fewer micro-batches? Should I have more aggressive activation checkpointing? And then should I also layer data parallelism on top of that? I don't know. What are you going to do? Maximize MFU. You're going to try to tune all of those knobs to maximize the MFU of your training pipeline. Then the last one is tensor parallelism. For this one, you're going to split on the model dimension.
[01:09:24] Basically what we're going to do is: we have a lot of weight matrices in our model, and all those weight matrices are computing XW = Y. That's basically what we're doing over and over again inside our transformer. Now the idea is that we'll split each weight matrix across GPUs, and this is different from FSDP, because here we're actually splitting a single weight matrix across GPUs and computing with the shards directly. Each GPU does a block matrix multiply: each GPU computes a slice of that matrix multiply on the full input. In this case, we split our weight matrix into W1, W2, W3, W4, and then each GPU just computes a slice of that matrix multiply to compute a slice of the output.
And then a problem is that now you know [01:10:04] then a problem is that now you know after you do that forward pass then you [01:10:06] after you do that forward pass then you need to gather the activations across [01:10:08] need to gather the activations across all the GPUs to do the next to do the [01:10:10] all the GPUs to do the next to do the next forward pass. Um there's a slight [01:10:12] next forward pass. Um there's a slight trick which is if you have two of these [01:10:14] trick which is if you have two of these layers in sequence, you can actually get [01:10:16] layers in sequence, you can actually get away with not computing with not uh not [01:10:19] away with not computing with not uh not gathering in between two layers. So if [01:10:21] gathering in between two layers. So if you have two layers, you can sit down in [01:10:23] you have two layers, you can sit down in a quiet place and work through this. You [01:10:25] a quiet place and work through this. You split the first matrix into [01:10:26] split the first matrix into column-shaped chunks, then you split the [01:10:28] column-shaped chunks, then you split the second matrix into row-shaped chunks. [01:10:30] second matrix into row-shaped chunks. And if you do all this, then it all kind [01:10:32] And if you do all this, then it all kind of works out magically because of the [01:10:34] of works out magically because of the magic and mystery of block matrix [01:10:35] magic and mystery of block matrix multiplication. And you see that the [01:10:37] multiplication. And you see that the final output you can kind of compute as [01:10:38] final output you can kind of compute as an inner product like structure of these [01:10:40] an inner product like structure of these um of this block of these block matrix [01:10:42] um of this block of these block matrix multipliers of Y and U. So then you [01:10:45] multipliers of Y and U. 
[01:10:45] So you can basically have two layers of matrix multiply that are split across multiple GPUs, and they only need to communicate at the end of every two layers. This actually works out nicely because, remember, transformers have a two-layer MLP in the FFN, so it's a really nice trick that plays really nicely into the two-layer MLPs that transformers always have. It's pretty common in big transformers to use this two-layer tensor parallelism trick on the MLP in a transformer. So those are basically all of our mechanisms for splitting up computation across GPUs. Which one is the best? The actual answer is: all of them. In practice we're going to use N-D parallelism. We already saw an example of two-dimensional parallelism with HSDP.
[01:11:31] In practice, you know, the current state of the art is four-dimensional parallelism. If we go back to Llama, we see that on their biggest training run, with 16,000 GPUs, they're using 8-way tensor parallelism, 16-way context parallelism, 16-way pipeline parallelism, and 8-way data parallelism, all at the same time (8 × 16 × 16 × 8 = 16,384 GPUs). And these different mechanisms of parallelism have different communication requirements, so if you're careful in how you arrange these different axes of parallelism across your cluster, you can try to take advantage of the varying speeds of communication across your whole cluster. And that's basically a whirlwind tour of large-scale distributed training. So the takeaway for today is that an individual GPU is basically a general-purpose parallel computing machine.
[01:12:17] A GPU cluster is a giant, massively parallel machine with tens of thousands, maybe hundreds of thousands, of individual GPUs that we want to program as one big unit. And then we talked about multiple different mechanisms for parallelizing our computation across big clusters, as well as one trick, activation checkpointing, for saving memory. And then there's the one guiding-light metric that you're always trying to optimize when you design these pipelines, which is model FLOPs utilization. So the next time you're going out and training on tens of thousands of GPUs, I hope you keep this in mind. And let me know, so I can borrow your tens of thousands of GPUs.
================================================================================
LECTURE 012
================================================================================
Stanford CS231N | Spring 2025 | Lecture 12: Self-Supervised Learning
Source: https://www.youtube.com/watch?v=4howBU7THbM
---
Transcript
[00:00:05] Last time, on Tuesday this week, we had a lecture on GPUs: how to train and how to use them, and how to use multiple GPUs for training at larger scale, scaling your trainings and so on. That was an exciting new topic that we've added to this class this year, which I think is timely and very important with the increase in model sizes and the applications that you see AI models have these days. And before that, we covered all of the key tasks in computer vision: classification, semantic segmentation, object detection, instance segmentation, and so on. And we're going to revisit some of these topics, some of the results of the models we talked about, today.
[00:01:05] But those tasks are still quite important. And then we talked about visualizing and understanding the models, and seeing what the models are learning. For example, in the early sessions we talked about nearest neighbor in the pixel space, and how we can find the class of images based on only pixel-space distances, and we discussed how it's actually not efficient. One of the things that we talked about was that if we use the embedding layers or the feature layers, one of those fully connected layers, or the feature maps there, from a convolutional neural network or any other network architecture that we use, that could actually be a good representation of images.
[00:02:12] And we talked about using the L2 distance as the metric for nearest neighbor in the feature space of these models, right? So this means that these features are quite meaningful for the specific task that we had at hand. Specifically: if we run a neural network, a CNN, a ResNet, or even the transformer models, and look at the learned representations (in different contexts you may see these referred to by different names: representations, features, embeddings, latent space, and so on), these learned representations or features are very good representatives of the images. And if we have a way to extract those, we can always get the class labels out of those features by a simple linear model, as you can see at the end here.
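The nearest-neighbor-in-feature-space idea can be sketched in a few lines (a toy example: the hand-made feature vectors and labels below stand in for real penultimate-layer CNN embeddings):

```python
def l2(a, b):
    # Euclidean (L2) distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_neighbor_label(query_feat, train_feats, train_labels):
    # Return the label of the training feature closest to the query.
    dists = [l2(query_feat, f) for f in train_feats]
    return train_labels[dists.index(min(dists))]

# Toy feature vectors standing in for learned image embeddings.
train_feats  = [[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.1]]
train_labels = ["cat", "cat", "dog", "dog"]
assert nearest_neighbor_label([0.8, 0.2], train_feats, train_labels) == "cat"
```

The same query done on raw pixels instead of features is exactly the weaker baseline from the early lectures.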
[00:03:23] But the major challenge that exists is that training or building these networks at large scale is always challenging, and can you tell me why there's a challenge here? The thing is that at large scale we need a lot of labeled data, because this network is trained starting from an image, and at the end we have class labels. If we train this network, yes, these features are going to be very useful for getting those class labels out, right? But at scale, we need a lot of manual labeling effort, to sit down and label the images one by one. If the task is segmentation, you have to label the pixels one by one in every image, and that is going to be very challenging. So the question is: is there a way we can train neural networks without the need for huge manually labeled data sets?
[00:04:25] So these manual labels are the challenge, and we want to see if we can bypass them in a way that lets us train a neural network that gets us very good features. And with that, the topic of self-supervised learning comes to light, and that's what we are going to cover today. So, having a large data set of, say, images without any labels, our hypothesis is that we can train a neural network using an objective function, a pretext task, that gets us good features from the images. And then, when it comes to learning on a specific data set with a smaller set of data points which do have labels, we can basically transfer this trained encoder and use it to extract features for a downstream task or downstream objective.
[00:05:34] So here we want to define a pretext task: a task that is general enough to be able to learn some good features from the images, and then use that encoder (let's call it the encoder) to solve another problem, what we call the downstream task or downstream objective, which is the application that you care about. For example, we have a lot of natural images downloaded from the internet; we can train something out of that. Then we have a small data set for, say, one of these industrial applications or medical applications, where we have a few labeled images, and we can now use that transfer of knowledge to extract features and then classify, or perform whatever task we are interested in. So we want to delve into this topic and go a little bit deeper in understanding the different components.
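A minimal sketch of that transfer recipe (all names and numbers here are made up for illustration): a frozen "encoder" (just a fixed linear map standing in for a pretrained network) plus a small logistic-regression head fit on a handful of labeled downstream examples.

```python
import math

# Frozen "encoder": a fixed 2 -> 3 linear map standing in for a pretrained net.
ENC = [[1.0, 0.2, -0.5],
       [0.3, 1.0, 0.8]]

def encode(x):
    # Never updated during the downstream fit: features are simply extracted.
    return [sum(xi * w for xi, w in zip(x, col)) for col in zip(*ENC)]

def train_linear_probe(xs, ys, steps=300, lr=0.5):
    # Fit only a linear head (logistic regression) on top of frozen features.
    feats = [encode(x) for x in xs]
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(steps):
        for f, y in zip(feats, ys):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            g = 1.0 / (1.0 + math.exp(-z)) - y   # logistic-loss gradient
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    f = encode(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

# Tiny labeled downstream set: class 1 when the first coordinate dominates.
xs = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
ys = [0, 0, 1, 1]
w, b = train_linear_probe(xs, ys)
assert [predict(w, b, x) for x in xs] == ys
```

In a real pipeline the encoder would be a deep network pretrained with a pretext task, and the head could equally be a small MLP fine-tuned on the labeled set.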
[00:06:39] In a nutshell, what self-supervised learning is, as I said, is defining this pretext task on the dataset with no labels. The encoder gets us some learned representations, and then another module of the same neural network transfers the learned representations into the output space, which could be labels or outputs that are automatically generated from the data; they are not manual annotations. So if we can do this, then we have an objective function, a loss function, and a neural network to be trained with that loss function. [00:07:28] As you can see here, we sometimes call the second part a decoder, a classifier, or a regressor, depending on how we define our pretext task.
[00:07:38] I will give you some examples, but this could be any form of framework; when it's an encoder and then a decoder, this is more of an autoencoding framework that I'll talk briefly about. Okay, so after we do this training with the pretext task, we can now use the encoder and the learned representations for a downstream task, for which we just need to add one layer, a linear function, or even a fully connected neural network that predicts the labels, and these labels are now coming from the dataset. So that's the main concept of self-supervised learning: the pretext portion of it doesn't require any labeled data to do the training. But how to define the pretext task itself is not that straightforward. There are many different ways of defining those.
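To make the two phases concrete, here is a minimal sketch of the pipeline just described, assuming PyTorch is available; the tiny encoder, the 4-way pretext head, and the layer sizes are all placeholders for illustration, not the networks used in the lecture.

```python
import torch
import torch.nn as nn

# Placeholder encoder: a real one would be a deep conv net.
encoder = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),          # -> 8-dim features
)
pretext_head = nn.Linear(8, 4)                      # e.g. 4 rotation classes

# Phase 1: pretext training on unlabeled images; labels are generated, not annotated.
opt = torch.optim.SGD(
    list(encoder.parameters()) + list(pretext_head.parameters()), lr=0.1)
x = torch.randn(16, 3, 32, 32)                      # a batch of unlabeled images
y = torch.randint(0, 4, (16,))                      # automatically generated labels
loss = nn.functional.cross_entropy(pretext_head(encoder(x)), y)
opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: freeze the encoder and train only a new shallow head on the labeled set.
for p in encoder.parameters():
    p.requires_grad = False
probe = nn.Linear(8, 10)                            # 10 downstream classes
with torch.no_grad():
    feats = encoder(x)
logits = probe(feats)                               # trained with the real labels
```

Only `probe` (and optionally the last encoder layers) would be updated downstream; everything upstream keeps the pretext-learned weights.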
[00:08:48] For example, just keep in mind that we want to define the pretext task in a way that is, first, general enough to get us good features, and that doesn't require manual labeling; the labels should come from the data itself, right? So one example would be image completion, where we mask half of the image or parts of the image, and we define the task as: given the parts that are unmasked, predict the parts that are masked. Or, for example, we rotate the image by a specific angle, and the task is to take the image as input and predict the rotation angle it has gone through. Another one could be a jigsaw puzzle, where we have patches of the image that are not ordered, and the task is to output the correct order of these patches.
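The key property of all these tasks is that the labels come for free from the data. For the rotation task, for instance, a single unlabeled image yields four training examples and their labels with no annotation at all; a small NumPy sketch (the helper name is made up for illustration):

```python
import numpy as np

def make_rotation_batch(image):
    """Given one square H x W x C image, return its 4 rotated copies and the
    class labels 0..3, where label k means a rotation of k * 90 degrees."""
    views = [np.rot90(image, k=k, axes=(0, 1)) for k in range(4)]
    return np.stack(views), np.arange(4)
```

Applying this to every image in an unlabeled dataset produces a fully labeled 4-class pretext dataset.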
[00:09:52] And colorization is one of the popular ones. These are the four that we'll be covering very quickly today. Given the black-and-white version of the image, predict the colors for each of the pixels. So solving the pretext task allows the model to learn good features, which is what we wanted, and we can automatically generate labels for the pretext task; those are the two points I mentioned that we need for a task to qualify as a good pretext task for self-supervised learning. [00:10:28] Some quick considerations to always keep in mind on how to evaluate a self-supervised learning framework. There are many different pieces and areas that you can look into. The pretext task itself: because we are generating the labels and so on, it gives us the power to evaluate how well the model is able to solve that pretext task.
[00:11:02] So that's one of the factors. Then, representation quality itself is sometimes very important: looking at, for example, only the representations, without any fine-tuning or anything (which I'll be talking about), or even clustering the representations to see whether we see a pattern in them. And sometimes there are some good dimensionality reduction algorithms; I'm referring to t-SNE here, which we didn't really talk about, but it is a dimensionality reduction framework with which you can reduce the dimensionality of the learned representations, visualize them in 2D or 3D, and see if there is a pattern you can find in your representations.
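As a concrete example of that kind of inspection, scikit-learn's `TSNE` can project learned representations down to 2D for plotting; random features stand in for real ones in this sketch.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for learned representations: 50 samples, 64-dimensional features.
feats = np.random.default_rng(0).random((50, 64))

# Reduce to 2D; each row of `emb` can then be scatter-plotted, colored by
# class, to see whether the representations cluster into a visible pattern.
emb = TSNE(n_components=2, perplexity=10, init="pca",
           random_state=0).fit_transform(feats)
```

With good representations, points from the same underlying class tend to form visible clusters in the 2D plot even though no labels were used in training.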
[00:11:49] So robustness, generalization, and computational efficiency are all quite important, but the most important aspect that we are after is the performance on the downstream task, because we are doing the entire self-supervised learning pipeline, defining the pretext tasks and so on, to be able to improve results for a task of interest, something that we care about. [00:12:20] Let's see some quick examples of how this could be done. This is an example of: let's rotate the images and then predict the degree of rotation as the output. We can train this in a self-supervised manner without needing labels for the objects in the image. And we have a bunch of convolution layers in this example, and then a fully connected neural network at the end to do this regression or classification task.
[00:12:52] And this means that this is giving us some sort of good feature extractor, from which we can at the end remove the pretext-task-specific parts, the FC layers, and put one layer, or in some cases multiple layers, to classify the features into the object label. So this time we use the object labels to do the prediction and train this linear function itself. We often look for a shallow network here, because if the features are good enough, then we don't need to do a lot of training to get the class labels out.
[00:13:49] So this is self-supervised learning in general, and although we are talking about the computer vision applications, this paradigm of self-supervised learning is actually what enabled all of these large language models; GPT-4 and all of these frameworks are trained with mostly raw data, without any manual labeling. And not just language models: in speech, and these days quite a lot in robotics and reinforcement learning, because when we don't need any labeled data, we can start capturing raw data without any manual labeling and use it for training. That's why you see so many self-driving cars in the Bay Area collecting data: that's getting them the data, and they don't have to really annotate it, but they can still train models based on it.
[00:14:57] So with that, today's agenda: we'll cover some of these pretext tasks built from image transformations, and then I will talk a little bit about a set of algorithms around contrastive representation learning, which are slightly different from these image-transformation-based pretext tasks but have shown promise. So let's start with the first part, and we will cover the tasks one by one. [00:15:36] I talked quite a lot about predicting rotations. Let's see if we can actually rotate the images by random or arbitrary degrees and predict the rotation angle with a model. Our hypothesis here is that a model could recognize the correct rotation of an object only if it has the common sense, the visual common sense, of what the object should look like unperturbed.
[00:16:14] So these models are mostly designed around this concept of visual common sense, and if the model is able to capture that, it means it is also able to summarize the entire image into a useful set of good features. This paper, published in 2018, implemented this by exploring just four different angles: 0, 90, 180, and 270 degrees, rotating images by one of these angles and then using a convolutional neural network to predict which of the rotations was applied. And because they only created four different outputs, this is a classification task; it only has four different cases, right?
[00:17:21] It doesn't have to predict the exact value of the angle in degrees; it's actually just predicting one of these four classes, 0, 1, 2, or 3. So with that, the authors were able to learn good representations, and with those representations they started training the neural network on a downstream application, basically fine-tuning the encoder and the classifier. Actually, in this case they froze the first and second layers and then fine-tuned the last convolution layer and the linear layer. So they were not fine-tuning the entire network, but they were able to get very good results. [00:18:35] This is on the CIFAR-10 dataset, one of the datasets that we've talked about earlier.
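The partial fine-tuning just described (freeze the early layers, tune only the last convolution layer and the linear classifier) is done in PyTorch by turning off gradients; the layer sizes below are illustrative stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                  # stand-in for the pretrained conv net
    nn.Conv2d(3, 8, 3, padding=1),      # first conv layer  (frozen)
    nn.Conv2d(8, 8, 3, padding=1),      # second conv layer (frozen)
    nn.Conv2d(8, 8, 3, padding=1),      # last conv layer   (fine-tuned)
    nn.Flatten(),
    nn.Linear(8 * 32 * 32, 10),         # linear classifier (fine-tuned)
)
for layer in model[:2]:                 # freeze the first two layers
    for p in layer.parameters():
        p.requires_grad = False

# Only the unfrozen parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
```

Passing only `trainable` to the optimizer means the frozen layers keep their pretext-learned weights throughout downstream training.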
[00:18:44] You see that when the model is pre-trained, it starts with a good accuracy to begin with. That means it is already in good shape and has a good understanding of the objects even in the very first iterations. But if the task is simple enough, and CIFAR-10 is actually not too hard to train a model for, the fully supervised version and the one that starts with pre-training often converge to the same accuracy. Again, that's if the task is simple enough; in much harder applications, the supervised learning frameworks often don't get as good results if we don't do any larger-scale pre-training. Okay.
[00:19:37] They have also done some experiments on the PASCAL VOC 2007 dataset, which involves a number of tasks including classification, detection, and segmentation. For these three sets of tasks they used different setups, training just a few fully connected layers or all of the layers for the classification, detection, and segmentation tasks. If you look at the ImageNet-labels row, that is, if we have a huge labeled dataset and we pre-train on it, we already get a very high accuracy; but keep in mind that this is ImageNet with all of the labels involved in the pre-training. If, instead, we don't do any supervised pre-training and the pre-training is all self-supervised, it shows that this rotation framework is doing a much better job than many of
the other counterparts, the other methods, which we won't actually go into in much detail; but it is showing the efficacy of this rotation pretext task. And see how much better it is than starting with a random initialization of the weights: the difference between random initialization and pre-training with the rotation pretext task is huge, and this rotation pretext task is not equal to, but close to, pre-training on the entire labeled ImageNet. [00:21:37] So one of the things they looked into in this paper was the features, and how the learned features are meaningful.
[00:21:52] I mentioned earlier that one of the ways of evaluating pretext tasks, and self-supervised learning frameworks generally, is to look at the features, right? And you can always go from the features of the fully connected layers back to the image space; we talked about Grad-CAM and those other attention-based frameworks for how we can go from the features back to the image space. So this evaluation involves projecting the features into the image space and seeing what the model is looking at. If you look at the attention maps for the supervised model, it often has more focused maps, because it's only trying to solve the one single task of classification. So if it captures the eye and the shape around it, it doesn't care about the other parts very much.
[00:22:41] But in cases of self-supervised learning, often more areas are covered, because the model has to have a more holistic understanding of the image: we don't know what the downstream task is, and the goal is to perform equally well on many of them. So that's one of the tasks. If you do have any questions, keep them; I'll stop after going over some of the tasks, and if your question was not answered, then I would be happy to answer it. Okay. [00:23:16] Another popular pretext task was to create a 3x3 grid and then use networks to predict the location of each given patch with respect to the center patch. So for this patch, which is around here, the output should be three, because we only have eight positions: this is 3x3 and the center patch is the reference.
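The input to this relative-location task is a pair of patches cut from the 3x3 grid, and the label is which of the 8 non-center cells the second patch came from. A NumPy sketch of the patch extraction (the helper name is hypothetical):

```python
import numpy as np

def patch_pair(image, neighbor_index):
    """Split an image into a 3x3 grid and return (center patch, neighbor patch).
    neighbor_index in 0..7 numbers the 8 non-center cells row by row; that
    index is exactly the classification label for this pretext task."""
    h, w = image.shape[0] // 3, image.shape[1] // 3
    cells = [image[i*h:(i+1)*h, j*w:(j+1)*w]
             for i in range(3) for j in range(3)]
    neighbors = cells[:4] + cells[5:]   # drop the center cell (index 4)
    return cells[4], neighbors[neighbor_index]
```

Each unlabeled image thus yields eight (center, neighbor, label) training triples for free.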
[00:23:52] So this also turns out to be an eight-way classification task: it takes any of these patches, and it tries to output the location of that given patch with respect to the center patch. So this was another example, but a follow-up publication, which turned this into a jigsaw puzzle framework, instead of asking the model to just predict which of these eight positions a patch is in, tried to predict the exact permutation, the right permutation. What they did was use the same 3x3 grid, take all of the patches, shuffle them randomly, and then ask the neural network to say which one is the correct permutation. So they basically predict the correct permutation.
[00:25:05] Can you tell me what the number of permutations is for this setup? Say again? Nine factorial. Yes, exactly. So it's a huge number, right? 9! = 362,880. But what they did was create a lookup table with only 64 plausible permutations, so they only consider 64 permutations; when they shuffle, they do the shuffling based on one of these 64, and then the output will also be just a 64-sized vector. So again, this turns out to be just a simple classification task with 64 output classes, and they've shown this is also a great idea to define as a pretext task. On the same dataset, with a similar type of tasks that I talked about, and with the same way the supervision is done,
[00:26:17] They've shown their method was outperforming some of the previous models, previous frameworks, and again, remember that this was published in 2016. So the next pretext task is inpainting: predicting what is missing. What they did here was a simple masking strategy: you mask parts of the image and then you ask the model to inpaint those masked parts. How it was done: simple masking on the input image, but because we have all of the images, we actually have the desired output. So an encoder turns this into a feature space, there are some fully connected layers in the middle, and then there is a decoder that decodes the parts that are missing, and the loss function compares the output with what the ground truth was. And this is basically learning to reconstruct the missing pixels.
[00:27:34] Again, we've talked about autoencoders a couple of times before, and this is also some form of an autoencoder: it encodes the input image into a representation from which you want to decode the output. But this autoencoder is trained with a masking strategy, a masking objective. So, just to show you some examples: the inpainting evaluations are a little bit interesting and tricky, because when you want to inpaint this image, right?
[00:28:19] We can't say there is just one correct output here. Earlier reconstruction-based frameworks were actually creating a lot of fuzzy and very smooth outputs. And that's why the paper that I'm referring to here was actually using an additional adversarial objective function, which I'm not going to go into in detail, because this is a topic of discussion for the next lecture, generative models. But generally, how these frameworks work: we have a reconstruction loss, and the reconstruction loss is basically calculating the difference between the image x and the image after it's passed through the encoder.
[00:29:28] So this is element-wise multiplication, and we also have this mask here, because we want to calculate the loss function, the objective function, only on the masked area. So we do an element-wise multiplication with the mask as well. And this basically gives us the reconstruction loss for the part that was masked. So, as I said, it's also supplemented with an adversarial objective, an adversarial learning loss function, which ensures the images that are generated are real-looking, right? With that, they have been able to improve the reconstructed parts to look a little bit better, but again, details will be discussed in the next lecture.
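The masked reconstruction loss just described, an L2 loss restricted to the masked region via element-wise multiplication with the mask, can be sketched as follows. This is a minimal illustration with names of my own; the actual paper also adds the adversarial term mentioned above.

```python
import numpy as np

def masked_reconstruction_loss(x, x_hat, mask):
    """L2 loss computed only over the masked region.

    x      : ground-truth image, (H, W, C)
    x_hat  : network output,     (H, W, C)
    mask   : binary array, 1 where pixels were masked out, (H, W, 1)
    """
    diff = mask * (x - x_hat)               # element-wise multiplication
    return float(np.sum(diff ** 2) / max(mask.sum(), 1))
```

Only the masked pixels contribute; the unmasked region, which the network saw directly, is excluded from the objective.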
[00:30:35] So this reconstruction framework was again able to provide additional benefits when it's run on the same classification, detection, and segmentation tasks on the same set of datasets. I will come back to these reconstruction-based frameworks and masking in a bit, because it's one of the most used pretext tasks for pre-training these days. But before that, let me introduce this other pretext task of image colorization. This is another very simple framework setup: we take a colored image, because our dataset is mostly colored images, right? We turn that colored image into components, or channels, that separate the lightness, the illumination, from the color itself. There are several color spaces.
[00:31:49] If you've taken courses like computer graphics or CS131 or another computer vision class, you know that there are so many different color spaces. Mostly in computer vision we use RGB. But if you want to separate lightness, illumination, from color, there are some other color spaces. For example, the Lab color space, L-A-B, is one of those color spaces that separates lightness from color. So we have one channel for lightness and two channels for defining the actual color. And if we put all three channels, L, A, and B, together, we can actually get the colored image. So the pretext task here is simple: given the L channel, predict the A and B channels. Right? So again, we don't need to do any manual annotation. It's already in the data. And this was extended into other frameworks. Why should we only look at: given L, predict A and B?
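The L/ab separation described above uses the standard sRGB-to-Lab formulas; here is a self-contained sketch (libraries such as scikit-image ship an equivalent `rgb2lab`).

```python
import numpy as np

def rgb_to_lab(rgb):
    """Convert an (H, W, 3) sRGB image in [0, 1] to Lab (D65)."""
    # 1. undo the sRGB gamma
    lin = np.where(rgb <= 0.04045, rgb / 12.92,
                   ((rgb + 0.055) / 1.055) ** 2.4)
    # 2. linear RGB -> XYZ (D65)
    M = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    xyz = lin @ M.T
    # 3. normalize by the D65 white point and apply the Lab transfer
    xyz /= np.array([0.95047, 1.0, 1.08883])
    d = 6 / 29
    f = np.where(xyz > d ** 3, np.cbrt(xyz), xyz / (3 * d ** 2) + 4 / 29)
    L = 116 * f[..., 1] - 16            # lightness channel
    a = 500 * (f[..., 0] - f[..., 1])   # green-red axis
    b = 200 * (f[..., 1] - f[..., 2])   # blue-yellow axis
    return np.stack([L, a, b], axis=-1)

# pretext-task input/target split: given L, predict (a, b)
img = np.ones((4, 4, 3))                # a white image
lab = rgb_to_lab(img)
L_channel, ab_channels = lab[..., :1], lab[..., 1:]
```

The colorization network then takes `L_channel` as input and regresses (or classifies into bins of) `ab_channels`.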
[00:33:01] We can also do the reverse, right? And that led us to something that we call a split-brain autoencoder, where the input image is split, basically turned into the L channel, the lightness channel, and the color channels, these two images. This is one channel; this is two channels of color. And we train two functions, two neural networks, two sets of layers, each to predict the other one. And then at the end, in order to calculate the loss function and backprop, we just merge these two to generate the actual image, and an L2 loss, any distance function, can help with training this neural network. In a more generic framework or formulation, the idea is: given one channel or a set of channels, predict the others, and do the same for X2. So, sets of channels X1, sets of channels X2: given one, we can predict the other one.
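A toy numerical sketch of this split-brain idea, with single linear layers standing in for the two sub-networks; all names and dimensions here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy split-brain setup: X1 plays the lightness channel,
# X2 plays the color channels. Two stand-in "networks"
# (single linear layers) each predict the half they do not see.
n, d1, d2 = 8, 16, 32                   # samples, dim of X1, dim of X2
X1 = rng.normal(size=(n, d1))
X2 = rng.normal(size=(n, d2))
W12 = rng.normal(size=(d1, d2)) * 0.1   # f1: X1 -> X2_hat
W21 = rng.normal(size=(d2, d1)) * 0.1   # f2: X2 -> X1_hat

X2_hat = X1 @ W12
X1_hat = X2 @ W21

# merge the two predictions to reconstruct the full input,
# and train with a plain L2 loss on the merged result
full = np.concatenate([X1, X2], axis=1)
full_hat = np.concatenate([X1_hat, X2_hat], axis=1)
loss = np.mean((full - full_hat) ** 2)
```

For downstream use, the features of `f1` and `f2` are concatenated, which is the "concatenated features" evaluation mentioned a bit later.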
[00:34:10] And these are the neural networks for those. Merging them, we'll get the image, and then the loss function would be simple. So if we have such a framework, we can run it on everything, not just color and illumination, right? We can have data from some of these RGB-D sensors, those that have RGB channels and depth channels, like, for example, the Kinect and other sensors that they use in robotics. And given the RGB channels, predict the depth, and vice versa. And this was a very successful downstream task that was used for different applications. And as you can see, this model, the split-brain paper that I just mentioned, and the model that colorizes the images: those features themselves
[00:35:15] they do actually have a very good level of accuracy for predicting the class labels, and you can see there are many other frameworks used in these comparisons. Again, this is not as good as supervised learning, because there is no label involved here, and it's just based on the learned features, with concatenated features out of f_sub_1 and f_sub_2. Okay, so the image colorization pretext task was actually very interesting, because not only could we use it for pre-training neural networks, it was also itself useful somehow, because now we could colorize images that we don't have a colored version of. So we could colorize images, and then videos, that we don't have a colored version of.
[00:36:17] And not only that, one of the other interesting results that they've shown in the paper was this image of Yosemite and the Half Dome that they colorized. The interesting thing seen in this image is the consistency between the actual object, the Half Dome or the trees or the bridge, and its reflection in the water. So the model was also able to understand that this reflection should somehow preserve the color, based on how it was trained on vast amounts of data. Again, keep in mind that these models all predate large language models and large vision models, and they have been trained on specific tasks. So they're not trained for solving everything. So this could actually be extended into video settings, because now, if we have a video, we can have a reference frame that has the color and do the coloring for the follow-up frames.
[00:37:22] And how is this done? This is very simple, and it is also very useful, because by colorizing future frames in the video, what we are doing is basically trying to track pixels and objects in the video, and the model implicitly learns how these tracks should be formed. So the hypothesis is: learning to color video frames should allow a model to learn to track regions or objects without labels. And learning to color videos, because there are a lot of correspondences, is an interesting task by itself. I would suggest taking a look at the details; I'll talk about them very briefly.
[00:38:17] So if we have a reference frame, what we need to do for coloring the input frame is find pointers to where that specific object or pixel is, and then, based on that, see what the color is and copy it as the color for that pixel in the output, as the target color. And how this is done is very similar to the same topic of attention that we talked about. So it's about forming attention, for each of the pixels, between the reference frame and the target frame. We often run a CNN to see what features around those pixels should be used. And using those features, we can now calculate, for each of the target pixels, the attention, or the distance, to all of the pixels in the reference frame.
[00:39:18] And then, after defining this attention between the pixel of interest in the target frame and all of the pixels in the reference frame, we can take an average color based on those attention weights. So attention is basically just similarity between the two. So anyway, with that, what we can do is get the output color as an average under that attention, and then ultimately calculate the loss function, because we have the values of the right colors of those pixels in our data. And this was able, with the reference frame, to colorize the images. You see how consistent the coloring becomes; if we color the frames separately, there is no consistency over time.
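The attention-weighted color copy just described can be sketched as follows. This is a minimal illustration; in the actual work the similarities come from learned CNN features over video frames, and the whole pipeline is trained end to end.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def colorize_by_attention(ref_feats, ref_colors, tgt_feats):
    """Copy colors from a reference frame via feature similarity.

    ref_feats  : (N, D) features of the N reference pixels
    ref_colors : (N, C) known colors of the reference pixels
    tgt_feats  : (M, D) features of the M target pixels
    returns    : (M, C) predicted colors, one per target pixel
    """
    # attention = softmax over similarities to every reference pixel
    attn = softmax(tgt_feats @ ref_feats.T, axis=1)   # (M, N)
    # each target color is an attention-weighted average
    # of the reference colors
    return attn @ ref_colors
```

The training loss then compares these predicted colors to the true colors of the target frame, which are free labels taken from the video itself.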
[00:40:25] You often see, for example, a person's shirt or clothing change color, because there's no constraint to keep it consistent. And then there have also been very interesting applications, because now that you're calculating attention to a reference frame, you're actually able to track objects, track segments in videos, and even identify keypoints in the videos. That's a good question. Your question is about this slide, basically: how does the encoder know about the data to begin with and give us good learned representations? So all of these tasks that I presented and defined are trying to do something here, either decoding, classifying, or using regression to generate some outputs, in order to be able to train this encoder.
[00:41:36] So if your original images, if these are all natural images taken off the internet or ImageNet or whatever, then you are learning an encoder that can extract features from those types of images, right? With the pretext task. And then, when you remove the decoder and add this classifier to the end, you only need to train this part, because this encoder was already trained with all of these pre-training tasks that I just talked about. You're asking if the labels are coming from the decoder for pre-training the encoder. The answer to that is yes. That's why we define the pretext tasks: because we want to have some sort of labels, some outputs, right? And then, based on those outputs, we try to train this entire network, and along the way of predicting the right labels, this encoder is also trained. Good question.
[00:42:34] You're asking if the encoder and decoder are one big neural network, or whether this differs across different papers, different works. It has been completely different. In some cases it's not really a decoder; that's why I'm calling it a classifier in the example I showed you about predicting the rotation degree. That is just one simple neural network, right? The "decoder" is these FC layers. So this could be one entire network, and then you replace part of it with something for your downstream task. But in some cases, for example when I talked about autoencoding, encoding an image and then decoding another image, you often have two neural networks that are trained end to end, because you want to make use of that representation space in the middle.
[00:43:25] And in the next thing that I want to talk about, masked autoencoders, there is not even symmetry between the encoder and the decoder. They can be two different frameworks, two different neural networks, without any symmetry, to train the task. So this is very much task-dependent, pretext-task-dependent, but they could belong to the same architecture family that we know about, say a CNN or ResNet, or they could be two different architectures, even without any symmetry. Remember that these are the very first methods for self-supervised learning, so they're not supposed to solve everything.
[00:44:15] That's just a quick disclaimer, but the idea, the hypothesis here, is: if the model is able to say this image is rotated 90 degrees, it means that implicitly it understands the right orientation and direction, right? And then, if given an unrotated image, it will be able to recognize what is in it, right? But this is a limited task by itself; I agree with that. The question that you have is why they use 64 here, right? That's a good question, but it's an almost arbitrary choice. As I said, there are many different permutations here, 9 factorial, so it's a very big number. It doesn't make sense for us to be predicting all of those.
[00:45:15] What the authors did here: they decided to select a few of those permutations such that there is enough variation, because many of those permutations differ by just one switched patch, right? So they selected 64 of those that have the largest differences between them, and they selected just 64 because they wanted to solve a classification problem instead of other types of tasks. Okay. So I've been talking about these frameworks that often apply some transformation to the image or the videos and so on.
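The selection of 64 well-separated permutations mentioned above can be approximated with a greedy maximal-Hamming-distance pass. This is a sketch under my own simplifications (a randomly sampled candidate pool rather than all 9! permutations):

```python
import numpy as np

def select_diverse_permutations(k, n=9, seed=0):
    """Greedily pick k permutations of n items that are far apart
    in Hamming distance; a sketch of the jigsaw lookup-table idea."""
    rng = np.random.default_rng(seed)
    # candidate pool (a random sample keeps this sketch fast)
    pool = np.array([rng.permutation(n) for _ in range(2000)])
    chosen = [pool[0]]
    for _ in range(k - 1):
        # Hamming distance from every candidate to its nearest chosen perm
        dists = np.stack([(pool != c).sum(axis=1) for c in chosen])
        nearest = dists.min(axis=0)
        # add the candidate farthest from everything already chosen
        chosen.append(pool[int(nearest.argmax())])
    return np.array(chosen)

perms = select_diverse_permutations(64)   # the 64-entry lookup table
```

Each training example shuffles the 3x3 patches by one of these 64 permutations, and the network predicts its index, a 64-way classification.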
[00:45:59] And this brings us to a newer framework, published in 2021, which has had many follow-ups and has been a great framework for pre-training for many tasks. Even today, when we want to pre-train on a raw, unlabeled dataset, we often use this MAE framework, masked autoencoders. It is also a reconstruction-based framework, similar to the masking-and-inpainting strategy I mentioned, but far more elaborate. As you can see, this framework does not select just one mask: many different patches and locations are masked, with even more aggressive sampling, 50% or 75% masking ratios.
[00:46:59] Through training at large scale, they have shown not only that they can reconstruct all of those masked areas, but also that they get very good encoders that summarize images into good features. This was done by defining an encoder and a decoder, and it is one of the examples I mentioned where the encoder and decoder are not symmetric. A large portion of the input patches are masked, and the patches that are not masked are given to the encoder, which encodes them into features; those features are then passed through the decoder to generate the complete image. But let's go a little bit into the details of what this means. I have some details on how these models are trained here, but let me very briefly explain how they typically work.
[00:48:13] The encoder here is very similar to ViT; all of these models are based on transformers. As with the ViTs we've talked about, the images are split into patches, and the patches are then sampled. They used uniform sampling and showed that a 75% masking ratio was quite effective in their experiments. This high masking ratio makes the prediction task very challenging, and in the context of pretext tasks for self-supervised learning, challenging means the task is meaningful: the model has to learn good features to be able to reconstruct the image. With such a high masking ratio, they can also augment the data a great deal, because each time you mask a different 75% of the patches.
[00:49:22] So you can reuse the same image many times during training, which gives you a lot of data to train this encoder with, and that's why they use a huge encoder, a large ViT. The encoder itself only sees 25% of the patches. It embeds those with a first linear projection into an embedding space, then positional embeddings are added, exactly as we described for ViTs, and all of these are transformer blocks. The encoder is very large, as I just mentioned. Then comes the decoding part. We have the embeddings of all the patches that were present, but for the patches that were masked, those embeddings are missing.
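The masking step just described, keeping a random 25% of the patches and dropping the rest before the encoder, can be sketched in a few lines. This assumes patches have already been flattened to vectors; it is a NumPy sketch of the idea, not the reference implementation:

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Per-sample random masking, MAE-style.

    `patches` has shape (num_patches, dim). A random subset of
    (1 - mask_ratio) patches is kept; the rest are dropped before the
    encoder, which is what makes the encoder cheap to run despite
    being large.
    """
    n, d = patches.shape
    n_keep = int(n * (1.0 - mask_ratio))
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)           # random shuffle of patch indices
    keep_idx = np.sort(perm[:n_keep])   # indices of visible patches
    mask = np.ones(n, dtype=bool)       # True = masked / hidden
    mask[keep_idx] = False
    return patches[keep_idx], keep_idx, mask

# e.g. a 14x14 grid of 196 patches: the encoder sees only 49 of them.
```

Resampling the mask on every epoch is what lets the same image be reused many times, as mentioned above.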
[00:50:27] There is a trainable parameter, very much like the class token we had before: a shared mask token, which you can loosely think of as an average patch representation, that is put in place of the patches that are missing or masked. The decoder then has to transform these tokens into the image patches of the entire image, and the entire image is the output target. How do we train this? With a simple mean-squared-error (MSE) loss between the image and the reconstructed image. The loss function is only computed over the masked patches, similar to the previous approach I just talked about.
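Putting the two pieces together, the shared mask token and the masked-only MSE loss might look like the sketch below. It assumes tokens and patches are plain vectors; the real model, of course, uses learned transformer weights around these steps:

```python
import numpy as np

def assemble_decoder_input(encoded, keep_idx, n_patches, mask_token):
    """Rebuild the full-length token sequence for the decoder.

    Encoder outputs go back to their original positions, and every
    masked position is filled with the single shared (learnable)
    mask token.
    """
    tokens = np.tile(mask_token, (n_patches, 1))  # start with all mask tokens
    tokens[keep_idx] = encoded                    # restore visible patches
    return tokens

def masked_mse(pred, target, mask):
    """MSE between reconstructed and original patches, averaged only
    over the masked positions, as described above."""
    per_patch = ((pred - target) ** 2).mean(axis=1)  # MSE per patch
    return per_patch[mask].mean()                    # masked patches only
```

Excluding the visible patches from the loss keeps the training signal focused on what the model actually had to predict.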
[00:51:32] When it comes to using the model for downstream tasks, they've shown in the paper that you can do either linear probing or full fine-tuning for whatever application you have in mind. In linear probing, you typically keep your encoder frozen, use the learned representations, and only learn a linear function for the final task; the mark in the figure indicates the part being trained. In full fine-tuning, the pre-trained encoder is also fine-tuned, either all of it or just a few transformer blocks. Linear probing provides a measure of representation quality, of how good the learned features are, while fine-tuning exploits the model's full potential to adapt to new tasks. Okay.
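As a concrete toy illustration of linear probing: the "encoder" features below are precomputed and frozen, and only a softmax classifier is fit on top with plain gradient descent. This is an illustrative sketch, not the evaluation protocol from the paper:

```python
import numpy as np

def linear_probe(features, labels, n_classes, lr=0.1, steps=200, seed=0):
    """Fit only a linear softmax classifier on frozen features.

    The encoder never appears here: linear probing touches nothing
    but this final linear layer (W, b), which is what makes it a
    measure of raw representation quality.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    W = rng.normal(scale=0.01, size=(d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n          # softmax cross-entropy gradient
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b
```

Full fine-tuning would instead also update the encoder weights that produced `features`, which this sketch deliberately leaves out.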
[00:52:42] If you're interested in this topic and you're planning to use this method, I highly advise looking at the paper and its follow-ups. There are many discussions around different aspects, model choices, and hyperparameters. Masking ratio is one: they've shown that 75% actually gives very high accuracy, which is why it was chosen. Others include decoder depth, decoder width, mask tokens, reconstruction targets, data augmentation and how it helps, and the mask sampling method. I'm showing the results here: the mask sampling question is mostly about whether to use random masking, block masking, or grid-style masking. You can see the examples here, and they came to the conclusion that random masking was the best choice.
[00:53:48] Finally, they were able to show that MAE does a much better job than many of the other methods in use at the time. Some of the other state-of-the-art methods were DINO and MoCo v3; if we have time, I'll briefly go over them. But this framework outperformed those more advanced contrastive learning frameworks of the time. I'll stop for a few questions if you have any, but first let me summarize what we've talked about.
[00:54:31] Pretext tasks are very important, and as I said, their focus is on understanding visual common sense. One thing, related to some of the questions that were asked, is that coming up with an individual pretext task is often challenging, because the learned representations may not be general enough, given the specific type of task you define. For example, if you're using completion, rotation prediction, jigsaw puzzles, or colorization, the learned representations are good for solving those specific tasks, but they may not be very good as general-purpose features.
[00:55:23] So the question is: in the split-brain autoencoder, how does the model know how to predict the other channel given, for example, the L channel, the lightness channel? Let me answer your question with a question. When you're training a model to predict the classes of objects in an image, how does the encoder know what features to extract to predict the class? Through labeled data: what you're doing is back-propagating a loss value that is calculated against those labels. It's the same story here. We define a network that takes one of the channels and outputs the other channel, and it is trained by back-propagating against what the output should be; the target is the other channel, which we do have in the data.
[00:56:24] So instead of defining the task as classification, predicting the class of the objects, here we define the task as predicting the color of the pixels, and the colors of the pixels are already in the dataset, so the loss function can still be calculated and back-propagated. The next question is how these outputs are used as input to the decoder. This is again a ViT-style transformer framework: the encoder turns every input patch into a token at the output, which is the representation of that specific input patch. We've talked about this, but we know this is not the list of all patches; some of the patches are masked. For those that are masked, we also train a shared mask token that the encoder outputs, a token that is basically something like an average token. It's a learnable parameter.
[00:57:28] We can't necessarily interpret it, but we can say it's probably something like an average token standing in for the mask. That shared mask token is put in the place of the missing patches, and then this long sequence is created. The decoder, another transformer, takes this long set of tokens and outputs those that are projected as the output pixel values. Perfect. So, we only have 15 minutes and a lot of things to cover. But what I wanted you to get out of this session was to understand what pretext tasks are and how we define them, and that one of the most widely used frameworks right now is the masked autoencoder, which we covered to a good extent.
[00:58:42] Anyway, we did look at these transformations, and we know that all of these transformations represent the same object as the original image, just in a different form. But we also know that the dataset contains other objects that look completely different. So suppose I define a task that says: for those that belong to the same object, try to bring them close in the representation space, basically attract them to each other; and for those that do not belong to the same object, try to maximize the distance between them in the latent space, basically repel their representations. This is another kind of task, often referred to as contrastive learning, or contrastive representation learning.
[00:59:50] And there are quite a number of very interesting methods to look at. We have sampled a few; there are so many papers in this space, especially from around 2018 to 2020: SimCLR, MoCo, CPC, and ultimately DINO, which borrows concepts from contrastive learning but is not strictly a contrastive learning framework. What we do, in order to define the attract and repel behavior and regularize the model based on it, is define the reference image as x, all transformations of the same image as positive samples, and all other objects in the dataset, or in the batch, as negative samples. These positive and negative samples give us a way to calculate the loss function. How can we do that?
[01:01:03] Assume we have a scoring function. We want a scoring function such that the score between the encoded features of the reference image and the features of a positive sample is larger than the score between the reference image and the negative samples. With this type of scoring function, we can define a loss function based on it; the scoring function s here is the same as the score in the previous slide.
[01:01:50] So if we have that scoring function, then in order to attract and repel we can use a softmax setup: the exp turns those scores into probability values, and in the denominator you see all of the negative samples being considered. To implement this in practice, we use mini-batch training: all of the samples that belong to other objects in the batch are taken as negative samples, and one of the transformations of the same image is the positive sample. So we define the loss function like this: the score for the positive pair over the scores for all of the negative pairs. And this function is very similar to something we've discussed before. Any ideas? This is the cross-entropy for multiple classes.
[01:03:01] In this case we have N samples. The softmax, if you have multiple classes, say 10 classes at the output, wants to maximize the score for one of those 10 and minimize it for the rest. It's the same story here: we want to maximize the score between the reference and the positive, and minimize the score between the reference and the negatives. So it's the same concept we discussed for multiclass classification, put into the formulation of a contrastive loss for contrastive learning.
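A minimal NumPy sketch of this loss for a single reference sample, using cosine similarity as the scoring function (an assumption; the lecture leaves s abstract) and a temperature value that is an arbitrary choice here:

```python
import numpy as np

def info_nce(z_ref, z_pos, z_negs, temperature=0.1):
    """Contrastive loss for one reference embedding.

    z_ref: (d,) reference; z_pos: (d,) positive; z_negs: (k, d)
    negatives. The loss is cross-entropy with the positive treated
    as the correct "class", exactly the analogy drawn above.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    scores = np.array([cos(z_ref, z_pos)] +
                      [cos(z_ref, zn) for zn in z_negs]) / temperature
    scores = scores - scores.max()                  # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum())
    return -log_softmax[0]                          # -log p(positive)
```

The loss goes to zero only when the positive's score dominates every negative's, which is the attract-and-repel behavior in one formula.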
[01:03:44] This function is called InfoNCE, the information noise-contrastive estimation loss, which was proposed in this paper, and there are a lot of theoretical discussions in the paper showing that this objective is a lower bound on mutual information. Mutual information between two images basically measures the dependencies, the shared information, between them. What we want is to maximize the shared information between x and x⁺ while minimizing the shared information between x and the x⁻'s. Going through the paper's argument would itself take half an hour.
[01:04:43] So you should definitely take a look at the paper if you're interested. The result is that the negative of this InfoNCE loss is a lower bound on the mutual information between x and x⁺. If its negative is a lower bound on the mutual information, then by minimizing InfoNCE I am maximizing that lower bound, pushing up the mutual information between x and x⁺, which is exactly what I want. That's why we take this as the loss function and minimize it. There is another theoretical result in the InfoNCE paper: the larger the number of negative samples, the tighter the bound. That's why, to train a neural network with this type of loss function, we need a huge batch size.
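For reference, the bound discussed here (from the InfoNCE paper, van den Oord et al. 2018) is usually written as below, where $N$ counts the samples the positive is contrasted against. This is the standard statement quoted from memory, so check the paper for the exact conditions:

$$ I(x;\, x^{+}) \;\ge\; \log N \;-\; \mathcal{L}_{\mathrm{InfoNCE}} $$

This makes both remarks above precise: minimizing the loss raises the lower bound, and increasing $N$ increases the $\log N$ term, which is the batch-size effect just mentioned.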
[01:05:51] If we compute a larger number of negative samples, we'll get better and much faster training convergence. This loss function was then used in a number of different frameworks, and in the next few minutes I'm just going to tell you what those frameworks are. For example, SimCLR is a simple framework for contrastive learning. It basically takes each image, applies two transformations of the same image, transfers them into the representation space, and calculates the cosine similarity between the embeddings, the representations. But before doing that, it does a linear or nonlinear projection into a set of features Z, and it calculates the distance between those in this space.
[01:06:55] And this is the way that they generate the positive samples; for generating positive samples, all sorts of transformations would make sense. The details are basically covered here: generate a positive pair by sampling data augmentation functions. So we sample a few of those, then we calculate the InfoNCE loss on the pairs, and this is what we iterate, because each of the N samples gives us two augmentations, so 2N samples in total. So what happens is we take the list of images in the mini-batch and pass them through the encoder for both variations of the same image. So each of the images will basically have a transformed version of it next to it, and we now have 2N samples in the batch.
[01:08:06] And then this means that for each of the samples, the one next to it is the positive sample and everything else is negative. So for the first one, the second image is positive and everything else is negative; for the second one, the first is positive and everything else is negative. And this repeats for all of the samples there. So this is a high-level definition of SimCLR. Please note that in assignment three we have a question related to SimCLR, where you will be exploring this framework a little bit more, but be careful: the definition there is slightly different from the standard definition that I presented here, so make sure you follow the instructions in the assignment. So SimCLR was actually very successful: without the use of labels, just by training a linear classifier on top of the features.
[01:09:13] It was able to surpass all of the previous works and even generate results comparable to the fully supervised learning frameworks. Although we need a larger neural network, because now we are learning more generic features, in terms of accuracy it was comparable to what we had for supervised learning. So the interesting thing with SimCLR, and some of the results around it, was that there are a few choices; actually, let me spend time on the main choices. You may have this question of why we projected the features into a new variable instead of using the same representations.
[01:10:09] So this was a design choice that they made in SimCLR, because they assumed, rightly so, that when we have an objective function that does this contrast between samples, you often lose some extra information that does not help with the contrastive learning framework. Right? So in order to preserve all of those extra features, the representations are defined as h, but then there is this linear or nonlinear projection (in their paper they use a nonlinear projection) to get the z values that they can calculate the InfoNCE loss on. So that's one important design choice, and the other one is what I mentioned earlier: large batch sizes. You need huge batch sizes, larger batch sizes, to be able to get better SimCLR performance, and we talked about how and why this is the case.
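That design choice can be sketched in a few lines. Shapes and weights here are made up for illustration; the point is that the encoder output h is what you keep for downstream tasks, while a small SimCLR-style MLP head g maps h to z, and the contrastive loss is computed on z only:

```python
import numpy as np

rng = np.random.default_rng(1)
d_h, d_z = 32, 16                        # sizes are illustrative
W1 = rng.normal(scale=0.1, size=(d_h, d_h))
W2 = rng.normal(scale=0.1, size=(d_h, d_z))

def projection_head(h):
    """g(h): 2-layer MLP with ReLU, L2-normalized so that cosine similarity
    becomes a plain dot product. Only z = g(h) enters the InfoNCE loss."""
    z = np.maximum(h @ W1, 0.0) @ W2
    return z / np.linalg.norm(z)

h = rng.normal(size=d_h)                 # encoder output for one augmented view
z = projection_head(h)
# After pretraining, g is discarded; the linear classifier is trained on h,
# which keeps information the contrastive objective would strip out of z.
```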
[01:11:11] But we can't always do large batch sizes for many of the tasks that we have at hand, because of constraints in memory and so on. And that was why a number of follow-ups were proposed, for example MoCo, momentum contrastive learning. Instead of using all of the negative samples in the batch, what it does is create a queue, and it keeps a history of the negative samples across batches over time in the model. So it doesn't only depend on the negative samples in the batch; it has a separate queue that keeps a number of negative samples and updates it over time, to compute the InfoNCE loss, the contrastive loss, here. But because we have this queue, we cannot backpropagate, because those samples are not in the batch anymore. Right? So we cannot backpropagate for the negative samples.
[01:12:28] And that's why it had to separate the encoder for the positive samples, which are now called the query, and the negative samples, which are now called the keys, in this architecture. So the training only updates the query encoder, and over time the query encoder, using a momentum m, updates the key encoder, the momentum encoder. Right? So this is a framework that has actually been very successful in terms of implementation and follow-up versions, and there are a lot of interesting results. What they then did was basically try hybrid versions: using the nonlinear projection heads and data augmentation from SimCLR, and using this decoupling of the mini-batch and negative samples from MoCo. And they've shown that if you do this together, in MoCo version 2, it improves the performance by a lot.
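A toy sketch of the MoCo bookkeeping described above. The dimensions, queue size, and the linear "encoders" are stand-ins (the real method uses full networks and a far larger queue); what it shows is that keys come from the momentum encoder and go into a FIFO queue, while only the query encoder gets gradient updates:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(2)
d, K = 8, 32                        # embedding dim and queue size (toy values)
queue = deque(maxlen=K)             # FIFO dictionary of negative keys

theta_q = rng.normal(size=(d, d))   # query encoder params (trained by backprop)
theta_k = theta_q.copy()            # key (momentum) encoder params, no gradients
m = 0.999                           # momentum coefficient

for step in range(4):
    batch = rng.normal(size=(8, d))             # current mini-batch
    keys = batch @ theta_k                      # encode keys with momentum encoder
    queue.extend(keys)                          # enqueue; deque evicts the oldest
    # ... InfoNCE of (batch @ theta_q) against `queue` would be computed here ...
    theta_q += 0.01 * rng.normal(size=(d, d))   # stand-in for a gradient step
    theta_k = m * theta_k + (1 - m) * theta_q   # momentum update of key encoder
```

Because the queue decouples the number of negatives from the batch size, the batch can stay small while the dictionary of negatives stays large.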
[01:13:45] So I will stop here, but there were some notions of CPC, contrastive predictive coding, as another example that you can look at in the slides, and then a better version of MoCo, MoCo version 3. DINO is also one of the widely used frameworks, which actually has a similar type of architecture to MoCo, but it's not necessarily contrastive learning, because now we have student and teacher networks. So I'll leave that for a separate discussion, and if you're interested we can maybe discuss it in future lectures. But anyway, this is also one of the widely used frameworks for extracting features from images, and sometimes videos as well.

================================================================================
LECTURE 013
================================================================================
Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 13: Generative Models 1
Source: https://www.youtube.com/watch?v=zbHXQRUNlH0
---
Transcript

[00:00:05] Welcome back to CS231N, lecture 13.
[00:00:10] Today we're going to talk about generative models. Last time we were talking about self-supervised learning, which is this really interesting paradigm where we want to somehow learn structure directly from data, with no supervision, with no labels. And the typical formulation of self-supervised learning, for which we talked about a bunch of examples last time, is that you have your big dataset with no labels. Ideally it's just images; this is great, you can get a lot of images. You're going to feed these through some kind of encoder that's going to extract a feature representation from your images, and then go through some decoder that will predict something from that feature representation.
[00:00:44] And the whole trick in self-supervised learning is coming up with some kind of pretext task that you can train this whole system on without requiring any kind of human annotation or human labels. So we talked about things like rotation, different kinds of tasks that we can use as pretexts to formulate these self-supervised learning objectives. And then typically this is a two-stage procedure, where first you're going to learn this self-supervised encoder-decoder on your self-supervised task, on all the data that you can find. After that, you're going to throw away the decoder, slot in some new, possibly tiny, fully connected network, and actually train this thing, maybe end to end, or maybe just learn the fully connected network at the end, on some small labeled task.
[00:01:25] And the idea here is that via self-supervised learning, this pretext task, you can train on lots and lots of data, millions, hundreds of millions, billions of samples, where we don't have access to high-quality human labels. In the process of self-supervised learning, it's going to learn something about the general structure of images, or of data, and then you can transfer that knowledge to downstream tasks where you have small amounts of human labels. So the typical setup you should keep in your mind, that we want to work towards in self-supervised learning, is that you're going to train on like a billion unlabeled images that we're getting from the internet somewhere.
[00:01:55] And then we're going to transfer those features to tasks where we're willing to sit down and label maybe tens, hundreds, maybe thousands of examples for particular tasks that we really care about. But we want those tasks to be improved by this generic knowledge that we've learned through this self-supervised pretext task. And we talked about a couple of different kinds of pretext tasks last time, including rotation, rearrangement, and reconstruction. All of these basically have this sense that you're making some geometric perturbation, a geometric disturbance, to the input pixels, and then you're asking the model to somehow recover from that perturbation. So in the case of rotation, maybe you rotate the image and ask the model to predict how much it was rotated.
[00:02:35] In the case of rearrangement, or solving jigsaw puzzles, you're going to cut the image up into patches and ask the model to try to predict the relative arrangement of those patches in the original image. Or in reconstruction, maybe you're going to delete some parts of the input image and then ask the model to fill them in, as some kind of inpainting or reconstruction task. And these are fairly successful. We also talked last time about a different formulation of self-supervised learning called contrastive learning, which has been very successful. And here I was told that you ran out of time a little bit to cover a couple of these later methods, so I wanted to just go over those really quickly at the beginning of today's lecture instead.
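The rotation pretext just described fits in a few lines, since the label comes for free from the transformation itself. Array shapes and names here are illustrative:

```python
import numpy as np

def make_rotation_example(img, rng):
    """Rotate by a random multiple of 90 degrees; the rotation index k in
    {0, 1, 2, 3} is the classification label -- no human annotation needed."""
    k = int(rng.integers(0, 4))
    return np.rot90(img, k), k

rng = np.random.default_rng(3)
img = rng.normal(size=(32, 32, 3))      # a stand-in RGB image
rotated, label = make_rotation_example(img, rng)
# A model trained to predict `label` from `rotated` has to pick up on object
# orientation, which is the free learning signal this pretext task exploits.
```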
[00:03:09] So really the idea of contrastive learning is that you're going to get pairs that are similar and pairs that are dissimilar, and you want to pull the similar pairs together and push the dissimilar pairs apart. And the way that you usually do this in the context of self-supervised learning is that you're going to start with your input images, and again these are unlabeled images, you don't have labels for them. Now, for each input image, you're going to apply two random transformations. So in the case of the cat, we sort of took one crop around the cat's face and another crop around the backside of the cat, and for the monkey, we took one around the monkey's face and also dropped it to black and white, etc.
[00:03:46] So basically, for each one of your input images, you're going to apply two (possibly more than two, but two is a nice minimal subset) random perturbations to your input image. Now you're going to feed all of those randomly perturbed versions of your input data to some kind of feature extractor, which could be a ViT, could be a CNN, any kind of neural network that can input an image and output a feature representation. Then you want to apply this notion of contrast: for each of the two augmentations that came from the cat, we want those two feature vectors to be the same, so we color them green. So basically you compute this big similarity matrix, well, I guess it's (2N) squared, so 4N² entries, if you have N images and you put two perturbations on each.
[00:04:32] So we have a giant 2N-by-2N matrix for all these perturbed, augmented samples that we got. And now basically we want to pull together the two augmentations that came from the same original image, and for every pair of augmentations that came from different original images, we want to push them apart. So you run all of these things through your feature extractor, compute this giant 4N²-entry matrix of all of your scalar similarities between those feature vectors, and then pull together the ones that are similar and push apart the ones that ought to be different. And that's the basic idea of contrastive learning.
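A compact NumPy sketch of that 2N-by-2N computation, in the normalized-temperature cross-entropy form SimCLR uses. It assumes views are stacked so that rows i and i+N are the two augmentations of image i; names, temperature, and toy data are mine:

```python
import numpy as np

def nt_xent(z, tau=0.5):
    """Contrastive loss over 2N L2-normalized embeddings: each row's 'correct
    class' is its partner augmentation; the diagonal (self) is masked out."""
    two_n = z.shape[0]
    n = two_n // 2
    sim = z @ z.T / tau                                 # (2N, 2N) similarities
    np.fill_diagonal(sim, -np.inf)                      # never contrast with self
    partner = np.concatenate([np.arange(n, two_n), np.arange(n)])
    log_prob = sim - sim.max(axis=1, keepdims=True)
    log_prob -= np.log(np.exp(log_prob).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(two_n), partner].mean()  # cross-entropy vs partner

rng = np.random.default_rng(4)
x = rng.normal(size=(4, 8))                             # 4 "images"
unit = lambda a: a / np.linalg.norm(a, axis=1, keepdims=True)
z1 = unit(x + 0.1 * rng.normal(size=x.shape))           # first augmented view
z2 = unit(x + 0.1 * rng.normal(size=x.shape))           # second augmented view
loss = nt_xent(np.vstack([z1, z2]))
```

Each row of `sim` acts as one softmax classification problem: pull the partner's similarity up, push the other 2N-2 entries down.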
[00:05:15] And one paper that really pulled all this together a couple of years ago was called SimCLR, which applied this very successfully to self-supervised representation learning on images; that's the one I think he walked through last time. But one kind of problem with the SimCLR setup is that it requires a fairly large batch size to actually get good convergence, because otherwise it's sort of too easy a problem for the network: if there aren't that many samples, it's too easy to pick out the two cat ones that looked similar. So to make the problem hard enough for the network, to give it a good enough learning signal, you tend to need quite a large batch size in order to get this model to converge to good features.
[00:05:48] And then once you do that, you need to rope in all the ideas around large-scale distributed training that we talked about a couple of lectures ago, which is totally feasible; it totally works. But you might ask: is there some way you can get away without that? And that leads to a couple of approaches that I don't want to go into in too much detail. I actually don't want to walk through these and tell you exactly how they work; I just want to make you aware of their existence and give you the general flavor of what they're trying to achieve. So in this MoCo, or momentum contrast, approach to self-supervised learning, the setup is very similar to what we just saw in SimCLR: you're taking data, you're getting augmented pairs, you run them through a feature encoder.
You want to pull together the ones that are similar, push [00:06:24] together the ones that are similar, push apart the ones that are dissimilar. But [00:06:26] apart the ones that are dissimilar. But the thing that with the thing that [00:06:27] the thing that with the thing that differs is that we want to get away with [00:06:29] differs is that we want to get away with not having to have a gigantic batch size [00:06:31] not having to have a gigantic batch size at every iteration. So to do that they [00:06:33] at every iteration. So to do that they keep a queue of um negatives. They keep [00:06:36] keep a queue of um negatives. They keep a queue of samples from previous [00:06:39] a queue of samples from previous iterations of training. Um and then at [00:06:41] iterations of training. Um and then at every training iteration I've got my my [00:06:43] every training iteration I've got my my X query is my current new batch of data. [00:06:46] X query is my current new batch of data. And I have this this Q um X0 X1 X2 key [00:06:49] And I have this this Q um X0 X1 X2 key which are previous batches of data that [00:06:51] which are previous batches of data that I've seen on previous iterations of [00:06:53] I've seen on previous iterations of training. Now, my current batch of data [00:06:56] training. 
Now, my current batch of data I'm going to run through my encoder [00:06:57] I'm going to run through my encoder network the same as I always did um and [00:06:59] network the same as I always did um and compute these sort of compute the [00:07:01] compute these sort of compute the contrast of loss the same way that we [00:07:02] contrast of loss the same way that we did with SIM clear the and then these uh [00:07:04] did with SIM clear the and then these uh these this larger Q these like previous [00:07:07] these this larger Q these like previous history of batches we're going to run [00:07:08] history of batches we're going to run through something different the momentum [00:07:10] through something different the momentum encoder um and then still get feature [00:07:12] encoder um and then still get feature representations and compute the same [00:07:14] representations and compute the same kind of similarity that we did through [00:07:16] kind of similarity that we did through the through through the SIM clear uh [00:07:17] the through through the SIM clear uh thing but the problem is that we don't [00:07:19] thing but the problem is that we don't want to back propagate into the momentum [00:07:21] want to back propagate into the momentum encoder because it has too much data it [00:07:23] encoder because it has too much data it too big of a batch. We can't afford to [00:07:25] too big of a batch. We can't afford to fit that in GPU memory. Um, so we want [00:07:27] fit that in GPU memory. Um, so we want to not have to back propagate through [00:07:28] to not have to back propagate through that part. So instead, so that means [00:07:30] that part. So instead, so that means that we're not we cannot upgrade update [00:07:32] that we're not we cannot upgrade update this momentum encoding encoder, this [00:07:34] this momentum encoding encoder, this second encoder via gradient descent [00:07:36] second encoder via gradient descent descent. 
Instead, we're going to do [00:07:37] descent. Instead, we're going to do something kind of wacky. What we're [00:07:39] something kind of wacky. What we're going to do is have this momentum [00:07:40] going to do is have this momentum encoder have its own set of weights. [00:07:42] encoder have its own set of weights. We're going to learn them not via [00:07:43] We're going to learn them not via gradient descent. Instead, what we're [00:07:45] gradient descent. Instead, what we're going to do is have the momentum encoder [00:07:46] going to do is have the momentum encoder be a exponential moving average of the [00:07:48] be a exponential moving average of the weights of the normal encoder. So the [00:07:50] weights of the normal encoder. So the normal encoder, we're going to learn via [00:07:52] normal encoder, we're going to learn via gradient descent. everything is normal. [00:07:53] gradient descent. everything is normal. Um, we'll forward prop, we'll back prop, [00:07:55] Um, we'll forward prop, we'll back prop, we'll get gradients, we'll make a [00:07:56] we'll get gradients, we'll make a gradient update step on the in on the on [00:07:58] gradient update step on the in on the on the typical encoder. That's the normal [00:08:00] the typical encoder. That's the normal thing. But then after we do that, the [00:08:02] thing. But then after we do that, the momentum encoder, we're going to take [00:08:04] momentum encoder, we're going to take we're going to decay the encoder [00:08:05] we're going to decay the encoder weights. We're going to decay the [00:08:07] weights. We're going to decay the current momentum encoder weights by [00:08:09] current momentum encoder weights by like.99 um, and then add in 1% of the [00:08:12] like.99 um, and then add in 1% of the encoder weights. So then the momentum [00:08:14] encoder weights. 
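That update rule is tiny in code. A minimal sketch, where the 0.99 momentum value and the function name `ema_update` are illustrative:

```python
import numpy as np

def ema_update(momentum_weights, encoder_weights, m=0.99):
    """MoCo-style momentum encoder update: no gradients flow here.

    After each gradient step on the regular encoder, the momentum
    encoder's weights are nudged toward it:
        theta_momentum <- m * theta_momentum + (1 - m) * theta_encoder
    """
    return m * momentum_weights + (1.0 - m) * encoder_weights
```

Applied after every gradient step, the momentum encoder becomes a slowly trailing average of the regular encoder's parameter history.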
[00:08:15] So the momentum encoder has this other update rule, where it's a lagging, trailing exponential moving average of the encoder weights. I don't have a great intuition or explanation for why this exactly makes sense, but there's very strong empirical evidence that it works. So that's kind of the state of things, and it's nice because it means you can now get away with learning these self-supervised representations without having to have this gigantic batch of negatives at every iteration. This was fairly successful, and there were a bunch of follow-up papers that pushed this direction. Another one that you should be aware of is called DINO. Again, the idea is very similar. It uses a similar sort of momentum encoder: this dual setup of a normal encoder learned via gradient descent and a momentum encoder, just as in MoCo. But the loss is a little bit different: instead of using softmax, they use some kind of KL divergence loss. The reason I'm mentioning this one is that you should be aware of the existence of DINOv2, even if we don't talk about exactly what it does, because DINOv2 is a really strong model for self-supervised features that's used quite a lot in practice these days. What they basically did is take the recipe from DINOv1, which was kind of similar to MoCo and has a lot of ideas from SimCLR as well, along with a lot of unique details of their own. The big difference in DINOv2 is that they scaled up the training data quite a lot.
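For flavor, here is what a KL divergence between two discrete output distributions looks like. This is just the generic quantity, not DINO's exact loss, which involves centering, sharpening, and other details not shown here.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions p and q (each sums to 1).

    This is the general flavor of objective DINO-style methods use to
    pull the student's output distribution toward the teacher's.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

It is zero when the two distributions match and grows as they diverge, which is what makes it usable as a training signal between the two encoders' outputs.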
[00:09:32] A lot of these previous self-supervised approaches had been trained on the ImageNet dataset, which was 1 million images. DINOv2 was able to successfully scale this approach up to a much larger training set of about 142 million images. You know, in deep learning we like bigger networks, bigger data, more GPUs, more flops, all of those things. DINOv2 found a recipe for self-supervised learning that successfully scaled up to this much larger dataset and gives very strong self-supervised features, and it tends to be used quite a lot in practice today if you want to pick up features and then fine-tune them or supervise them for some of your own downstream tasks. So again, I don't want to walk through all the details of how this works.
[00:10:13] I don't expect you to know how it works, but I want you to know that it exists in case you want to pick it up and use it for some of your own projects in the future. So that's basically all I had to say about self-supervised learning. Any questions about that before we move on to the meat of today's lecture? [00:10:33] Okay, guess not. So today the main topic is generative models. This is really cool. This is an area of deep learning that basically went from not working at all 10 years ago to really, really working in the last couple of years. It has given rise to things like language models, which, as we'll see, can be viewed as generative models, and to all kinds of image generation models and video generation models. These really went from just absolutely not working at all when I was in grad school.
[00:11:02] You would look at these samples and peer into them, and they just looked like low-resolution, complete blurry garbage, but somehow you could see some promise in them. And I'm glad that people kept pushing on that, pushed through the blurry garbage, and scaled it up over the past decade, because now a lot of these techniques really do work, and that's very exciting. So this is an area of deep learning that basically didn't work at all the first time we taught this class, and it's really cool that it now does. But that said, a lot of the fundamental ideas around generative modeling actually remain the same: the ideas about how you think about data, and what the approaches for modeling it are. A lot of those mathematical fundamentals have not changed that much in the past decade. What changed is more compute, more stable training recipes, bigger datasets, distributed training, and the ability to scale all this up to more useful tasks. I think that is really what drove the progress over the past decade. There were some algorithmic tweaks too, and we'll see that especially next lecture when we talk about diffusion models. [00:12:00] But first, before we talk about generative modeling, I wanted to step back a little bit and talk about supervised versus unsupervised learning, right? Because there are a couple of different tasks that we try to approach in deep learning.
[00:12:14] And they can sometimes be sliced along a couple of different orthogonal axes. So I wanted to talk about those a little bit, just so we get our terminology and our nomenclature clear. Supervised learning is what we've mostly been doing all semester, except for last lecture. In supervised learning, we have a dataset of pairs X and Y, and the goal is to learn some function that maps from the input data X to the target or label Y. We've seen a lot of examples of this kind of approach so far. Something like image classification: the input X is an image, the output Y is a label. Or image captioning: the input X is an image, the output Y is some piece of text describing what we see in that image. Object detection: the input is an image, the output is a set of boxes and category labels describing the objects that appear in the image. Or segmentation: maybe you assign a label to every pixel in the input image. These are supervised learning problems because the task you're trying to solve, the thing you want to predict, is exactly what you have in your dataset. All you need to do, in some sense, is learn a function that mimics that X-to-Y mapping on your training dataset and then generalizes that mapping to new samples beyond your training set. Now, unsupervised learning is something a bit more fishy and mysterious and hard to describe. But the idea of unsupervised learning, or sometimes self-supervised learning, is that you don't have any labels. You just have data, just samples X.
[00:13:36] You just have images, and you want to learn some kind of structure from that data. There's no particular task you're necessarily targeting; you're just trying to uncover good representations, good structure, in all of that data. Why? So that, as we talked about in self-supervised learning, you can apply it to downstream tasks later on. But the task itself in unsupervised learning is often somewhat unspecified. Some examples of this are k-means clustering, where maybe we're trying to identify clusters in the data, which is some kind of structure we can extract from the raw pixels even though we didn't have labels. Or dimensionality reduction, PCA, where we're trying to uncover some lower-dimensional subspace or lower-dimensional manifold that explains the structure of our data.
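As a tiny illustration of uncovering a subspace without any labels, here is PCA via the SVD in NumPy; a sketch, not a library-grade implementation:

```python
import numpy as np

def pca(X, k):
    """Project data onto its top-k principal components.

    X: (n_samples, n_features). Returns the k-dimensional projection.
    No labels anywhere: the structure comes from the data itself.
    """
    Xc = X - X.mean(axis=0)                         # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                            # coordinates in the top-k subspace
```

If the data lies near a line or low-dimensional plane, a handful of components captures almost all of the variance, which is exactly the kind of hidden structure unsupervised methods try to surface.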
[00:14:18] Again, this is something we're trying to discover from the data itself; we don't have annotations of what it ought to be. Or density estimation: maybe we're trying to fit a probability distribution to the data, to understand what probabilistic function gave rise to the data samples we're seeing. And again, we don't have explicit labels or an explicit training set for this. So it's some kind of hidden or latent structure that we're trying to uncover through the process of training. This supervised versus unsupervised dichotomy is something you should always keep in mind. And you can do unsupervised learning that is not probabilistic or not necessarily generative. Something like clustering or PCA often has probabilistic interpretations, but these are examples of unsupervised learning that don't necessarily have a generative or probabilistic interpretation, or don't have to be thought of as such. So I often like to think about the supervised-unsupervised dichotomy as one spectrum along which methods or systems can lie. A separate spectrum along which we can classify systems or tasks is that of generative versus discriminative models, and these are inherently probabilistic. When we talk about generative or discriminative models, we're always imagining some kind of probabilistic structure in our data that we're trying to uncover or learn from. The difference is exactly what the probabilistic relationship is between the variables we're trying to model. So in a discriminative model, typically we have some Y and some X.
Um, and usually we think of the X as [00:15:37] x. Um, and usually we think of the X as something large, highdimensional, [00:15:38] something large, highdimensional, usually an image in our case, and the Y [00:15:40] usually an image in our case, and the Y is some kind of label or description or [00:15:42] is some kind of label or description or auxiliary information. Um, and so that [00:15:45] auxiliary information. Um, and so that would be like your text, like your [00:15:46] would be like your text, like your caption, like a category label, [00:15:48] caption, like a category label, something like that. Um, and when you do [00:15:50] something like that. Um, and when you do when you talk about a discriminative [00:15:52] when you talk about a discriminative model, we're trying to learn a [00:15:53] model, we're trying to learn a probability distribution of Y given X. [00:15:56] probability distribution of Y given X. So, we're trying to learn a distribution [00:15:58] So, we're trying to learn a distribution over labels um conditioned on our input [00:16:00] over labels um conditioned on our input image X. Um and to understand to really [00:16:04] image X. 
Um and to understand to really appreciate you know what's going on [00:16:06] appreciate you know what's going on probabilistically you need to remember [00:16:07] probabilistically you need to remember one very important feature of [00:16:09] one very important feature of probability distributions and that's [00:16:10] probability distributions and that's that they are normalized right when you [00:16:12] that they are normalized right when you talk about a probability distribution or [00:16:14] talk about a probability distribution or more generally a density function p of x [00:16:16] more generally a density function p of x um p of x is basically a function that [00:16:18] um p of x is basically a function that applies that that um that that uh that [00:16:21] applies that that um that that uh that assigns a nonzero number um to every [00:16:24] assigns a nonzero number um to every every possible input x with the very [00:16:26] every possible input x with the very important normalization constraint um [00:16:28] important normalization constraint um that if you integrate over the entire [00:16:30] that if you integrate over the entire space of all possible x's It integrates [00:16:32] space of all possible x's It integrates for it integrates to one, right? And [00:16:34] for it integrates to one, right? And this normalization constraint really [00:16:35] this normalization constraint really gives rise to the the power of [00:16:37] gives rise to the the power of probabistic models in some sense because [00:16:39] probabistic models in some sense because the normalization constraint means that [00:16:41] the normalization constraint means that all of your x's need to compete for [00:16:43] all of your x's need to compete for probability mass. There's a fixed unit [00:16:45] probability mass. There's a fixed unit amount of probability mass. Um, and the [00:16:48] amount of probability mass. 
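The normalization constraint is easy to see in code with a softmax, which turns arbitrary scores into a distribution summing to one; pushing one score up necessarily steals mass from the others. A minimal sketch with made-up scores:

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary real-valued scores into a normalized distribution."""
    e = np.exp(scores - np.max(scores))   # shift by the max for numerical stability
    return e / e.sum()

# three candidate outcomes competing for a fixed unit of probability mass
p = softmax(np.array([2.0, 1.0, 0.0]))
# raise the first score: its probability goes up, and the others' must go down
p2 = softmax(np.array([5.0, 1.0, 0.0]))
```

Both `p` and `p2` sum to exactly one; the only way the first entry gains mass is by the other entries losing it.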
[00:16:49] Choosing a probability distribution or density function basically amounts to apportioning out that fixed amount of probability mass, smearing it across all possible values of x that could exist. And all of those x's are in competition, because there's only a fixed unit amount of mass to go around. So if you push up the probability of one x, necessarily the probabilities, or densities, of the other x's have to go down. In these different formulations of probabilistic models, what changes is which variables are competing for probability mass. That means that even though the symbols we write on the page look very similar, the different competitions over probability mass induce very different structure that the model is trying to learn or uncover. In the case of a discriminative model, we're learning a probabilistic model of y conditioned on x, which means that for every x, our model is predicting a probability distribution over all possible labels. So if our labels are discrete and categorical, like cat and dog, then we have a fixed amount of probability, zero to one, and cat and dog must sum to one, and we have a separate probability distribution over the labels for every input x. Crucially, notice that there is no competition among images for probability mass, because every image induces its own distribution over the label space. There's no competition for mass across different images; the only things competing for mass are the different labels for each image. That's very important when you think about discriminative modeling. One other interesting facet of discriminative models is that they have no real way to reject unreasonable inputs.
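To make that concrete: a row-wise softmax gives each input its own independently normalized label distribution, so there is no competition across inputs, and even an out-of-vocabulary input is forced onto the fixed label set. A sketch with made-up logits:

```python
import numpy as np

def row_softmax(logits):
    """Per-input label distributions: each row normalizes independently."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# 3 inputs, 2 labels (cat, dog): each input gets its own distribution
logits = np.array([[3.0, 1.0],    # confident cat
                   [0.5, 2.5],    # confident dog
                   [0.0, 0.0]])   # a monkey photo: still forced onto {cat, dog}
probs = row_softmax(logits)
```

Each row sums to one on its own, and the third row shows the shortcoming discussed above: the model has no way to say "neither", only to spread mass over the vocabulary it was given.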
[00:18:32] So once we've fixed our label space of, say, cat and dog in this example, if we feed in something that's not a cat or a dog at all, like a monkey or a piece of abstract art, the system has no flexibility. It has no freedom to say "this is unreasonable"; it's forced to output a distribution over the fixed vocabulary that we assigned at the beginning. That could be seen as a shortcoming, but it's just important to understand exactly what is happening under the hood when you think about modeling different kinds of data probabilistically. Now, a generative model is something very different. Instead, what we're doing in a generative model is learning a distribution P(X): we want to learn a distribution over all possible images X. And now this is very interesting.
[00:19:09] This means that all possible images that could ever exist in the universe are now competing with each other for probability mass. And this is a really hard problem. It sounds kind of simple on its face, but it requires you to confront some very deep and philosophical questions about the world, right? Because now all images are competing for probability mass, and in order to model that, you're forced to answer questions like: how much probability mass should an image of a three-legged dog get relative to an image of a three-armed monkey? Probably the three-legged dog should get more probability mass, because that can happen by a dog losing a leg. But how are you going to get a three-armed monkey? I don't know.
[00:19:50] That seems much more rare, unless you're modeling sci-fi images or something like that. So once you're in this regime of all possible images competing for probability mass, your model really needs to think very carefully about the kinds of structure that can exist in the data, and it becomes a much, much harder problem to solve. Another interesting thing here is that a generative model does have the capacity to say "no, this is not a reasonable image, this is not a reasonable input," and the way it can do that is by assigning low or even zero probability mass to any one image that it gets.
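A minimal sketch of that rejection mechanism, using a 1-D Gaussian as a stand-in for a learned density p(x) (the model, threshold, and numbers are all illustrative assumptions, not from the lecture):

```python
import math

def gaussian_log_density(x, mu=0.0, sigma=1.0):
    # log p(x) for a 1-D Gaussian, standing in for a learned model of p(x).
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def is_in_scope(x, log_threshold=-5.0):
    # The model "rejects" an input by assigning it very low density.
    return gaussian_log_density(x) > log_threshold

print(is_in_scope(0.3), is_in_scope(5.0))  # near the mode vs. far in the tail
```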
[00:20:22] So maybe we only want our generative model to be a generative model of zoo animals, and if we have a generative model of zoo animals, then if we feed in an image of abstract art, it should get zero probability mass. So now we have a mechanism for rejecting, for saying that this type of image is not within the scope of what we care about. And now a conditional generative model is even more interesting. This is where we're learning a conditional distribution over images x conditioned on some label y. This means that for every possible label, we're inducing a competition among all possible images.
[00:21:01] So in this case, if y is a categorical label of cat or dog, then for each possible categorical label, cat and dog, the model separately induces a competition among all possible images. So in the top distribution, maybe this is the probability of all these images conditioned on the cat label. Then obviously the cat image should be high. Maybe the monkey and dog images should be somewhat higher because they're at least still mammals, but the abstract art should be very low, maybe even zero. And then there's a different distribution over images if we're conditioning on the dog label. This gets even more interesting if you imagine that your conditioning signal Y is something much richer than a single categorical label. That conditioning signal Y might have been a text description.
[00:21:39] It might have been a whole paragraph of written text. It might have been another image plus a piece of text. Once you talk about modeling these very rich output spaces X conditioned on very rich input spaces Y, you're asking the model to solve a very complicated and quite ill-defined problem that requires very deep reasoning about the objects involved. So that's why I think generative modeling is such an interesting topic: it looks kind of simple. All we did was flip the X and the Y; how hard could it be? But all of a sudden, it required us to think really hard about what's going on in the visual world. What's also interesting is that we wrote down discriminative models, generative models, and conditional generative models as three separate categories of things. But actually they're all related.
[00:22:21] They're related through Bayes' rule, which is one of the most amazing relationships in probability. In particular, it says that if we have access to a discriminative model P(Y|X), an unconditional generative model P(X), and some prior distribution over our labels P(Y), we can compose those to build a conditional generative model P(X|Y). Or in general, you can always rearrange Bayes' rule so that if you have any two of these, you can always get the third, which is pretty cool. So in principle you can build a conditional generative model out of the other two components, although in practice this is not really how you do it.
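Here's a toy worked example of that composition (all the probability values are made up for illustration): given a discriminative model p(y|x), an unconditional generative model p(x), and the label prior p(y), Bayes' rule yields the conditional generative model p(x|y):

```python
# Toy discrete world: two "images" and two labels (values are made up).
images = ["cat_img", "dog_img"]
labels = ["cat", "dog"]

p_y_given_x = {"cat_img": {"cat": 0.9, "dog": 0.1},   # discriminative model
               "dog_img": {"cat": 0.2, "dog": 0.8}}
p_x = {"cat_img": 0.6, "dog_img": 0.4}                # unconditional generative model

# Label prior obtained by marginalizing: p(y) = sum_x p(y|x) * p(x).
p_y = {y: sum(p_y_given_x[x][y] * p_x[x] for x in images) for y in labels}

# Bayes' rule: p(x|y) = p(y|x) * p(x) / p(y).
p_x_given_y = {y: {x: p_y_given_x[x][y] * p_x[x] / p_y[y] for x in images}
               for y in labels}

# For each label, the images now compete for probability mass and sum to 1.
print(p_x_given_y)
```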
[00:23:06] You tend to learn conditional generative models from scratch on their own, although, as we'll talk about with diffusion, you do sometimes end up learning conditional and unconditional models jointly, for various reasons. But it's nice to keep in mind that there's a very deep relationship across these different flavors of probabilistic models. So then you might be wondering: okay, what can we do with these different flavors of probabilistic models? With discriminative models, this shouldn't require a lot of creativity; we've seen a lot of examples so far this quarter. With discriminative models, after you train them, you can assign labels to data. You can also do feature learning, right?
[00:23:40] In the case of, say, supervised learning on ImageNet, we've seen that in the process of trying to predict categorical labels of images, those models tend to learn useful feature representations in the middle that can be transferred to downstream tasks. So you tend to use discriminative models either for directly predicting the y's that you care about, or for learning the feature representations that are induced in the process of trying to predict those y's. As for generative models: these unconditional generative models I actually think are kind of useless in general, but what they do let you do is maybe detect outliers. They can look at images and ask: do they really have low probability mass? Are they unreasonable images? You can also sort of use them for feature learning without labels.
[00:24:21] The hope is that in the process of trying to fit an unconditional distribution P(X), the model learns some useful feature representations. Although in general these have not been super successful for self-supervised learning; typically the contrastive methods that we talked about in the previous lecture have in practice been much more successful for self-supervised learning than unconditional density estimation. Or, in principle, you could use this unconditional generative model to sample and produce new samples X. But I think this is actually kind of useless, because it gives you no control over what is being sampled, right? If you have an unconditional generative model of images, you can sample from it to get a new image, but you have no control over what's in that image.
[00:24:58] So I think it's mathematically interesting to think about how to build such models, but I don't think they have as much practical significance. Conditional generative models are where I think things are actually the most useful and the most interesting. These are the kinds of generative models that get trained and used in practice by far the most. You can in principle use them to assign labels while rejecting outliers, right? You could say: if I have a piece of data X, look at P(X|Y) over all of my possible Y's, and then reject if that's too low among all the possible Y's. So in principle, you could use conditional generative models to do some kind of classification while also maintaining the ability to reject outliers.
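A sketch of that classify-with-reject idea, assuming we already have hypothetical density values p(x|y) for one input (the helper name, the uniform-prior assumption, and the threshold are all illustrative, not from the lecture):

```python
def classify_or_reject(density_by_label, threshold=1e-3):
    # density_by_label: hypothetical values of p(x|y) for one input x,
    # keyed by label y. If x is implausible under *every* label, reject
    # it as an outlier; otherwise return the most likely label
    # (implicitly assuming a uniform prior over labels).
    best = max(density_by_label, key=density_by_label.get)
    if density_by_label[best] < threshold:
        return None  # outlier: low density under all labels
    return best

print(classify_or_reject({"cat": 0.02, "dog": 0.3}))   # plausible under "dog"
print(classify_or_reject({"cat": 1e-6, "dog": 5e-5}))  # implausible everywhere
```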
[00:25:38] Although I don't think that's really used too much in practice. What's really useful about conditional generative models, and what is used in practice all the time, everywhere, is sampling to generate new data from labels, where you actually get to control what is generated, right? Because if your Y is now maybe a piece of text, you can write down "I want to see a cat wearing a hot-dog-flavored t-shirt on the moon" or whatever, and then your favorite generative model of images will generate you a brand new image X conditioned on that label Y. So this is where I think all the juice is, where all the magic is, where all the excitement is. Although, somewhat confusingly, in the literature whenever you see the term "generative model," people kind of mush together unconditional and conditional generative modeling.
[00:26:20] And a lot of the papers that you read will even sometimes drop the conditioning signal. Why? Because it makes the math look cleaner; it makes the equations look cleaner. But I don't think unconditional generative modeling is super useful. It's almost always conditional generative modeling that you really want to do in most cases. So just be aware that when you read papers, see equations, or hear people talk about generative modeling, the one they probably care about training more is these conditional generative models, even if the equations or notation don't reflect that. [Student question:] So, for an unconditional generative model, what does this— Ah, so I didn't really tell you that, and I was being sneaky there, because how you parameterize that actually depends a lot.
[00:27:02] There are a lot of different formulations for all of these things, and what exactly the inputs and outputs of the network are will vary quite a lot depending on the formulation. We're going to talk about a whole taxonomy of those in a couple of slides. Okay, so why generative models? The main reason you want to build generative models is whenever there's some ambiguity in the task you're trying to model, right? The beauty of a probabilistic model P(X|Y) is that it's probabilistic: there might be a whole space of possible outputs X conditioned on that input label Y. Sometimes there's just a deterministic mapping, right? I look at an image and ask how many cats are in the image; there are just three cats, there's just one answer. But in a lot of cases it's more subtle.
[00:27:43] If I ask for a picture of a dog wearing a hot dog hat, there are a lot of different images that could exist based on that query. There's uncertainty in the output. And that's exactly what generative models are trying to model: they model a whole distribution of outputs conditioned on their input signal. So anytime there's ambiguity in the kind of output you want the model to produce conditioned on the input, that's when you want to turn to a generative model. We'll see a couple of examples of where this has gotten used a lot in the last couple of years. One example is language modeling; someone asked about ChatGPT a moment ago. In language modeling, what you're often trying to do is predict output text X from input text Y.
[00:28:22] Sorry, the X's and Y's ended up flipped in an awkward way on this example. But here's an example from ChatGPT, where the input is "write me a short rhyming poem about generative models." And wow, it actually works. This is crazy; this didn't work at all when we first taught this class. I'm not going to read it, that would be embarrassing; you can read it yourself. But this is a conditional generative model. You could imagine there are a lot of different possible rhyming poems about generative models that one might write, and we had to pick one of them. The beauty of a generative model is that, in principle, it models that whole distribution over possible outputs conditioned on that input. Or text-to-image.
[00:29:00] You know: make me an image showing a person teaching a class on generative models in front of a whiteboard. You're kind of looking at one example through your eyes; ChatGPT gave you a different example, right? There's a whole space of possible images that might match this input text, and a generative model allows you to model that whole space and sample from it depending on what you want. Or image-to-video: input an image and ask, what happens next? This was me holding my AirPods over a cardboard box. Maybe I'm going to drop them. Maybe I'm going to move my hand. Maybe I'm going to move my hand and the AirPods will morph into a different kind of AirPods. There are all kinds of things that could happen, and a generative model in principle lets you model and sample from these possible futures.
[00:29:40] So this is why we care about generative modeling: anytime there's ambiguity in the output, that's when you want to turn to a generative model to solve it. And someone asked what the inputs and outputs are. It turns out this is a huge field, and it's surprisingly one area of deep learning that is quite mathematical, because it requires thinking about the different ways to model probability distributions, and how we can write down loss functions that cause the right things to happen. So this is one area where, when you read papers, there may be a lot of math and a lot of equations, and you might actually need to think through those equations pretty carefully to understand what's going on.
[00:30:16] So this is one subfield that tends to have more math and more equations, which I think is kind of fun and interesting. There's a whole taxonomy of the different kinds of generative models that people build. On one hand, one part of the family tree is what we call explicit density methods. These are ones where the whole point of the model is to model p(x), or p(x | y), and with these explicit density methods you can actually compute that value p(x) for any sample x. The counterpoint is implicit density methods. These are ones where you can't actually get that density value p(x) out of the model, but you can somehow sample from that probability distribution.
[00:31:00] So the difference is that in an implicit model you can't access the value of the density function, but you can somehow sample from the underlying density; the model has implicitly learned to model the density even if you can't read the value out. On the explicit density side, it's almost the opposite: in many cases you can get that explicit density value out, but then sampling tends to be more complicated with these explicit density methods. Not always, but sometimes, right? And the reason you might turn to implicit models is that in many cases you may not actually care about knowing the exact density value for any input; maybe all you care about is generating good samples, and generating a good diversity of samples.
[00:31:40] So if the thing you really care about is sampling, then maybe you don't actually need to be able to read off the value of the density for any input. And then things break down and cascade and get more fractal-like from here. Inside explicit density methods, there are ones where you really can compute the real p(x) that's being modeled, and autoregressive models are one example of that. Another version of explicit density methods are ones where you can get a density value out, but it's not the real one; it's some kind of approximation to the true density of the data. Variational autoencoders are one example of an explicit but approximate generative method that we'll see. Now, on the other branch of the family tree, we can think about direct methods for implicit density.
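To make the explicit/implicit split concrete, here is a minimal toy sketch (my own example, not from the lecture): the explicit model is a closed-form 1-D Gaussian whose density we can evaluate at any point, while the implicit model is just a generator function g(z) that transforms noise into samples and never exposes a density value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Explicit density model: a 1-D standard Gaussian. We can both evaluate
# the density p(x) at any point AND draw samples from it.
mu, sigma = 0.0, 1.0

def explicit_density(x):
    # closed-form value of p(x) under N(mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Implicit model: a feed-forward "generator" g(z) that maps noise to
# samples. We can sample from it, but there is no density value to read out.
def implicit_sample(n):
    z = rng.standard_normal(n)   # latent noise
    return np.tanh(z) * 2.0      # some deterministic transform g(z)

samples = implicit_sample(10_000)
# We can estimate statistics of the implicit distribution from samples,
# but the model itself never exposes p(x).
print(explicit_density(0.0))     # the density peak, 1/sqrt(2*pi)
print(samples.mean())
```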
[00:32:24] These are ones where it requires a single network evaluation to draw a sample from the underlying distribution that's being modeled. A generative adversarial network is an example of a generative model in this part of the family tree. The other part, I don't know if it has a good name; I called it "indirect," but this is a name I made up yesterday, so please feel free to correct me if there's a better term for it. These indirect ones are ones where you can sample from the underlying density p(x) that's being modeled, but it requires some kind of iterative procedure. There's no feed-forward function where you can put in an input and get the sample directly out.
[00:32:59] There's some kind of iterative method that you need to run in order to draw a sample from the underlying density that's being modeled, and diffusion models are an example of this that we'll see next time. I told you a couple of slides ago that people are sloppy with notation and drop the y, and I did that on purpose on this slide, so that someone would ask me that question and you would always be attentive to that fact. So yes, exactly: every time I've written p(x) on this slide, and actually on all the rest of the slides this lecture, I have been lazy and dropped the y, but you should always imagine an additional condition on y in all of these p(x)'s that you see for the rest of the lecture. So thank you for asking that.
[00:33:33] So the question was: for the indirect method, can you just treat that indirect iterative procedure as a black box and then treat it as a direct sampling method? In principle yes, but in practice no, because your samples end up approximate. It depends on the exact method, but with diffusion models you would need to take an infinite number of steps in order to draw a true sample, so instead we approximate that with a finite number of steps. And that's true of other methods as well. Diffusion models are the most common example of this today, but some kind of Markov chain method, or MCMC method, in years past might have also had this property: there is an iterative procedure, but if you want to draw an exact sample from the distribution that's being modeled, you need an infinite number of steps to converge. So we always approximate that by taking a finite number of steps.
[00:34:19] Okay. And I was pretty proud of this taxonomy because it's very symmetric: there are four leaves and two branches, and we're going to cover half the tree today and half the tree next time. So I thought that was a pretty nice breakdown. The next question is: what's the difference between an approximate density method and directly sampling from an implicit p(x)? The difference is that even in an indirect but implicit method, there's no density value anywhere to be found; you can't compute one at all. But you can still iteratively sample in some way.
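As a toy illustration of iterative sampling, here is a generic Metropolis (MCMC) sampler. This is my own stand-in sketch, not the diffusion models covered next lecture: the chain only yields exact samples in the limit of infinitely many steps, so in practice we stop after a finite number and accept an approximate sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Iterative (Metropolis) sampling from an unnormalized density.
# The chain converges to the target only in the limit of infinitely many
# steps; in practice we run finitely many and accept approximate samples.
def unnormalized_target(x):
    return np.exp(-0.5 * (x - 3.0) ** 2)   # proportional to N(3, 1)

def metropolis_chain(n_steps=5000, step=1.0):
    x = 0.0
    chain = []
    for _ in range(n_steps):
        proposal = x + step * rng.standard_normal()
        # accept with probability min(1, p(proposal) / p(x))
        if rng.random() < unnormalized_target(proposal) / unnormalized_target(x):
            x = proposal
        chain.append(x)
    return np.array(chain)

chain = metropolis_chain()
print(chain[1000:].mean())   # approaches the target mean 3.0 as the chain grows
```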
[00:34:50] With an approximate density method, you can still get a density value out; you can actually get a density value that's going to be some approximation of, or bound on, the true p(x).
[00:35:00] Okay. So the first such generative model that we'll actually talk about in a bit more concrete specificity is autoregressive models. For autoregressive models, we're actually going to take a slight detour and talk about a really general idea behind all of generative modeling, and that's the idea of maximum likelihood estimation. Maximum likelihood estimation is a quite general procedure that we can use to fit probabilistic models given a finite set of samples. The idea is that we're going to write down some explicit function for the density; we said that some methods are going to explicitly model the density.
[00:35:34] Well, let's do it with a neural network. Let's write a neural network that's going to take in the data x and the weights w of the network, and spit out a number that tells us the density. Then, given a dataset of samples x^1, x^2, ..., x^n, we're going to train the model via this objective function: we want to find the weights that make the dataset most likely, because as we vary the weights, we vary the kind of densities being modeled by the network. So we want the network to select the density that maximizes the likelihood of the data. Note that we said likelihood rather than probability; that's a deep philosophical rabbit hole you can fall into. The difference is what we're varying.
[00:36:19] Right? If you think about probability, you imagine that the density is fixed and we're sliding x around, changing the probability of x under a fixed distribution. When you talk about likelihood, instead you're fixing the samples x and varying the distribution itself, and asking how the probability density of those samples changes as we vary different distributions. So you have to think very carefully in these equations about what's being fixed and what's varying. In this process of maximum likelihood estimation, what we're doing is varying the distribution that the neural network is modeling, to try to maximize the probability of the fixed set of samples from that distribution that we have in our training set.
[00:37:00] Right? And I guess the unsaid thing behind all of this is that we assume there is some underlying true probability distribution, p_data, which was used by the universe to generate the data that we are seeing. In some sense, what we always want to do is model that true underlying unknown distribution p_data, and we can never access p_data directly, because we don't have an omniscient view of exactly how the universe works. Instead, we get some samples from p_data that the universe has given to us, and what we're trying to do through our learning procedure is uncover that unknown distribution p_data given a finite number of samples from it. So one procedure you can follow is: select the distribution that makes the data I actually saw most likely. And that's the objective; that's the maximum likelihood objective function.
[00:37:45] Right. And then there's a standard trick that we do here: we assume that the data was i.i.d., independent and identically distributed. So we assume that each of those x's was drawn from that true p_data distribution, and now we want to maximize the joint probability of all the data that we saw. But because the samples are independent, we can factor that joint down into the independent likelihoods of each of the independent samples. And then the common trick that we always use is the log trick. We know that log is a monotonic function, so maximizing something is equivalent to maximizing the log of that something. Log is also very convenient because it swaps sums and products.
[00:38:23] So it's common, instead of maximizing the likelihood of the data, to maximize the log-likelihood of the data, and that's the same as maximizing the likelihood. Once we apply the log, that product splits into a sum, and sums are easier to handle. And now we slot in our neural network, because the network is maybe directly outputting the density. So this gives us a direct objective function, a very concrete loss function, that we can use to train a neural network to solve this kind of generative modeling problem. But we need a little bit more structure here to actually make progress. This idea of maximum likelihood estimation is very general: it doesn't really assume anything about the kind of data, it doesn't really assume any structure in the data, and in general we need to put a little more structure on this to make progress.
[00:39:08] So autoregressive models basically make the assumption that there is some canonical way to take our data and split each data sample x into some sequence of sub-parts x_1, x_2, ..., x_T. You've got to be careful with indices here: these are sub-parts of a single sample, so I use subscripts, whereas on the previous slide we had superscripts, x^1 to x^n, to indicate different samples. So be careful with that: superscript on this slide means different samples x, and subscript on this slide means different parts of the same sample.
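The maximum likelihood recipe from a moment ago can be sketched numerically. This is my own toy instance with an assumed one-parameter model p(x; w) = N(x; w, 1): gradient ascent on the log-likelihood of the samples recovers the sample mean, which is the maximum likelihood estimate for a Gaussian mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from the unknown "p_data" (here a Gaussian with mean 2.5)
data = rng.normal(loc=2.5, scale=1.0, size=1000)

def log_likelihood(w):
    # sum_i log p(x_i; w) for p(x; w) = N(x; w, 1),
    # dropping the constant -0.5 * log(2*pi) per sample
    return -0.5 * np.sum((data - w) ** 2)

# Maximize the log-likelihood by gradient ascent on the single weight w
w, lr = 0.0, 1e-4
for _ in range(200):
    grad = np.sum(data - w)   # d/dw of the log-likelihood
    w += lr * grad

print(w)                      # converges to the sample mean of the data
```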
[00:39:42] So we assume there's some canonical way to break up our data sample x into a sequence of sub-parts, and now we can apply the chain rule of probability. The probability of x is just the joint probability of all of those sub-parts x_1 to x_T, and given any probability distribution, you can always break it apart with the chain rule: the joint probability of all these variables equals the probability of the first, times the probability of the second conditioned on the first, times the probability of the third conditioned on the first and the second, etc. This is the chain rule of probability. It requires no assumptions; it's always true of any joint distribution of random variables.
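The chain-rule factorization can be checked numerically on a small discrete joint distribution. A toy sketch of my own with two variables; the same identity extends term by term to T sub-parts.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary joint distribution p(x1, x2) over 3 x 4 discrete outcomes
joint = rng.random((3, 4))
joint /= joint.sum()

p_x1 = joint.sum(axis=1)                 # marginal p(x1)
p_x2_given_x1 = joint / p_x1[:, None]    # conditional p(x2 | x1)

# Chain rule: p(x1, x2) = p(x1) * p(x2 | x1), exactly, with no assumptions
reconstructed = p_x1[:, None] * p_x2_given_x1
print(np.allclose(reconstructed, joint))   # True: the factorization is exact
```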
[00:40:26] And this sort of gives us our objective function: you could basically train a neural network that inputs the previous part of the sequence and tries to give us a probability distribution over the next part of the sequence. Does that sound familiar? Does that sound like something we've done before? RNNs, yes. That's exactly what an RNN is doing, right? An RNN has this very natural structure where, by passing hidden states forward through time, the hidden state always depends on the beginning of the sequence up to the current point. So there's a very natural way to use RNNs for autoregressive modeling.
Um so you're you have your sequence of hidden [00:41:02] you're you have your sequence of hidden states that are basically summarizing [00:41:04] states that are basically summarizing your your sequence and then from each [00:41:06] your your sequence and then from each hidden state you predict probability of [00:41:08] hidden state you predict probability of the next piece of the sequence. Um [00:41:10] the next piece of the sequence. Um condition on the rest condition on all [00:41:11] condition on the rest condition on all earlier parts of the sequence um and [00:41:13] earlier parts of the sequence um and that basically is an RNN language model [00:41:14] that basically is an RNN language model that we saw some lectures ago. Have we [00:41:17] that we saw some lectures ago. Have we seen anything else that can do this? [00:41:19] seen anything else that can do this? Yes, transformers. Um and particularly [00:41:21] Yes, transformers. Um and particularly mask transformers, right? So we talked [00:41:23] mask transformers, right? So we talked in the transformers lecture um [00:41:25] in the transformers lecture um transformers can also be used to have [00:41:27] transformers can also be used to have this this structure where by masking out [00:41:29] this this structure where by masking out the attention matrix in the right way we [00:41:31] the attention matrix in the right way we can make each output of the transformer [00:41:32] can make each output of the transformer depend on only the prefix of the [00:41:34] depend on only the prefix of the sequence. So we can also use [00:41:36] sequence. So we can also use transformers for autogressive um [00:41:37] transformers for autogressive um autogressive modeling and this and [00:41:39] autogressive modeling and this and they're very commonly used for this. [00:41:40] they're very commonly used for this. 
[00:41:43] Okay. But the problem with autoregressive modeling is that you need to break your data up into a sequence, and this is very natural with text data, right? Because text data is naturally a 1D sequence. And it's even a 1D sequence of discrete things, which is great, because it's very easy to model probabilities of discrete things. We've been doing that all semester with our favorite softmax cross-entropy loss, right? The softmax cross-entropy loss is always a distribution over a fixed, discrete number of categories: the network predicts a score for each one of those, we normalize with a softmax, and we train with a cross-entropy loss. We know how to do that. So that's why these things fit very naturally for language models: language is already discrete, and language is already a 1D sequence.
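The "predict a score per category, normalize with softmax, train with cross-entropy" recipe can be written as a minimal NumPy sketch (illustrative names; real frameworks fuse these steps for stability, as done here via log-softmax):

```python
import numpy as np

def softmax_cross_entropy(scores, target):
    # scores: (V,) unnormalized scores over V discrete categories.
    # target: index of the category that actually occurred (next token).
    # Numerically stable log-softmax, then negative log-likelihood.
    shifted = scores - scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target]
```

With uniform scores over V categories the loss is log V, and raising the target's score lowers the loss, which is the behavior training exploits.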
[00:42:25] There's a little bit of fuzziness in that there's a tokenizer in there; we're not going to get into that. But these models are very naturally well suited to language problems because language is already 1D and already discrete. Images are trickier, because images are not naturally 1D, and images are also not naturally discrete. We often think of images as continuous, real-valued things. So these models don't fit quite as nicely onto images. But, you know, you've got a hammer, you're going to whack some nails. So people definitely applied autoregressive models to images in kind of a naive way, at least some years ago.
[00:43:01] One thing you can do to model images with autoregressive models is to treat an image as a sequence of pixels, right? In particular, each pixel is actually just three numbers, and in most displays and most representations of images those numbers are actually discrete, right? Most JPEGs or PNGs, most of the file formats we use to store images, are typically 8 bits per channel, so there's actually only a fixed number of values that each pixel can take. So a pixel is just three single-byte values. A single byte is just an integer from 0 to 255, so a pixel is three integers, each of which can be 0 to 255.
[00:43:45] So what we can do is take our image and rasterize it out into a long sequence, where each element of the sequence is one of the subpixel values of our image. And now we've turned our image into a one-dimensional sequence where each entry is a discrete value. So you can apply autoregressive modeling directly to that sequence, in exactly the way that you might have for a language model, using an RNN or a transformer. Can anyone spot a problem with this approach? Too long. Very expensive. Very, very expensive. So a kind of reasonable image that you might want to model is maybe 1024 by 1024. That's not even that high a resolution, really, but it's a pretty good resolution. But if you have a 1024 x 1024 image, that's going to be a sequence of about 3 million subpixels.
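The rasterization step above amounts to flattening the H x W x 3 array of 8-bit values into one long vector of discrete tokens. A minimal sketch with a random toy image standing in for real data:

```python
import numpy as np

# A toy "image": H x W x 3 array of 8-bit subpixel values (random,
# standing in for a real photo).
H, W = 1024, 1024
image = np.random.randint(0, 256, size=(H, W, 3), dtype=np.uint8)

# Rasterize into a 1D sequence of discrete tokens, each in [0, 255].
sequence = image.reshape(-1)

print(sequence.shape[0])  # 1024 * 1024 * 3 = 3145728 subpixels
```

Each entry is an integer in [0, 255], so it can be modeled with the same 256-way softmax machinery used for language tokens; the catch, as noted, is the sequence length of roughly 3 million.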
[00:44:30] People actually can model sequences in the millions these days, but it gets very, very expensive. There's got to be a more efficient way to do this. So there were some papers a couple of years ago where people applied these sorts of autoregressive models directly to the pixels of images, but they were not super successful, I think, because they're very difficult to scale to high resolution. A spoiler alert that we'll talk about a little more next lecture: this has actually made a resurgence in the last couple of years. But the trick is to not model the sequence as individual pixel values, but instead to use some other kind of process or procedure or model, maybe a neural network, to break the image into a sequence of one-dimensional tokens. That's something we'll talk about a bit more next lecture.
[00:45:10] But this at least gives you a sense of what an autoregressive model is: what's the probabilistic formulation, how do you apply them to language, how do you apply them to images? So from autoregressive models we next turn to variational autoencoders, and variational autoencoders are pretty fun. In these autoregressive models, we talked about how we're trying to do maximum likelihood: we broke our data up into a sequence of parts, and we're trying to maximize the likelihood of the data. Variational autoencoders are going to do something a little bit different. It's still going to be an explicit method; there's still going to be some kind of density that we can compute. But it's going to be intractable; we're only going to be able to approximate it.
[00:45:54] Why are we going to do that? We had a perfectly good method that computed densities exactly. What we're going to get in exchange is the ability to compute reasonable latent vectors over our data. We're going to have vectors that represent our data, that pop out naturally from the learning process, and those vectors are going to be useful in their own right. The ability to get access to those latent vectors is going to be useful enough to us that we're willing to give up computing exact densities, and instead settle for approximate densities that are actually lower bounds on the true density. Oh, and the motivation for breaking stuff up into a sequence in autoregressive models: it's because it factors the problem. It makes each part easier to model, right?
[00:46:32] Imagine you're doing language modeling, right? You have a vocabulary of V words, and I want to model the probability of two words jointly. How many possible two-word sequences are there? There's V squared. How many possible three-word sequences are there? There's V cubed. And in general, how many T-word sequences with a vocabulary of V are there? It's V to the T, right? So that's bad: it grows exponentially. If you wanted to directly model the joint distribution of a sequence of T things, the number of entries in the discrete probability distribution you need to model would grow exponentially with the sequence length, and that quickly becomes completely intractable if we want to go to long sequences. So the reason we break it up is so that we don't have to model it all at once.
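The counting argument above can be made concrete in a couple of lines (illustrative sketch; the vocabulary size is made up):

```python
def num_sequences(V, T):
    # Number of distinct length-T sequences over a vocabulary of size V.
    # A direct joint table would need this many entries, while the
    # chain-rule factorization needs only T conditionals of size V each.
    return V ** T

# e.g. even a tiny 10-word vocabulary blows up quickly:
# num_sequences(10, 2) -> 100, num_sequences(10, 3) -> 1000
```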
[00:47:12] We factor it in this way and predict only one part, conditioned on the previous parts. Good question. Can we apply the log trick to mitigate that? Yeah, exactly. So in practice you'll never actually see these raw probability density values modeled; almost always you're going to work in log probabilities instead. The model is going to output log probabilities, and you're going to compute your loss in log space; for numeric stability you're going to compute almost everything in log space in practice. So then the p of x is being generated because at the top of the transformer it's outputting a probability distribution over the next token conditioned on all the previous tokens, and it does that for every point in the sequence.
[00:47:50] So you could actually recover this exact probability density value by multiplying out the values at all points in the sequence. If I have an input sequence and I pass it to the transformer, the transformer will have predicted, at every point in the sequence, the distribution over all tokens conditioned on the earlier part of the sequence. I can look up what the actual next token was and what its predicted probability was, and then multiply all of those across the entire sequence. That's how we can recover the exact density value out of one of these autoregressive models, and that applies either to an RNN or a transformer. Okay. Good questions. So then, in a variational autoencoder, things get hairy.
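The recovery procedure just described, combined with the log trick from a moment earlier, can be sketched as follows (illustrative names; the per-step distributions would come from an RNN or masked transformer in practice):

```python
import numpy as np

def sequence_log_prob(step_probs, tokens):
    # step_probs: (T, V) array; row t is the model's predicted
    # distribution over the next token given tokens[:t].
    # tokens: (T,) the tokens that actually occurred.
    # log p(x) = sum_t log p(x_t | x_<t): summing logs instead of
    # multiplying raw probabilities, for numerical stability.
    picked = step_probs[np.arange(len(tokens)), tokens]
    return float(np.sum(np.log(picked)))
```

Exponentiating the result recovers the exact density value p(x) that the autoregressive factorization defines.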
[00:48:30] So we're actually going to drop the V and talk about autoencoders for just a couple of slides, because I don't think we've done that yet this course. A non-variational autoencoder is basically an unsupervised method for learning to extract features Z from inputs X without labels. This is actually in the vein of the self-supervised learning that we just talked about. Our notion is that the features ought to extract useful information about the data, right? Maybe they somehow implicitly encode the identity of objects in the image, how many of them there are, what their colors are. We want this feature vector Z to contain useful information about the input X. And the encoder itself could be a neural network of any architecture.
[00:49:12] It could be an MLP, a transformer, a CNN, whatever you want. But it inputs our data X, and it's going to output some vector Z. And then the question is, how do we learn this without labels? We actually saw a lot of examples of this in the previous lecture, but there's a very simple one, which is just to try to reconstruct the input. So we're now going to have a second part of the model called the decoder, which is going to input the Z and then output back an X. And we're going to train this thing so that the output from the model actually matches the input. This is, in some sense, the stupidest loss function ever: we're just training the model to mimic the identity function. Why do we do that? We already know the identity function.
[00:49:48] Why are we expending a lot of FLOPs and training a neural network on a big data set to just learn the identity function that we already know? It's because we're going to bottleneck it in some way. If this model had infinite capacity, for example if that Z vector was very wide, if there were no constraints on the learning, I would expect a neural network to just nail this problem. But we don't want that, because we explicitly don't care about learning this objective; we already know the identity function, and we don't need an expensive neural network to compute it. What we want to do is force the network to try to learn the identity function under some constraint. And the constraint that you often use in a traditional autoencoder is to bottleneck that representation Z.
[00:50:24] In particular, that means the vector Z in the middle is going to be much, much smaller than the input X. So your input X might be a high-resolution image, maybe a 1024 x 1024 image that we said is composed of about 3 million values, but that Z might be, say, a 128-dimensional latent code. So the model is now asked to solve this problem where I want to reconstruct the data X, but squash it through this bottlenecking representation in the middle. And we hope that this is going to force the model to learn some non-trivial structure about the data by squashing it through this representation in the middle of the network.
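The bottleneck structure can be sketched with untrained linear maps, just to make the shapes concrete. This is a minimal sketch with made-up dimensions and random weights; a real autoencoder would use deeper networks and train the weights by gradient descent on the reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

D_in, D_z = 3072, 128   # e.g. a flattened 32x32x3 image, 128-dim code

# Untrained linear encoder/decoder, only to show the bottleneck shapes.
W_enc = rng.normal(0.0, 0.01, size=(D_z, D_in))
W_dec = rng.normal(0.0, 0.01, size=(D_in, D_z))

def encode(x):      # x: (D_in,) -> z: (D_z,)   squash through the bottleneck
    return W_enc @ x

def decode(z):      # z: (D_z,) -> x_hat: (D_in,)   reconstruct from the code
    return W_dec @ z

x = rng.normal(size=D_in)
z = encode(x)
x_hat = decode(z)
recon_loss = np.mean((x - x_hat) ** 2)  # the reconstruction objective
```

Because D_z is much smaller than D_in, the network cannot simply copy its input; minimizing the reconstruction loss forces Z to capture the most useful structure of X.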
[00:51:01] And then after we do this, we can apply our normal self-supervised learning trick: you could throw away the decoder, and then use this Z to initialize some supervised model for some downstream task. The same story as the self-supervised story that we just saw. But what if we actually want to use this to generate data? Then what we'd really like to do is somehow the opposite of the self-supervised story: throw away the encoder, and instead be able to somehow sample Z's that match the kinds of Z's that the model learned to represent data as. If we had some procedure for sampling Z's that matched the data distribution in some way, then we could sample a Z, pass it through our learned decoder, and now generate a new sample, right? And now this is an implicit method, right?
[00:51:44] We said that there are no densities floating around anywhere. But if we had a way to do this, it would be a way to draw samples from the model without explicitly modeling the density in any way. The problem is that we've just kind of kicked the can down the road here, because we said we want to generate images, we want to generate X's, and we have a data set of X's. How do we do that? We said we're going to solve it by training this autoencoder, and now we have a data set of Z's and we need to sample in Z-space. That's not any easier, so we're kind of stuck. And the idea of variational autoencoders is: what if we could force some structure on the Z's?
[00:52:23] With this traditional autoencoder structure, you're not forcing the model to impose any known structure on the Z's; you're just asking it to reconstruct the data given its latent representation. But what if we had some mechanism to force the Z's to come from a Gaussian distribution, or some other known distribution? If that were the case, then at inference time, after the model is trained, we could just draw a sample from that known distribution, pass it through the decoder, and now we would have our sample. So forcing these autoencoders to be probabilistic, and enforcing a probabilistic structure on that latent space, is exactly what a variational autoencoder tries to do. Why "variational"? It's a long story; there's a lot of history around that terminology in the literature.
[00:53:08] But basically, variational autoencoders are a probabilistic spin on our traditional autoencoder. They're going to learn latent features Z from raw data, and then we'll be able to enforce a structure on that learned latent space Z such that we can sample from it at inference time, after the model is trained, and generate new samples. More concretely, we'll assume that our training data is x^i (again, note that the superscript i means these are different independent samples of X).
[00:53:40] We assume that each x^(i) was generated from some underlying latent vector z: there's some z^(i) lurking under the surface associated with every x^(i), and in the universe's procedure for generating data, first it generated the z^(i), then it generated the x^(i) from the z^(i). Everything that the universe needed to know in order to generate the image that we saw was contained in that latent vector z. But we can't see those latent vectors z; we can never observe them. We don't have a dataset of them, right? So the intuition is that x is an image, and z is some kind of latent feature representation that tells you everything you would ever need to know about that image, but you can never observe that latent vector.
[00:54:18] Um, and then after training, we could generate a sample by... oh, and the other constraint is that we're going to force those Z's to come from a known distribution. So then, after the model is trained, we can do exactly what we just said: draw a z from that known distribution, pass it through the decoder, and that's going to give us a sample. And we'll typically assume a simple prior; a unit Gaussian distribution is by far the most common. So then how do we possibly train this? This feels like an impossible problem. We want to basically train this network that's going to find a z for every x, but we can never observe the z's. This seems impossible. What are we going to do? We're going to go back to maximum likelihood, right?
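Once trained, generation is exactly the two steps just described: sample z from the known prior, then decode it. A minimal sketch of that interface, where the "decoder" is a hypothetical stand-in (a fixed random linear map, purely illustrative; a real VAE decoder is a deep network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained decoder: a fixed linear map from
# a 4-dim latent space to a 9-dim data vector (e.g. a flattened 3x3 image).
W = rng.normal(size=(9, 4))
b = rng.normal(size=9)

def decoder(z):
    # A real decoder would be a deep network outputting the mean of p(x|z);
    # this linear map only illustrates the interface.
    return W @ z + b

# Inference-time generation: draw z from the assumed prior N(0, I),
# then pass it through the decoder to get a new sample.
z = rng.standard_normal(4)
x_new = decoder(z)
```

Nothing here depends on how the decoder was trained; the whole point of forcing the latent space toward a known prior is that this sampling step becomes trivial.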
[00:54:58] If we indeed had a dataset of X's and Z's, then we could use maximum likelihood directly: use the same kind of log trick, maximize the log probability, the exact same thing that we previously saw, and train a conditional generative model p(x|z). But we don't know Z. Let's pretend we do for a moment. Because we don't know Z, we could try to marginalize, right? We know that there's some joint distribution of x and z that must exist even though we can't observe it, and in principle you could integrate out the z, marginalizing over it, to get p(x). So maybe we could pretend there's a joint distribution over x and z, marginalize out the z somehow, and still do maximum likelihood. Let's see how this works.
[00:55:43] So for this term, here we've also used the chain rule to break up that joint probability p(x, z) into p(x|z) and p(z). This p(x|z) is okay: we can compute that with our decoder here on the left, the neural network that we're hoping to train. This p(z) term is okay: we're going to assume that it's a unit Gaussian or some other simple distribution that we can compute or reason about. But this integral kills us, right? In general, we have no feasible way to integrate over the full space of a neural network's input. This p(x|z) is going to be some very complicated function modeled by a neural network; there's going to be no way that we can analytically or exactly integrate this.
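Written out, the marginalization being attempted here is (using theta for the decoder's weights):

```latex
p_\theta(x) \;=\; \int p_\theta(x, z)\, dz \;=\; \int p_\theta(x \mid z)\, p(z)\, dz
```

The prior p(z) is simple, and p_theta(x|z) is the decoder network; it's the integral over all of z-space that cannot be computed.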
[00:56:22] You can train neural networks for individual parts here, right? So the whole underlying notion, whenever you're doing this probabilistic modeling, is that we're going to write down some probabilistic terms. Hopefully some of them are simple distributions that we can write down analytically and reason about, and some of them are going to be learned neural network components. So we're kind of assuming that the probability of X given Z is going to be some neural network that we could, in principle, learn via maximum likelihood. But we're trying to write down what objective we could use to learn that neural network via maximum likelihood, and we're out of luck here because you have no way to integrate over Z. You could try to approximate that integral via some finite sampling.
[00:56:59] But in general that's probably not going to work very well, because this Z is a super high-dimensional space, and doing an approximate numerical integral in the inner loop of your training is not going to be a very good idea. So we could try something else: Bayes' rule. That's the other thing we always do in probability. So let's try Bayes' rule. With Bayes' rule we have another formula that we can use to write down p(x), right? So we can write down p(x) using Bayes' rule, in this equation on the screen. Let's see what we can do with these terms. So this p(x|z), again, we can compute that with our decoder. This p(z), okay, we assume this is Gaussian, so we can compute something with it. There are no integrals here; that's good. So we're in good shape.
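The Bayes' rule rewriting of p(x) referred to on the slide is:

```latex
p_\theta(x) \;=\; \frac{p_\theta(x \mid z)\; p(z)}{p_\theta(z \mid x)}
```

The numerator terms are the decoder and the prior; the denominator p_theta(z|x) is the posterior that causes the trouble discussed next.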
[00:57:44] But now we're out of luck: this p(z|x) term, this posterior of Z given X, we have no good way to compute. In order to compute this term you would also need some kind of integral. Out of luck; we can't compute it. What are we going to do? Okay, let's use another neural network. So the variational autoencoder trick is: there's that probabilistic term on the bottom of Bayes' rule that we can't compute, so let's just slot in another neural network to try to compute it for us. We're going to have another neural network Q, with different weights phi, that's going to learn a different conditional distribution, the probability of z given x. And the whole idea is that we want this other neural network to try to approximate the true posterior p(z|x) defined by the first neural network.
[00:58:27] And you can't really enforce this in general, but, you know, let's put a neural network there and see what we can do. [00:58:31] So then, if we could somehow have this other neural network approximating this term on the bottom that we can't compute, then we could go and compute our likelihood, do maximum likelihood, and we would all be set. So that's kind of what we do when training a variational autoencoder. We're basically going to jointly learn two different neural networks. One is the decoder, which inputs the latent code Z and outputs a distribution over the data X. The other is an encoder, which inputs the data X and outputs a distribution over the latent codes Z. And each of these is a separate neural network with its own independent weights.
[00:59:10] There's a question you might have, which is: how can you possibly output a probability distribution from a neural network? That seems confusing and hard and unclear. So the trick here is that we're going to force everything to be a normal distribution, and we're going to have the neural network output the parameters of the normal distribution. So typically for the decoder network, we're going to assume that the output distribution is a diagonal Gaussian, where the entries along the diagonal correspond to the pixels of the image. The model is going to output the mean of that diagonal Gaussian distribution, and typically for the decoder we'd assume a fixed variance sigma squared.
[00:59:49] Now for the encoder network, same idea: the model inputs the data sample x, and then it outputs the parameters of a Gaussian distribution that models the distribution q(z|x). So in this case the encoder network will output one vector, which is the mean of that Gaussian distribution, and another vector, which is the diagonal of the covariance of that Gaussian distribution. And here it's very important that we assume the diagonal structure, because otherwise we would have to model a quadratic number of entries in the full covariance matrix, right? So here, imagine an image that's H by W pixels.
[01:00:27] So you could in principle model the full covariance across every pair of pixels in the image, but that would require H² W² entries; that would be too big. So instead we'll just ignore any kind of correlation structure among the different values, and now the diagonal covariance is just a vector. So this mu of z given x and this sigma of z given x are both vectors of the same shape as z. We basically have the neural network output two vectors of the same shape and then treat them as the parameters of this Gaussian distribution. So that's how we can output a distribution from a neural network.
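A minimal numpy sketch of this interface, with a toy single-layer "encoder" standing in for a real network (all names and sizes here are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D, Z = 8, 2                     # data dim (flattened pixels), latent dim

# Toy single-layer "encoder"; a real q_phi(z|x) would be a deep network.
W = rng.normal(scale=0.1, size=(2 * Z, D))

def encoder(x):
    # Outputs the parameters of a diagonal Gaussian over z: a mean vector
    # and a per-dimension standard deviation, each of shape (Z,).
    h = W @ x
    mu = h[:Z]
    sigma = np.exp(h[Z:])       # exponentiate so the std dev is positive
    return mu, sigma

x = rng.standard_normal(D)
mu, sigma = encoder(x)
# Sampling from the predicted diagonal Gaussian q(z|x):
z = mu + sigma * rng.standard_normal(Z)
```

Note that because the covariance is diagonal, sampling is just elementwise scale-and-shift of standard normal noise; a full covariance would require a matrix square root.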
[01:01:09] If you do maximum likelihood on this thing with a fixed standard deviation, it actually becomes equivalent to L2, and that's a nice trick. And the reason you want to do that is because you could, in principle, try to model the same thing on the decoder, a separate variance for every pixel, but that would be kind of useless. If you're not modeling any covariance structure among the pixels, that would basically be saying that each pixel is allowed to vary a little bit, and the amount that each pixel is allowed to vary depends on the pixel. And then sampling from that distribution would basically amount to fixing the mean and then adding per-pixel independent noise scaled by the per-pixel variances, and that would not be a sensible thing to do.
[01:01:51] So in general, for the decoder, you kind of cheat a little bit: you pretend it's outputting a probability distribution, but in practice we're never going to sample from that distribution; instead we always output the mean. Does that make sense? Yeah. And then it turns out, if you write this down, that that constant sigma squared just comes off as a constant in front. In practice, maximizing the log likelihood of a Gaussian distribution with a fixed variance along the diagonal is equivalent to minimizing the L2 distance between the mean and the X, which is kind of nice. Yeah, good question: is there some kind of weird invariance or non-invariance structure here, with the pixels shifting?
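The equivalence mentioned here can be written down directly: for a D-dimensional diagonal Gaussian with mean mu_theta(z) and fixed variance sigma squared,

```latex
\log \mathcal{N}\!\left(x;\; \mu_\theta(z),\; \sigma^2 I\right)
  \;=\; -\frac{1}{2\sigma^2}\,\lVert x - \mu_\theta(z) \rVert_2^2
        \;-\; \frac{D}{2}\log\!\left(2\pi\sigma^2\right)
```

Since sigma is fixed, the second term is a constant, so maximizing this log likelihood over the network's output is the same as minimizing the L2 distance between x and the predicted mean.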
[01:02:30] That would be more a property of the architecture that you choose to build the neural network. So you could try to build some invariance or equivariance properties into the network architecture that's predicting these. But yeah, you're right that in general that's not accounted for at the loss level here. [01:02:50] Okay, so now we've got this idea. We've got an encoder and a decoder: one inputs X and outputs a distribution over Z; the other inputs Z and outputs a distribution over X. What's our training objective? And here's the one slide where we're going to do some math. But we'll see. So here, the idea is that we want to do maximum likelihood. That's usually the single thing that we want.
[01:03:13] That's the guiding principle behind a lot of objectives in generative modeling. So we want to maximize log p(x), and then we can use Bayes' rule to write that as the log of this Bayes' rule expression. All right, this is an exact equivalence. Now we're going to do something silly: we're going to multiply the top and bottom of this by our q(z|x). Remember, we just introduced another neural network Q, out of nowhere, that was modeling this other distribution q(z|x). Now we're going to multiply that density term on the top and bottom of this Bayes' rule expression. Then we're going to do some logarithms. And if you have some foresight, you'll for some reason decide to rearrange these terms in this particular order.
[01:03:54] And I've color-coded them so you can later go and track which term went where. But we do some logarithms and break this up into three separate terms. Now you need to make another magical observation, which is that this p(x) actually does not depend on z. Right, so far this sequence of three terms is all an exact equivalence; these are all exact equalities. So even though there's a z in this expression, it actually doesn't depend on z, because all the z's would cancel out. And if you have something that doesn't depend on Z, you can always wrap an expectation over Z around that thing. So in this case, we know that this is p(x), and we can always feel free to wrap an expectation, over z sampled according to any distribution that we want, around p(x).
[01:04:38] And because that internal thing does not depend on Z, this is always true for any distribution that we might choose to take this expectation over. Okay. So then, because expectation is linear, we can apply that expectation to each of these three terms upstairs. And now we have these three terms, each of which looks very mysterious. But if you have a lot of intuition about probability, if you memorized all those formulas that you may have seen in an earlier statistics or probability course, maybe you could learn to recognize some of these. So this first one we're going to carry down as it was before, and these second two are actually KL divergence terms.
[01:05:20] So the KL divergence is a kind of measure of dissimilarity between probability distributions, and it just so happens to have exactly the definition of these latter two terms. So we can rewrite this exactly as this first term, which is this expectation (we'll talk about it), plus these two other KL terms. These two KL terms are basically measuring discrepancy or dissimilarity between the different probability distributions that we have floating around on this slide. And now these all look kind of crazy, but if we stare at each of these terms, we can actually recover an interpretable meaning for each of the three. This first one is actually a data reconstruction term.
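One standard way to write out the three-term identity being described, with the conventional signs (the slide's color-coded terms may orient each log-ratio differently, but the content is the same):

```latex
\log p_\theta(x)
  \;=\; \underbrace{\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
  \;-\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\Vert\, p(z)\big)}_{\text{prior}}
  \;+\; \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\Vert\, p_\theta(z \mid x)\big)}_{\text{posterior}}
```

Here q_phi(z|x) is the encoder, p_theta(x|z) is the decoder, p(z) is the assumed prior, and p_theta(z|x) is the intractable true posterior.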
[01:06:02] If we walk through what this is saying: we're going to sample a Z, and the way we're going to sample the Z is from Q of Z given X, which is our encoder. So we're going to take our X and pass it to the encoder; the encoder is going to predict a distribution Q of Z given X. Then, from that predicted distribution, we're going to sample a Z. Then we're going to take an expectation over all such Z and maximize the log probability of X given Z. So this is basically a data reconstruction term. It's saying that if we take a data point X, run it through the encoder to get a distribution over Z, and then pass any sample of that predicted distribution over Z into the decoder, we're going to recover X. So this is a kind of data reconstruction term. The middle one is a prior term.
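In practice, this expectation is usually estimated with a single Monte Carlo sample of Z per data point, and when the decoder predicts a Gaussian over X, the log probability being maximized reduces (up to constants) to a squared-error term. A minimal NumPy sketch; the function name is my own:

```python
import numpy as np

def gaussian_log_prob(x, x_hat, sigma=1.0):
    """log N(x | x_hat, sigma^2 I), summed over dimensions.

    Maximizing this in x_hat is equivalent, up to additive constants,
    to minimizing the squared reconstruction error ||x - x_hat||^2.
    """
    d = x.size
    return (-0.5 * np.sum((x - x_hat) ** 2) / sigma**2
            - 0.5 * d * np.log(2.0 * np.pi * sigma**2))

# One-sample Monte Carlo estimate of E_{z~q(z|x)}[log p(x|z)]:
# draw one z from the encoder's q(z|x), decode it to x_hat,
# then score gaussian_log_prob(x, x_hat).
```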
[01:06:47] This is measuring the KL divergence between Q of Z given X and P of Z. Remember, Q of Z given X is the encoder: it inputs the data X and outputs a distribution over the latent space Z. So that's the encoder's predicted distribution over the latent space, and this other term, P of Z, is the prior that we assumed for the latent space, usually a diagonal Gaussian. So this term is basically saying: the model is predicting distributions of Z given X, and we want those predicted distributions to match the simple Gaussian prior that we'd previously chosen. It's just measuring how much the latent space learned by our model matches the prior. And this third term gets us in trouble. This third term is Q of Z given X.
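When the encoder outputs a diagonal Gaussian and the prior is a unit Gaussian, this prior term has a well-known closed form, so no sampling is needed to compute it. A NumPy sketch; the function name is my own:

```python
import numpy as np

def kl_to_unit_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    summed over the latent dimensions:
        0.5 * sum(sigma^2 + mu^2 - 1 - log(sigma^2))."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

The term is zero exactly when the predicted distribution already is the unit Gaussian (mu = 0, sigma = 1), and positive otherwise.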
[01:07:36] So that's the predicted distribution over Z given the input data X to the encoder. And how much does that match P of Z given X? That's the flipped-around distribution of what the decoder is modeling. And with this one, we're out of luck: we cannot compute this term, because remember, what got us into trouble in the first place was this P of Z given X. The whole reason we introduced Q was because we could not compute this P of Z given X. So now what do we do? We're going to throw it away, because we know that KL divergences are always greater than or equal to zero. So we know that this last term, because it's a KL divergence of two distributions, must be greater than or equal to zero even though we cannot compute those distributions in general; that's a well-known property of KL divergences.
[01:08:25] So we can throw it away and get a lower bound on the true probability. If we throw away that last term, then we know that log P of X is greater than or equal to the other two terms: our reconstruction term and our prior term. So this will be the loss that we use to train our variational autoencoder. The idea is that this is an approximation to the true log likelihood; it's a lower bound on the log likelihood. So if we maximize the lower bound, hopefully that will also maximize the true log likelihood, even though we're not doing it exactly. That's our training objective for variational autoencoders. So that's kind of the summary.
[01:08:58] Um, you know, you're going to jointly train an encoder Q and a decoder P to maximize what's called a variational lower bound on the true data log likelihood. This is also sometimes called the evidence lower bound, or ELBO; we're going to maximize the ELBO. And it has this particular form, with this encoder network and this decoder network. So then, to walk through what the training procedure looks like more explicitly: we're going to have this neural network encoder that inputs the X and outputs the distribution over Z. Then we're going to apply this KL term to the predicted distribution, and in particular, this is going to force the predicted distribution to be unit Gaussian.
[01:09:39] So it's basically going to encourage the predicted mean to be zero and the predicted diagonal sigmas to be all ones. Then, once we get the predicted distribution from the encoder, we're going to sample from it using this so-called reparameterization trick that allows you to backprop through this. We draw a sample Z from the predicted distribution, run it through the decoder to get the normal distribution predicted by the decoder, and then apply the reconstruction term of the loss to the output of the decoder. So even though this looked like a large, scary slide of math, it actually led to a not-too-crazy training objective for this thing.
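The reparameterization trick mentioned here fits in a few lines: instead of sampling z directly from N(mu, sigma^2), you sample fixed noise eps ~ N(0, I) and compute z deterministically from mu and sigma, so gradients can flow through both. A NumPy sketch; names are my own:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps, with eps ~ N(0, I).

    The randomness is isolated in eps, which does not depend on the
    network's parameters, so backprop can flow through mu and log_var.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
z = reparameterize(np.zeros(8), np.zeros(8), rng)  # one latent sample
```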
[01:10:25] And I think the variational autoencoder is actually very interesting, because these two losses fight against each other in a very interesting way. We're basically forcing the model to bottleneck through this latent space Z, and the two terms want different things from the latent space. The reconstruction loss kind of wants the sigmas to be zero and the mus to be a different, unique vector for each data point X, because if that were the case, then we could perfectly satisfy the reconstruction objective: we would have a separate, unique vector for every data point, there would be no probability in there, and we could perfectly reconstruct everything. So that's what the reconstruction loss wants.
[01:11:03] But the prior loss actually wants the sigmas to be all one, because it wants the distribution to be unit Gaussian, and it wants all the mus to be zero, which is very different from what the reconstruction loss wants. So in the process of training a VAE, you're asking these two losses to fight against each other, to find some equilibrium between reconstructing your data well and forcing your latent space to be close to your prior. And then once you've trained it, you can sample Z from your prior, run it through the decoder, and get a sample. Another nice thing is that because your latent space was diagonal Gaussian, there's also a notion of statistical independence across the different entries in your latent space Z. So you can vary them separately, and maybe those separate dimensions encode something useful or interpretable or orthogonal about your data.
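Generation after training is just the second half of the model: draw z from the unit-Gaussian prior and push it through the decoder. A toy sketch, where the linear-plus-sigmoid "decoder" is only a stand-in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "decoder": a fixed random linear map from a 2-D latent space
# to a 784-D (28x28) image, squashed to (0, 1) pixel values by a sigmoid.
W = rng.standard_normal((2, 784))

def decode(z):
    return 1.0 / (1.0 + np.exp(-(z @ W)))

# Sampling: z ~ N(0, I) from the prior, then decode.
z = rng.standard_normal(2)
x_sample = decode(z)
```

Varying the entries of z independently just means sweeping each coordinate of z while holding the others fixed, which is how the digit-morphing grids are made.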
[01:11:47] So in this case, we took a VAE and trained it on a dataset of handwritten digits, and you can see that as we vary two dimensions of the latent space, the digits smoothly morph from one category into another; this is a pretty common property of VAEs. So that's basically it for today. To recap: we talked about supervised versus unsupervised learning, we talked about these three different flavors of generative modeling, and then we talked about one branch of this family tree of generative models. So next time we're going to come back and talk about the other half of the family tree of generative models, in particular generative adversarial networks and diffusion models.
================================================================================ LECTURE 014 ================================================================================ Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 14: Generative Models 2 Source: https://www.youtube.com/watch?v=Edr4uZFh4EE --- Transcript [00:00:05] So last time we were talking about generative models, and we started off with some discussion of generative versus discriminative models. Recall that these are both different flavors of probabilistic models, but they differ in what we're trying to predict, what we're conditioning on, and, really critically, what we're normalizing over. So we talked about discriminative models, where you're trying to predict the label Y conditioned on your data X; generative models, where you're trying to learn a probability distribution over your data X; and conditional generative models, where you want to model the data X conditioned on some user input or label Y.
[00:00:39] And recall that these differ in what you're normalizing over, because probability distributions introduce this normalizing effect: different kinds of things need to compete for probability mass, due to the normalization constraint of probability distributions. And last time we also went through a taxonomy of different categories of generative models, because it turns out this area of generative modeling is something people have studied for a very long time, and they have come up with a lot of different categories of methods to try to solve variants of these problems.
[00:01:08] So we went through this family tree of generative models, where last time we talked about these explicit density models, where the model outputs some quantity P of X: either the exact predicted P of X, in the case of tractable density models, or some approximate version of P of X, in the case of approximate density models. In the case of tractable density, we saw autoregressive models as one category, and we saw variational autoencoders as an example of something that gives you an approximate density. [00:01:40] So recall that for autoregressive models, what we did is take our image, or more generally whatever kind of data we're working with, and break it up into a sequence.
[00:01:50] And for the case of image data, we typically treat this as a sequence of pixel values, or even sub-pixel values. We usually want these to be discrete, so you treat those sub-pixel values as 8-bit integers that can each take a value 0 to 255. You string this out into a long sequence of integers and then model it with some discrete autoregressive sequence model, typically an RNN or a transformer. [00:02:13] Then we also saw the variational autoencoders, which were another explicit density model, but they compute not the exact density but some approximation to it, in particular a lower bound to the density.
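The serialization step described here is just flattening the image into raster order so that each 8-bit value becomes one token for the sequence model. A minimal NumPy sketch; the tiny array shape is only illustrative:

```python
import numpy as np

# A tiny 2x2 RGB "image" of 8-bit integers, standing in for a real photo.
image = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)

# Flatten into a 1-D sequence of tokens, each in 0..255, in raster order.
# An RNN or transformer then models p(seq[t] | seq[:t]) over this sequence.
seq = image.reshape(-1)

# The mapping is invertible: reshaping recovers the original image.
restored = seq.reshape(2, 2, 3)
```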
[00:02:26] To do this, we jointly trained an encoder network, which inputs the data X and outputs a distribution over latent codes Z, and a decoder network, which inputs a latent code Z and outputs a predicted piece of data X. We were able to jointly train these two networks, the encoder and the decoder, to maximize this variational lower bound on our likelihood function. Recall that maximum likelihood is one of the key insights behind all generative modeling: often, our objective function for training generative models is somehow to maximize the likelihood of the data that we observe, which comes from our true data distribution. [00:03:05] So today we're going to continue our discussion of generative models and explore the other half of the family tree: these implicit density models.
[00:03:12] In implicit density models, we are no longer going to get access to an actual density value P of X, but these models will implicitly model the probability distribution. Even though we can't compute a density value P of X for any piece of data X, we will be able to sample from the underlying distribution that these models learn. So we'll be able to draw samples from the learned distribution even if we can't output an actual density value. The first such model that we'll explore is the generative adversarial network, usually called a GAN. And it's useful to contrast GANs with the variational autoencoders and autoregressive models that we've seen so far. Like we just said, autoregressive models are a likelihood-based method; their training objective is maximum likelihood.
[00:03:56] So you write down this parameterized function that is your P of X, where X is a piece of data, and then you maximize it over the data that you observe, doing maximum likelihood. Variational autoencoders follow a similar idea, where we write down an approximation to P of X and then maximize that approximation. Now, generative adversarial networks will do something a little bit different. They will give up on directly modeling that P of X, as we just said. But even though they don't explicitly model the P of X or let us get out those density values, they will give us some way to sample from the underlying distribution that the model is fitting. [00:04:32] So the setup here is that we'll start by having some finite samples of data x_i, which are assumed to be drawn from some true data distribution p_data.
[00:04:42] And our goal is to be able to draw samples from p_data. Recall that p_data is something like the true distribution of the universe: this is the distribution that the universe uses to give you samples of your data, and it is likely a very complicated distribution. It involves physics, it involves history, it involves social and political constraints, maybe, right? There's a lot of complication that goes into all the stuff happening in the universe that gives rise to the data that you see. And somehow we want to fit some approximate model that tries to match that true data distribution as well as possible, and then allows us to draw new samples from our fitted distribution that look like the original data samples that we observed. [00:05:21] So the way that we're going to do this is by introducing a latent variable Z.
Um this looks kind of like the latent [00:05:25] Z. Um this looks kind of like the latent variable Z that we saw in um variational [00:05:27] variable Z that we saw in um variational autoenccoders where it's going to give [00:05:30] autoenccoders where it's going to give where the this latent variable Z is [00:05:31] where the this latent variable Z is going to be distributed according to [00:05:33] going to be distributed according to some known prior distribution P of Z [00:05:35] some known prior distribution P of Z that we will write down and control [00:05:37] that we will write down and control ourselves and usually this is going to [00:05:38] ourselves and usually this is going to be a unit gausian or or a uniform [00:05:41] be a unit gausian or or a uniform distribution but typically a unit [00:05:42] distribution but typically a unit gausian something very simple that we [00:05:44] gausian something very simple that we know how to sample from we know the [00:05:45] know how to sample from we know the analytical properties of um and now the [00:05:48] analytical properties of um and now the setup is that we're going to um imagine [00:05:50] setup is that we're going to um imagine some data generating process that our [00:05:52] some data generating process that our network is going to model. So here um [00:05:55] network is going to model. So here um we're going to imagine that we sample a [00:05:57] we're going to imagine that we sample a Z according to our known distribution P [00:05:59] Z according to our known distribution P of Z um to get a sampled Z, pass that [00:06:02] of Z um to get a sampled Z, pass that sampled Z through a generator network [00:06:04] sampled Z through a generator network that G of Z um and then that X is going [00:06:07] that G of Z um and then that X is going to be a sample from some generator [00:06:09] to be a sample from some generator distribution PG. Um and as we vary the [00:06:12] distribution PG. 
[00:06:12] And as we vary the parameters, the architecture, or the training of our generator network, that induces different distributions p_G that we sample from. The whole goal of GAN training is to force this p_G distribution, which is induced by our generator network, to match the true p_data distribution as closely as possible. Because if they match, then we can sample a z, pass it through our generator, and get a sampled piece of data that looks a lot like p_data. [00:06:43] So the picture is something like the following: we sample z from p(z) to get a concrete latent vector, pass it through G, and that gives us a generated image. The generator network is basically trained to convert a sample from a known distribution over z into a sample from our data distribution.
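The sampling pipeline just described — draw z from a simple known prior and push it through G to induce a distribution p_G over outputs — can be sketched in a few lines. Everything concrete below (the 2-D latent, the fixed affine map standing in for a trained generator network) is a made-up illustration, not the lecture's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(n, dim=2):
    # Draw latents from the known prior p(z): here a unit Gaussian,
    # which is easy to sample from and analytically simple.
    return rng.standard_normal((n, dim))

def generator(z):
    # Stand-in "generator": a fixed affine map. A real G is a learned deep
    # network; the point is only that pushing p(z) through G induces a new
    # distribution p_G over the outputs.
    W = np.array([[2.0, 0.0], [0.0, 0.5]])
    b = np.array([1.0, -1.0])
    return z @ W + b

z = sample_z(10_000)
x = generator(z)           # samples from the induced p_G
print(x.mean(axis=0))      # close to b = [1, -1] for this toy G
```

Changing W and b changes the induced p_G, which is the sense in which varying the generator's parameters varies the distribution you end up sampling from.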
[00:07:02] But now the question is: how can we force these outputs, how can we force the induced generator distribution p_G, to match the data distribution p_data? The trick in generative adversarial networks is to introduce another neural network to do that task for us. In the previous flavors of generative modeling, VAEs and autoregressive models, we tried to write down some objective function that we could minimize to force our fitted distribution to match the data distribution. Here we relinquish that control and ask another neural network to solve that task for us. In particular, we're going to train another neural network called the discriminator D.
[00:07:42] And this discriminator is tasked with inputting an image, sometimes a real image and sometimes a fake image, and classifying whether that image is fake or real. The idea is that these two networks are going to fight: we train the generator to try to fool the discriminator, and we train the discriminator as a classification model to correctly discriminate, or classify, between real data and fake data. The intuition is that as these two networks fight, the discriminator gets better and better at picking out the features that separate real data from fake data.
[00:08:17] And once the discriminator gets really good, then in order to fool it into classifying generated samples as real, the generator will need to get closer and closer to producing samples that look like true data. So that's the intuition behind generative adversarial networks. [00:08:32] Question from the audience: does the generator network get feedback from the discriminator on whether it's classifying correctly? Yes, and that's crucial for the whole process to work. The feedback it gets is gradients: this whole composite system of generator and discriminator is just neural networks, we know how to compute gradients through them, and they communicate through the generated image. So we back-propagate from the discriminator all the way through the generated image into the generator.
[00:08:58] So that's how the generator is going to learn from the discriminator. [00:09:01] More concretely, we need to write down some actual equations, some actual math, to concretize this intuition. In particular, we're going to jointly train the generator G and the discriminator D with a minimax game. The equation may look a little daunting, so we'll walk through the terms one by one. We'll color-code it: the generator in blue, the discriminator in red. The discriminator is a function that inputs a piece of data x and outputs the probability that the data is real. In particular, D(x) = 0 means the discriminator has classified that piece of data x as fake.
[00:09:42] D(x) = 1 means the discriminator has classified that piece of data as real. Of course those are the extreme cases; in practice the discriminator will output some probability in between, a soft version of those two decisions. [00:09:53] Now imagine that we fix the generator G and consider this problem purely from the perspective of the discriminator. From the discriminator's perspective there are two terms. The first term says the discriminator wants D(x) = 1 for real data; remember, D(x) = 1 means the discriminator says the input is real. The expectation says we draw data samples x from the true p_data distribution.
[00:10:22] We pass those through the discriminator and then take a log, because we almost always work in log space with probabilities. Remember that log is a monotonic function, so maximizing log D(x) is the same as maximizing D(x). So this term says we want to maximize log D(x) for real data, which is equivalent to wanting D(x) = 1 for real data. [00:10:46] On the other side, we take an expectation by sampling latents z according to our known prior p(z), pass those z's through the generator to get generated data samples, and pass those generated samples through the discriminator. Now the discriminator wants to classify these as fake, because they are fake samples.
[00:11:08] Since the discriminator wants to classify these as fake, we need to somehow invert the expression on the left. Here we want D(x) = 0 for fake data, and one way to express that is to maximize log(1 - D(G(z))). So the term on the right says the discriminator wants D(x) = 0 for fake data, and the term on the left says the discriminator wants D(x) = 1 for real data. That's all the discriminator is trying to do: correctly solve the classification task between generated samples and real samples from our dataset, labeling them as real or fake. [00:11:44] Now look at this from the perspective of the generator: imagine fixing the discriminator and considering this setup only from the generator's side, with a fixed discriminator.
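The two discriminator-side terms just walked through add up to the minimax value V(G, D) = E_{x~p_data}[log D(x)] + E_{z~p(z)}[log(1 - D(G(z)))]. As a sanity check, here is a tiny Monte Carlo estimate of V given batches of discriminator outputs; the batch values are invented for illustration:

```python
import numpy as np

def V(d_real, d_fake):
    # Monte Carlo estimate of the minimax value:
    #   V(G, D) = E_{x~p_data}[log D(x)] + E_{z~p(z)}[log(1 - D(G(z)))]
    # d_real: D's outputs on real samples; d_fake: D's outputs on G(z).
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A near-perfect discriminator (D ~ 1 on real, D ~ 0 on fake) pushes V
# toward its maximum of 0:
print(V(np.full(4, 0.999), np.full(4, 0.001)))   # ~ -0.002

# A maximally confused discriminator (D = 0.5 everywhere) gives
# V = 2 * log(0.5) ~ -1.386, i.e. -log 4, which is the value the original
# GAN analysis derives at the optimum where p_G = p_data:
print(V(np.full(4, 0.5), np.full(4, 0.5)))
```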
[00:11:54] Now in this case, the first term doesn't depend on the generator at all, because that term was just about getting the discriminator to correctly classify the real data samples. So the generator only cares about the term on the right. Intuitively, we want the generator to fool the discriminator into thinking its samples are real, which means the generator wants D(x) = 1 for fake data. The term is the same: we draw a sample z according to p(z), pass it through the generator to get a generated sample, and pass that through the discriminator to get the discriminator's predicted probability on that sample. [00:12:30] And now recall that the generator wants D(x) = 1.
[00:12:36] So rather than maximizing this term like the discriminator wanted to, we're instead going to minimize it from the perspective of the generator. [00:12:43] And that gives us this minimax game. In particular, we can abstract away all this math by writing it as some scalar function V of G and D. Then the discriminator wants to maximize V, the generator wants to minimize V, and they fight against each other in that way. To optimize this, we run a gradient-based loop, taking alternating steps to maximize and minimize this value with respect to the parameters of the discriminator and the generator.
[00:13:14] So: loop forever. First update D: take the derivative of V with respect to D's weights and step in the plus direction of that gradient, because remember the discriminator is trying to maximize this, so we do gradient ascent on this term. Then, after updating the discriminator weights, update the generator weights: take the derivative of V with respect to the generator weights and take a gradient descent step on V, because the generator wants to minimize that objective. [00:13:44] So that's basically how we train generative adversarial networks: we've got this thing V, the value of our minimax game, and we take alternating gradient ascent and gradient descent steps on that objective V,
[00:14:00] in order to alternately update the generator and the discriminator. [00:14:05] One thing that's really important to realize when training generative adversarial networks is that this V is not a loss function. The value of V basically does not tell us anything about how well the generator and discriminator are solving this problem, or really about the thing we care about: how well the induced p_G distribution matches the data distribution. Just looking at the value of V doesn't tell us anything about that, because the value of V depends on how good the discriminator is. If the discriminator is really bad, then it's really easy for the generator to fool it and get good numbers; and if the discriminator is really good, then the
generator has to be really good. So there can be different settings of D and G that lead to the exact same value of V. [00:14:52] That means generative adversarial networks are often really hard to train, and it's even hard to tell when they are doing a good job at training. Normally when you train a neural network you have a loss: you minimize the loss with respect to the parameters of your network, and you want to see that loss go down over the course of training. You don't have that with generative adversarial networks. You have the generator loss and the discriminator loss, and you can try to plot them, but in general they're pretty meaningless. So generative adversarial networks are really hard to train. For one, this objective is fundamentally unstable.
[00:15:26] You're trying to jointly maximize and minimize the same quantity with respect to different sets of parameters of the network, so that's inherently a difficult optimization problem. And even worse, you don't have any value to look at that tells you whether you're making progress towards a good solution. So generative adversarial networks are pretty effective, but they're really hard to train, really hard to tune, and really hard to make progress on. [00:15:48] So that's the main takeaway for generative adversarial networks. There is one little trick that's useful to think about when training GANs: imagining the training dynamics of these things.
[00:16:02] Imagine the very beginning of training: your generator is randomly initialized and your discriminator is randomly initialized. What's going to happen? At the very start of training, the generator is producing completely random noise, and that noise looks very different from real images. So at the very beginning of training, when the generator is terrible, the discriminator has a very easy problem. Typically, within a couple of iterations, the discriminator can almost immediately learn to tell real images apart from the totally garbage random fake images that the untrained generator is giving you.
So that means that at the very start of training, the discriminator quickly learns to classify real versus fake with pretty high probability. [00:16:44] So it's interesting to plot this generator term as a function of D(G(z)), because that's basically the loss function from the perspective of the generator. From the generator's perspective, at the beginning of training we are somewhere all the way over here: the discriminator is doing a really good job of classifying generated samples as fake. That means the generator is trying to optimize a loss function that looks something like this.
[00:17:17] And if you notice something, this loss function is flat, or very close to flat, in the region where the generator is trying to optimize its parameters. So in practice, when you use this naive objective for training GANs, the generator has a really hard time learning at the beginning of training. [00:17:34] Good question from the audience: how do you assemble the dataset? How do we generate a photo of a unicorn if no unicorn exists? Well, this p_data is your choice: whatever dataset you happen to assemble as your training set, that's you choosing the p_data you're trying to model. So in general, if you want to generate a sample of something that looks nothing like anything you've ever seen before, you're out of luck. It's not going to happen.
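To put a number on the flatness mentioned a moment ago: near D(G(z)) = 0 the generator's curve log(1 - D(G(z))) has slope only about -1, while near D(G(z)) = 1 it is steep, so most of the gradient lives where the generator is already winning. The comparison below also shows the "non-saturating" heuristic from the original GAN paper (maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))), which the lecture has not introduced at this point, so treat it as an aside:

```python
def grad_saturating(d):
    # d/dd of log(1 - d): the generator's term in the minimax objective.
    return -1.0 / (1.0 - d)

def grad_non_saturating(d):
    # d/dd of -log(d): minimizing -log(d) is the same as maximizing
    # log D(G(z)), the "non-saturating" alternative.
    return -1.0 / d

# Early in training the discriminator easily spots fakes, so d ~ 0:
d = 1e-3
print(grad_saturating(d))       # about -1.001: nearly flat, weak signal
print(grad_non_saturating(d))   # about -1000: strong signal where G is losing
```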
[00:17:59] So the only way you're generally going to draw such samples is if you have something in your training dataset that looks kind of like them. These networks, like all generative models, and really all neural networks, do generalize a little bit. So the hope is that maybe you've never seen a photorealistic image of a unicorn wearing a Santa Claus hat, but you've seen photorealistic images of horses, you've seen photorealistic images of Santa Claus hats, you've seen drawings of unicorns, and you've seen drawings of horses. Even if you've never seen the exact composition of attributes you want to generate, you've seen enough things close enough to it that the model can generalize and give you something new. That's always the hope here. What does this look like from the discriminator's perspective?
[00:18:39] Well, if I generated this image of a photorealistic unicorn wearing a Santa Claus hat, then maybe all the textures are really perfect: the lighting is perfect, the shadows are perfect, the leaves are perfect. So there's no real evidence in that sample itself to say that this is obviously wrong. Maybe if the discriminator was really, really smart, then it could somehow know that unicorns don't really exist, and that a perfectly photorealistic image of one isn't likely to happen. But that's a pretty hard semantic problem to solve. So in practice, discriminators tend not to really be that smart. Yeah, good question. The idea is: why don't we look at two curves? Why don't we look at one curve saying how good the discriminator is,
[00:19:18] and another curve saying how good the generator is? Feel free to plot them. They tend to look really useless. And there are probably literally hundreds of research papers of people trying to solve this problem and figure out: how do we tweak the GAN objective? How do we not use a log? How do we use a Wasserstein something-or-other? People put all kinds of crazy stuff into this to make those curves more interpretable. Hundreds of papers written about it, like five years of thousands of people's time, and I don't think anybody came up with a good solution. So still, even after hundreds or thousands of papers on training GANs, a lot of people still end up using this vanilla formulation, which tends not to give you very interpretable curves even when you split it up that way.
[00:19:57] The question is: is what happens to the discriminator early in training really important? And the answer is no, because this is unlike any other classification problem we've ever seen before: this is a non-stationary distribution, right? When you train an image classifier on ImageNet or CIFAR or something like that, the data set is fixed, and the model is just trying to classify that static data set. But in the case of GAN training, the data set that it's trying to fit is changing under it during the course of training. Even at the beginning of training, maybe the generated images look really bad and it's easy to solve the problem, but then the generator gets better, and now the data set that the discriminator is trying to discriminate changes under it during the course of training.
[00:20:35] So that means it's a non-stationary problem, with very complicated learning dynamics. Yeah, good question: do these get caught in local minima? Are there ways to kick them out of local minima, train for a while, kick them out? I think, again: hundreds of papers, thousands of papers, lots of heuristics, nothing really stuck. Correct, so you have to train this end to end. Gradients from the discriminator always propagate into the generator, in particular through this term on the right. So the only way that you're ever getting gradients onto the generator's parameters is actually through the discriminator, right? I mean, unless you have some regularizer in here, there's no auxiliary term telling the generator what to do other than the gradients that get through the discriminator.
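To make concrete the point that the generator's only learning signal arrives through the discriminator, here is a toy one-dimensional GAN with hand-derived gradients. This is an illustration I'm adding, not the lecture's code: the linear generator, the logistic discriminator, and all the constants are assumptions. Notice that both generator updates are multiplied by the discriminator weight `w`, so if the discriminator carried no information, the generator would learn nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Toy 1-D GAN: real data ~ N(2, 0.5), generator G(z) = g1*z + g0,
# discriminator D(x) = sigmoid(w*x + b).
g1, g0 = 1.0, 0.0   # generator parameters
w, b = 0.1, 0.0     # discriminator parameters
lr = 0.05

for step in range(2000):
    x_real = 2.0 + 0.5 * rng.standard_normal()
    z = rng.standard_normal()
    x_fake = g1 * z + g0

    # discriminator: gradient *ascent* on log D(x_real) + log(1 - D(x_fake))
    d_real = sigmoid(w * x_real + b)
    d_fake = sigmoid(w * x_fake + b)
    w += lr * ((1 - d_real) * x_real - d_fake * x_fake)
    b += lr * ((1 - d_real) - d_fake)

    # generator: gradient *descent* on -log D(G(z)) (the modified loss)
    d_fake = sigmoid(w * x_fake + b)
    g1 -= lr * (d_fake - 1.0) * w * z   # gradient reaches g1 only through w
    g0 -= lr * (d_fake - 1.0) * w       # likewise for g0

print(g0, g1)  # g0 tends to drift toward the data mean of 2
```

Even on this toy, the two-player dynamics oscillate rather than settle cleanly, which previews the instability discussed throughout this section.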
[00:21:13] And that's again leading to part of the unstable learning problem. Correct, that P data distribution is going to stay fixed over the course of training. All right. So we said there's this problem that the generator gets low gradients early in training. There's a little hack here where, rather than trying to maximize log of 1 minus D of G of z, you can instead minimize minus log of D of G of z. You can convince yourself offline that those are roughly equivalent, but the TL;DR is that it gives you a better curve, so the generator gets better gradients at the start of training. So that's really important.
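A quick numeric sanity check (my own illustration, not from the lecture) of why the swap helps: differentiate both generator losses with respect to d = D(G(z)) and evaluate where the discriminator confidently rejects a fake, i.e. d close to 0.

```python
def grad_saturating(d):
    # derivative of log(1 - d) with respect to d (the original generator objective)
    return -1.0 / (1.0 - d)

def grad_nonsaturating(d):
    # derivative of -log(d) with respect to d (the modified generator objective)
    return -1.0 / d

d = 1e-3  # early in training: D easily spots fakes, so D(G(z)) is tiny
print(abs(grad_saturating(d)))     # ~1.001: almost no learning signal
print(abs(grad_nonsaturating(d)))  # 1000.0: strong learning signal
```

So the modified loss is steep exactly where the original one is flat, which is why the generator can start learning even while the discriminator is winning.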
[00:21:49] And whenever you're training GANs from scratch using this sort of log objective, this trick of using the modified loss for the generator is really important in practice. So that means there actually is one V that you're computing for the generator and a different V that you're computing for the discriminator, and they aren't quite the same. Okay, there's another question of why this might be a good objective. I used to have slides that walked through this proof step by step, but I don't think we have time for that today, so I'll just give you the TL;DR and refer you to something offline to check. The TL;DR is that this objective is good because you can write down the optimal discriminator, right?
[00:22:28] So this is a nested optimization problem, where there's an inner maximization over D and an outer minimization over G. So if you do a little bit of math, you can actually solve this inner maximization problem and write down what the optimal discriminator is. This is the optimal discriminator with respect to a particular generator G. And you can just write this down, although even though you can write it down, you can never compute it, because it depends on P data, and you can never compute P data: if you had access to the P data density, you'd be done. So you can write this down as an equation on a slide or on a piece of paper, but you can never compute it.
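Written out, following the standard GAN analysis (same objects as in the lecture, with p_data the data density and p_g the density induced by the generator), the optimal discriminator for a fixed G, and the value of the objective once it is substituted back in, are:

```latex
D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)},
\qquad
\max_D V(G, D) = -\log 4 \;+\; 2\,\mathrm{JSD}\!\left(p_{\text{data}} \,\|\, p_g\right)
```

Since the Jensen-Shannon divergence is nonnegative and zero only when the two distributions are identical, the outer minimization over G attains its unique optimum exactly when p_g = p_data, which is the result being summarized here.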
[00:23:06] And then once you have maximized this inner objective by writing down the optimal discriminator, you can show that the outer objective is minimized if and only if PG of x is equal to P data. So at least theoretically, the optimum state of both the discriminator and the generator occurs uniquely when PG is equal to P data. So that kind of makes us feel good, but there are a lot of caveats to that theoretical result. One is that it assumes infinite capacity for both G and D: it assumes that your generator and discriminator can in principle represent any function, which of course they can't, because they are neural networks of a fixed size and capacity.
[00:23:48] This also tells us absolutely nothing about whether or not we will converge to that solution. So even though there is this optimum point in the objective landscape, this tells us absolutely nothing about whether we can ever reach it via this gradient descent / gradient ascent, especially with a finite number of data samples. So there is this sort of comforting theoretical result, some theoretical justification for GANs, but in practice it doesn't really hold or give us very strong guarantees. So for these GANs: in practice, your generator G and your discriminator D are both going to be parameterized as neural networks. And they used to be CNNs; GANs kind of fell out of favor before ViTs became popular, but I'm sure they would work with ViTs as well.
[00:24:33] And the first GAN that really gave non-trivial results was called DCGAN, which had this five-layer convnet architecture that gave what were, at the time, pretty exciting samples. And I mention DCGAN because of the first author, Alec Radford. You know, for most people, doing DCGAN would have been a highlight of their career. But for Alec Radford, it wasn't nearly enough, because the next project he worked on right after DCGAN, does anybody know? GPT? GPT. So for Alec Radford, DCGAN was kind of a lowlight in his career. He went on to do GPT-1 and GPT-2, as well as some other amazing work at OpenAI.
[00:25:08] So I think there's this really cool connection: people that were working on generative modeling of images actually jumped over to do generative modeling of discrete text data and did some of the really important work there. And the only other GAN paper that I'm going to highlight is called StyleGAN. I'm not really going to walk you through the details of this one, other than to point you at it as a good one to read if you want to know the best practices of GANs. They use a much more complicated architecture, but they get pretty good results in practice. And one really nice thing about GANs is that they actually tend to learn something smooth in the latent space.
[00:25:44] So what I mean by that is: if we have two latent vectors Z0 and Z1 and we interpolate between them, that is, you draw a sample Z0 from your Gaussian, you draw a sample Z1 from your Gaussian, then you interpolate some kind of curve between Z0 and Z1, and for every point along the curve we generate a sample using our generator. Then if we do that, we tend to get smooth interpolations in this latent space, which is something really cool with GANs. And here's an example of this latent space interpolation from the StyleGAN 3 paper. So you can see that these are all generated samples, made by smoothly varying that latent Z and then passing it through the generator, and you can see that these animals sort of smoothly morph into each other.
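The interpolation procedure just described can be sketched in a few lines. This is my illustration, not the lecture's code; the `generator` here is a stand-in identity map so the snippet runs, where a real setup would call a trained generator network.

```python
import numpy as np

def lerp(z0, z1, t):
    # linear interpolation between two latent vectors
    return (1.0 - t) * z0 + t * z1

rng = np.random.default_rng(0)
latent_dim = 128
z0 = rng.standard_normal(latent_dim)  # sample two latents from the N(0, I) prior
z1 = rng.standard_normal(latent_dim)

generator = lambda z: z  # stand-in for a trained GAN generator
# one generated sample per point along the curve between z0 and z1
frames = [generator(lerp(z0, z1, t)) for t in np.linspace(0.0, 1.0, 8)]
print(len(frames))  # 8
```

One detail practitioners often change: for a Gaussian prior, spherical interpolation is commonly preferred over the straight line, because the midpoint of a lerp between two high-dimensional Gaussian samples has an atypically small norm.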
[00:26:26] So that means that the model has somehow uncovered some useful structure and stuffed it into the latent space. So that's pretty cool. So I used to talk a lot more about generative adversarial networks. The pros, basically: they have a fairly simple formulation, and if you tune them right, like we saw with StyleGAN 3, they can actually give you very nice results. Very beautiful images, very high resolution, very good stuff. But the cons: like we talked about, they're fairly unstable to train. You have no loss curve to look at. You have very unstable training. They tend to blow up at the drop of a hat. So you end up with what's called mode collapse.
[00:27:03] All of a sudden you might get NaNs, you might get Infs; your discriminator starts going crazy, your generator starts producing complete random garbage, and all the while you have no loss curves to look at to diagnose this. They're kind of a mess. So even though GANs can give you really nice results if you very, very carefully tune them, and very, very carefully control the normalization, the sampling, everything about them, in practice they've been fairly hard to scale up to really big models and really big data. So GANs were basically the go-to category of generative models from around 2016 to maybe around 2020 or 2021, something around there. And in those five years there were literally thousands and thousands of papers: people both trying different GAN formulations, different loss functions, different mathematical
[00:27:49] formalisms, as well as applying GANs to all kinds of different generative modeling tasks that you can imagine. So this was basically the go-to generative modeling framework for about five or six years. Question: shouldn't we just expect these smooth latents? I think not necessarily, because one thing that can happen with GANs is the generator might just memorize a fixed number of data samples, right? So what if your generator ignores the latent Z and just memorizes 10 samples from the training data set somehow, and then no matter what Z you give it, it always gives you one of those 10 samples from the training data set and never gives you anything else?
[00:28:25] Then in that case, you're going to fool the discriminator, because the generator is always giving you something which is maybe bitwise identical to one of your real samples. And in that case the generator would have basically piled up Dirac delta density in the immediate vicinity of a couple of finite samples, but put no probability mass anywhere else. So that actually is kind of a legitimate solution for the generator, and it would definitely not give you smooth latents at all. So that's just one example of how these things can collapse into unintuitive solutions that are not what you want. Oh, good question: what is the relationship between your training data set and your latents? This is actually something very fundamental about GANs; it's a great question. So you can map one way.
[00:29:04] So the generator gives you a mapping from latent space into data space: it maps from a Z to an X. But with GANs, you in general have no way to map back from an X to a Z. And that's something very different between GANs and VAEs. So VAEs will learn an explicit mapping from X back to Z, but with GANs you have no such thing. You can try to compute an inverse, analytically or numerically via gradient descent; there are papers that do that. But there's actually no explicitly enforced relationship between X and Z. Instead, you can think of the discriminator as trying to just enforce a distributional alignment between the distribution of all the outputs coming from the generator and the distribution of all the data samples, without any kind of explicit supervision between them.
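The numerical-inversion idea mentioned here can be sketched briefly. This is my own illustration under stated assumptions, not any specific paper's method: `G` is a toy linear generator standing in for a trained network, and we recover a latent for a given sample x by gradient descent on the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))   # toy "generator" weights: G(z) = A @ z
G = lambda z: A @ z

z_true = rng.standard_normal(4)
x = G(z_true)                     # the sample we want to invert

z = np.zeros(4)                   # arbitrary starting latent
lr = 0.02
for _ in range(5000):
    grad = 2.0 * A.T @ (G(z) - x)  # gradient of ||G(z) - x||^2 with respect to z
    z -= lr * grad

print(float(np.linalg.norm(G(z) - x)))  # reconstruction error, driven toward 0
```

For a real nonlinear generator there is no closed form, so the same loop is run with automatic differentiation, and it may land in a local optimum; that is one reason these inversions are only approximate.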
[00:29:45] Of course, when it comes to GANs, anything you can think about, there are probably at least a dozen papers about it. So there are also a lot of papers about GAN variants that try to learn bidirectional mappings, both ways, but those never really took off. Oh, good question: what have we gained? So when we went to VAEs, we gained latent vectors, but we gave up density. And now with GANs, it seems like we've got latent vectors that we can't control. What you gained was much better samples. So when it comes to VAEs, they tend not to give you very good samples: VAE samples are sort of characteristically always kind of blurry. They never really look good. VAEs on their own just never tend to give you very clean, crisp samples. But with GANs, as you saw with some of the examples, you can get very crisp, very clean, very good samples.
[00:30:28] But what you lost was your sanity in trying to tune these systems. Yeah. At inference time you throw away the discriminator and just use the generator. So at inference you just draw a sample Z from the prior, pass it through the generator, and get your sample from your data distribution. So it's very, very efficient at inference time. All right. So I mentioned that GANs used to be the go-to category of generative modeling for about five or six years. So what displaced them? What displaced them was a very different category of models called diffusion models. Now, I need to put some caveats here. The diffusion model literature is crazy, right? You read these papers, and they go through like five pages of math before they tell you at all what's going on.
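Backing up to the GAN inference procedure just described (throw away the discriminator, one forward pass of the generator), here is a minimal sketch. The two-layer generator with random weights is a hypothetical stand-in for a trained network; only the inference pattern itself comes from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "trained" generator weights: a tiny two-layer MLP mapping a
# 128-dim latent z to a flattened 32x32x3 image. The weights are random here,
# purely for illustration; a real GAN would have learned them.
W1 = rng.standard_normal((128, 256)) * 0.05
W2 = rng.standard_normal((256, 3072)) * 0.05

def generator(z):
    """Map latent samples z to data-space samples x in one forward pass."""
    h = np.maximum(z @ W1, 0.0)  # ReLU hidden layer
    return h @ W2

# GAN inference: sample z from the Gaussian prior, run the generator once.
# The discriminator is not used at all at inference time.
z = rng.standard_normal((16, 128))
x = generator(z)
print(x.shape)  # (16, 3072)
```

Note there is no iteration here: one draw of z, one forward pass, one sample, which is why GAN inference is so efficient.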
[00:31:10] And there are like three different mathematical formalisms that lead to diffusion models, which are all very different mathematically, and there's very different notation, very different terminology, very different mathematical formalism between papers. So this is a sub-area that's crazy. So I need to put a big caveat here that I'm not going to fully cover all the different variants of diffusion models with all of their proper mathematical formalism. Instead, what I'm going to try to do is give you an intuitive overview of diffusion models, as well as an intuitive geometric understanding of the most common form of diffusion models today, which are called rectified flow models.
[00:31:44] You could really teach many, many lectures about diffusion models and get into all the interesting mathematical nuance of all these flavors, but we just won't have time for that in two-thirds of one lecture, unfortunately. So, with that caveat aside, the intuition behind diffusion models is actually kind of easy. Like with all generative models, we want to draw samples, and kind of like GANs, we want to convert samples from a noise distribution PZ into a data distribution PX. But the way that we're going to do that in diffusion models is totally different. GANs learn a deterministic mapping through the generator to map a Z directly to an X. With a diffusion model, we're going to do something more implicit, more indirect.
[00:32:26] So, first off, the first constraint in diffusion models is that the noise distribution always has to have the same shape as our data. So if you have an image that's H x W x 3, then your noise distribution always has to be H x W x 3 as well; they have to be exactly the same shape. Now what we're going to do is consider different versions of our data that are corrupted by increasing levels of noise. So here, if we have a data sample, which is this picture of a cat, then t is going to be our noise level, which ranges from 0 to 1. So t = 0 means no noise; that means a totally clean data sample. t = 0.3 is a little bit of noise: we mix some of our noise z into our data x.
[00:33:14] And if we go all the way to t = 1, we get full noise, and those are going to be samples directly from our noise distribution. So somehow this t parameter is going to interpolate smoothly between our data distribution and our noise distribution. And the noise distribution, again, is going to be Gaussian, almost always Gaussian, something simple that we understand and can sample from. And now what we're going to do is train a neural network to do a little bit of incremental denoising. So the neural network is going to receive some sample, a piece of data which has been corrupted with some intermediate amount of noise, and the neural network is going to be trained to try to clean it up a little bit, to remove just a little bit of the noise.
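To make "mixing noise z into data x at level t" concrete, here is one common choice of corruption: the linear schedule that the rectified flow models later in the lecture use. Other diffusion formalisms use different schedules, so treat this as one illustrative instance, not the definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, z, t):
    """Mix noise z into data x at noise level t in [0, 1].
    t = 0 gives clean data; t = 1 gives a pure noise sample. This linear
    mixing is the rectified-flow choice; other diffusion variants differ."""
    return (1.0 - t) * x + t * z

x = rng.standard_normal((32, 32, 3))  # stand-in for a data sample (the cat)
z = rng.standard_normal(x.shape)      # Gaussian noise, same shape as the data

x_clean = corrupt(x, z, 0.0)  # t = 0: totally clean data sample
x_mid   = corrupt(x, z, 0.3)  # t = 0.3: a little bit of noise mixed in
x_noise = corrupt(x, z, 1.0)  # t = 1: a sample from the noise distribution
```

Notice the shape constraint from above in action: `z` must have exactly the same shape as `x` for the mixing to be defined.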
[00:33:55] So the training objective here is going to be: the neural network inputs an image with some amount of noise, and it tries to remove some of the noise. Then at inference time, what we're going to do is an iterative procedure where we first draw a noise sample directly from our noise distribution PZ, and then iteratively apply the neural network to remove noise from that sample, a little bit at a time. So the very first time we do this, we're going to draw a sample that's complete noise, and then on the very first application of the neural network, the network will be trying to remove noise from full noise. So it'll basically be forced to hallucinate just a tiny whiff of data structure in that noise.
[00:34:35] And then once we get to some slightly less noisy example, we're going to pass it back to the neural network and again ask it to remove just a little bit of noise from this now slightly denoised, slightly generated image, and then it'll get a little bit less noisy, and a little bit less noisy, and a little bit less noisy. And eventually, if we set up all of this stuff correctly, then we want to get to a situation where we can draw a complete noise sample and ask the network to remove noise from that complete sample Z of random noise until eventually we've removed all the noise and come up with a generated sample from the system. So that's kind of a weird setting; it's kind of a weird thing. But that's kind of the intuition behind diffusion models. Is the number of steps a fixed hyperparameter? It depends.
[00:35:19] So on this slide, I was intentionally, I was sort of forced to be, very vague about all these things. What is the noise? What does it mean to corrupt the data with respect to the noise? What does it mean to remove a little bit of the noise? What does it mean to apply the network iteratively at inference? Because, like I said, there are so many different formalisms of diffusion, and there are a lot of different variants of exactly what these mean in different situations. So this slide is intended to be a fairly high-level overview of diffusion, and then different specific implementations of diffusion models will have different concrete choices for what all these terms specifically mean. So does this high-level picture of diffusion kind of make sense? Okay. So then let's make this more concrete.
[00:35:57] So now we're going to jump from general diffusion models to a particular category of diffusion models called rectified flow models. Some people may argue with me and say that rectified flow is not diffusion; some people might say that they're different things. I don't really care. To me, rectified flow is a kind of diffusion model. Fight me. So with rectified flow, the intuition is basically this. We have the same setup: we have our P noise, we have our P data. And we're going to draw this geometrically, because I think that's a nice way to gain intuition, geometrically, in two dimensions in particular, because that's all that fits on the slide. But of course, these are going to be super, super high-dimensional images and Gaussians, which is an easy way to get led astray, right?
[00:36:36] Because intuitions that hold in two and three dimensions go like totally out the window when you go to a lot of dimensions. It's really sad. It's sort of sad that we live in such a low-dimensional universe, because the intuitions that we build in this universe just don't really transfer to 100-dimensional spaces, thousand-dimensional spaces. So always be aware, but it is what it is; we're stuck with the universe we got. So the setup in rectified flow is that we've got our distribution P noise and our distribution P data. P noise is something simple that we understand, that we can sample from, that we can compute integrals of. It's a very friendly distribution. P data, again, is something crazy. That's what the universe is using to give us images.
[00:37:14] Now, at every training iteration, we're going to sample a Z from our prior distribution and sample an X from our data distribution. We can draw a sample z analytically because PZ is something simple that we control, and drawing a sample from the data distribution just means picking an example from your finite training set. You're also going to choose a t uniformly on 0 to 1. Remember, t is our noise level, where t = 0 means no noise and t = 1 means all the noise. So now we're going to draw a line that points from our data sample x directly to our noise sample z. And this line, this vector pointing from X to Z, we're going to call V. This is going to be the velocity of a flow field.
[00:38:01] And then we set XT to be a point along this line, which is a linear interpolation between X and Z. So now we've got our noise sample Z, our data sample X, we've got the velocity vector V between them, and we've picked a noised version of our data, XT. And this is what "get noisy data" from the previous slide means in the case of rectified flow models: it's a linear interpolation between a data sample and a noise sample. And now the training objective is very, very simple. We're going to train a neural network f theta, so that's f with learnable parameters theta. That neural network is going to input the noised sample xt as well as the noise level t, and it's going to try to predict the green vector v. So that's it. That's all we need to do in rectified flow.
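Written out with the lecture's symbols, the quantities just described are the following. Note this uses the lecture's convention that v points from the data sample x toward the noise sample z; some papers flip the sign.

```latex
x_t = (1 - t)\,x + t\,z, \qquad v = z - x,
\qquad
\mathcal{L}(\theta) =
\mathbb{E}_{\,x \sim p_{\mathrm{data}},\; z \sim p_{\mathrm{noise}},\; t \sim \mathcal{U}(0,1)}
\left\| f_\theta(x_t, t) - v \right\|_2^2 .
```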
[00:38:46] The code for this is very simple. You would be shocked at how much obscurity there is when you read papers here, and it boils down to this very simple code. It drives me crazy that this is not made more clear in a lot of presentations of this. So the training loop for rectified flow is extremely simple. You loop over your data set, and at every iteration you get Z, which is unit Gaussian of the same shape as X. You choose a noise level T, which is uniform on 0 to 1. You compute XT, which is a linear interpolation between X and Z. You give XT and T to your model, and then your loss is just the mean squared error between this ground-truth V and the model prediction. And that's it. That's your training objective for rectified flow models. Contrast this with GANs.
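The training loop just described can be sketched as follows. Hedged assumptions: `f_theta` here is a hypothetical stand-in (a single linear layer over the flattened input plus t) where a real implementation would use a U-Net or transformer, and the parameter update itself is left to whatever autodiff framework you use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the network f_theta: one linear layer that sees
# the flattened noised input x_t together with the noise level t. Only the
# surrounding training loop is the point of this sketch.
D = 3072  # e.g. a flattened 32x32x3 image
W = np.zeros((D + 1, D))

def f_theta(x_t, t):
    inp = np.concatenate([x_t, np.full((x_t.shape[0], 1), t)], axis=1)
    return inp @ W

def train_step(x):
    """One rectified-flow training iteration, as described in the lecture."""
    z = rng.standard_normal(x.shape)   # unit Gaussian, same shape as x
    t = rng.uniform(0.0, 1.0)          # noise level, uniform on [0, 1]
    x_t = (1.0 - t) * x + t * z        # linear interpolation between x and z
    v = z - x                          # ground-truth velocity vector
    loss = np.mean((f_theta(x_t, t) - v) ** 2)  # mean squared error
    # (the gradient update of W from this loss is omitted in this sketch)
    return loss

batch = rng.standard_normal((8, D))    # stands in for a minibatch of data
loss = train_step(batch)
```

Every iteration draws fresh z and t, so the same data example is seen at many noise levels over the course of training.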
[00:39:31] When you train rectified flow models, or really any kind of diffusion model, you have a loss that you can look at during training. When the loss goes down, the model is generally better. So for those of us that went through half a decade of GAN madness, the first time you train a diffusion model and there's a loss to look at, it's like, oh my god, this is an amazing thing. Like, how many hours have we spent looking at GAN plots that look like this, where you have no idea; it's like reading tea leaves to tell whether or not the model is working well. You train a diffusion model, you get this beautiful, smooth, exponential loss curve, and it just makes you so happy. So that's great. So that's training for diffusion models. Now, what do we do at inference? Right? Because GANs are kind of easy at inference.
[00:40:12] With GANs you just take a Z, pass it through your generator, and you get a data sample; very straightforward. But now with a diffusion model, or a rectified flow model in this case, the model output itself is kind of useless on its own. We give it an XT, we get a V. What are we going to do with this? Not super clear. So inference time is where diffusion models get a little bit more complicated compared to GANs. So at inference, we first will choose, up front, a number of steps T, which is usually a fixed constant, and in the case of rectified flow models T = 50 is usually a good number to start with. Sometimes you can get down to T = 30 and that works okay. Then what you're going to do is sample an x directly from your noise distribution. This is going to be pure noise that's sampled from your known noise distribution.
[00:40:54] Then you're going to loop from t = 1, marching backwards to t = 0. This is your noise level. In this simple version, we're just sort of marching linearly from full noise at one back to noise level zero, perfectly clean. Then at every iteration, we're going to take our XT, which was at first full noise, pass it to the network along with the current noise level, and get the network's predicted VT. And remember what this VT is supposed to be in the case of rectified flow: this V was supposed to point from a data sample all the way to a noise sample. So then it's kind of geometrically obvious what you should do in the case of rectified flow: you should take a little step along that predicted V vector.
[00:41:40] But, right, the problem is that this rectified flow model's V is not going to point you all the way to a clean sample. It's just going to get you started; it's going to set you on a trajectory towards a clean sample. So we take a little step along that predicted V from the flow model to get a new X, which is now a version of the data that has had a little bit of the noise removed from it by the model. And now we iterate this. So once we have this X at noise level 2/3, we pass it back to the model and get another predicted V vector from the model. And remember, the V is supposed to point from a clean sample all the way to a noise sample. So then again we can take a little step along this predicted V to get another X at noise level 1/3. Repeat this thing again.
[00:42:21] Re-evaluate the model again to get another predicted V, and then take a step, in this case all the way to no noise, all the way to the end of that vector, to get our predicted X0, and then that is our sample from our diffusion model. So the inference procedure you see here got a little bit more complicated compared to GANs. But what we gained here was sanity when you're training; you've regained that. And they tend to give you much better samples, and they tend to scale really well to large data sets and large models. And the code here is really simple, right? So we start off by taking a random sample, make it be perfectly random, then march backwards for t from one back to zero. At every noise level, you get a predicted V from the model, given your current sample as well as your T.
[00:43:06] Then you take what looks kind of like a gradient descent step on the model's predicted V, update the sample, and just repeat this whole thing in a loop. So then you can see these diffusion models aren't so scary after all. You can actually fit a complete implementation of training and sampling from a rectified flow model in just a couple of lines, on one slide, which I think is very nice. Okay, so this is pretty nice.
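The sampling loop just described can be sketched in a few lines of Python. This is a hedged illustration, not the slide's actual code: `toy_velocity_model` is a made-up stand-in for a trained network, chosen so the loop is runnable (it assumes the data distribution is a point mass at zero, in which case the exact velocity at (x, t) is x / t).

```python
import numpy as np

def toy_velocity_model(x, t):
    # Hypothetical stand-in for a trained velocity network. Under the
    # linear path x_t = (1 - t) * x0 + t * z, the target velocity is
    # v = z - x0. If the data distribution is a point mass at 0, then
    # x_t = t * z, so the exact velocity at (x, t) is x / t.
    return x / max(t, 1e-8)

def sample_rectified_flow(model, shape, num_steps=3, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)          # pure noise at t = 1
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        v = model(x, t)                     # predicted velocity at (x, t)
        x = x + (t_next - t) * v            # Euler step toward t = 0
    return x                                # predicted clean sample x0

x0 = sample_rectified_flow(toy_velocity_model, shape=(4,))
```

With three steps, the trajectory passes through x at t = 2/3 and t = 1/3, matching the walkthrough above, and the toy model drives the sample exactly to the data mean (here, zero).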
[00:43:30] I'm pretty happy that we're able to get to a full implementation. And this will actually work, right? If you take this code and plug it into a reasonable model architecture, this will actually converge to something kind of reasonable in a lot of cases. You're kind of hitting on the core problem in generative modeling that I've been thinking about a lot the last couple of days while reviewing these slides. The core problem in generative modeling is: somehow you have a prior distribution, the Z's, that you know how to sample from; you have a data distribution, the X's, that you want to generate; and the core problem is figuring out how to associate Z's and X's. And all your different categories of generative modeling kind of do it in different ways. Right?
[00:44:07] In a VAE, you say: I'm going to have the model predict a Z and then predict an X, and then try to force that Z to be something I know how to sample from. Which doesn't work that well. In a GAN, you're not supervising that relationship; the generator is kind of figuring out its own mapping from Z to X in a feed-forward way, through this distribution-matching objective that the discriminator is giving it. In diffusion, it ends up having to integrate these curves. And there are actually several different mathematical formalisms as to why objectives that look like this end up matching probability distributions in a reasonable way.
[00:44:46] But again, the whole core problem is that we have no way ahead of time to pair up samples Z from our prior with samples X from our data. If we knew how to make that pairing, and also knew how to sample from the prior, you'd be done. And in some sense, all these different forms of generative modeling are different ways to square that circle: to come up with a way to learn an association from Z to X, and to be able to sample from Z, even though we don't have that association at training time. There are a lot of different interpretations of this that can get very, very heavy very quickly, so I've tried to avoid them. Right? But we said last lecture that unconditional generative modeling is kind of pointless.
[00:45:27] So what we almost always care about is conditional generative modeling, and that's easy to accommodate in rectified flow. To do conditional rectified flow, we imagine that there are different sub-parts of our data distribution. Here I'm saying it's categorical: maybe our data is actually squares and triangles. Then we have our whole data distribution P_data, as well as our two sub-distributions: P_data(x | y = square) and P_data(x | y = triangle). This is the picture you should have in mind when you think about conditional generative modeling. In the case of rectified flow, this is very easy to accommodate: your dataset now has pairs (x, y), and your model now takes y as an additional auxiliary input somehow.
Um and then during sampling same thing you [00:46:07] and then during sampling same thing you get your predicted V's uh according and [00:46:09] get your predicted V's uh according and you you get your predicted V's you know [00:46:11] you you get your predicted V's you know the model takes as input this extra Y uh [00:46:14] the model takes as input this extra Y uh and you use that. So this all kind of [00:46:15] and you use that. So this all kind of goes through. Um the difference is that [00:46:17] goes through. Um the difference is that now Y is actually hopefully some [00:46:19] now Y is actually hopefully some conditional signal that the user can [00:46:21] conditional signal that the user can control. Maybe this is a text prompt. [00:46:22] control. Maybe this is a text prompt. Maybe this is an input image. Maybe this [00:46:24] Maybe this is an input image. Maybe this is this is some kind of user input that [00:46:26] is this is some kind of user input that you're expecting at inference time. Um [00:46:27] you're expecting at inference time. Um which actually make these models [00:46:28] which actually make these models controllable and useful in practice. [00:46:31] controllable and useful in practice. Um, but then there's another really [00:46:32] Um, but then there's another really interesting question is um, is there any [00:46:35] interesting question is um, is there any knob you can tune to control how much [00:46:37] knob you can tune to control how much the model pays attention to the [00:46:39] the model pays attention to the conditioning signal, right? It turns out [00:46:40] conditioning signal, right? It turns out if you train these things naively, a lot [00:46:42] if you train these things naively, a lot of times they don't often follow the [00:46:44] of times they don't often follow the conditioning signal quite as much as you [00:46:45] conditioning signal quite as much as you would like. Um, so there's a trick [00:46:47] would like. 
[00:46:50] So there's a trick called classifier-free guidance, or CFG, that changes our diffusion training loop just a little bit. We're still going to train this conditional diffusion model that inputs your x_t and inputs your y, but on every training iteration we're going to flip a coin. If that coin is heads, we're going to delete the conditioning information: set it equal to some kind of zero value or null value. Basically, destroy the conditioning information 50% of the time. That fraction could be a hyperparameter, but 50% is a pretty good one that most people use in practice. So we flip a coin, and if the coin is heads, we delete the conditioning information.
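The coin-flip conditioning dropout can be sketched like this. It is a minimal illustration with made-up names: `NULL_LABEL` is a hypothetical sentinel standing in for whatever null conditioning value a real model uses.

```python
import numpy as np

NULL_LABEL = -1  # hypothetical sentinel meaning "conditioning deleted"

def drop_conditioning(y, p_drop=0.5, rng=None):
    # The CFG training trick: with probability p_drop (the coin flip),
    # replace each label in the batch with the null value, so the model
    # is forced to also learn the unconditional velocity field.
    rng = rng or np.random.default_rng()
    coin = rng.random(len(y)) < p_drop
    y = np.asarray(y).copy()
    y[coin] = NULL_LABEL
    return y

labels = drop_conditioning([0, 1, 2, 3], p_drop=0.5,
                           rng=np.random.default_rng(0))
```

Each surviving entry is either the original label or `NULL_LABEL`; everything else about the training step stays the same.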
[00:47:25] That means the model is conceptually now forced to learn two different kinds of velocity vectors. In the case where we pass it this null value for y, the one that has destroyed the conditioning information, this is basically an unconditional generative model: that predicted velocity vector V has to point back towards the meat of the whole data distribution P_data. But when we pass a real conditioning input y that's non-destroyed, non-null, non-zero, then we're getting a conditional velocity vector that points back not towards the full data distribution, but towards the conditional data distribution, conditioned on that conditioning signal we cared about.
[00:48:14] And then the dumb trick is: we're going to take a linear combination of these two vectors to push it more towards the conditional velocity vector. In particular, we'll have a scalar hyperparameter w and take the linear combination (1 + w)·v_y − w·v_null. That's a vector that now points even more towards the conditional distribution than it does towards the data distribution.
[00:48:42] Then the idea is that during sampling, we're now going to step according to this v_CFG vector rather than the raw vectors predicted by the model. Setting w equal to zero here will recover exactly the conditional one, and the higher your w is, the more you're overemphasizing the conditioning signal. And this is pretty easy to implement, right? Your inference code doesn't really change too much, but now you evaluate the model twice at every iteration to get your v_y and your v_null, and then you take this linear combination and step according to that. And this is called classifier-free for a stupid reason: there was an earlier paper called classifier guidance that I don't want to get into.
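The guidance combination itself is one line. A minimal sketch, assuming the two model evaluations have already produced a conditional velocity `v_y` and an unconditional velocity `v_null` (both toy values here):

```python
import numpy as np

def cfg_velocity(v_y, v_null, w):
    # Classifier-free guidance: v_cfg = (1 + w) * v_y - w * v_null.
    # w = 0 recovers the purely conditional velocity; larger w pushes
    # the step further toward the conditional distribution.
    return (1.0 + w) * v_y - w * v_null

v_y = np.array([1.0, 0.0])      # toy conditional prediction
v_null = np.array([0.5, 0.5])   # toy unconditional prediction
print(cfg_velocity(v_y, v_null, 0.0))  # w = 0: exactly v_y
print(cfg_velocity(v_y, v_null, 2.0))  # w = 2: [2., -1.], pushed past v_y
```

The sampler then takes its Euler step along `v_cfg` instead of the raw prediction, at the cost of two model evaluations per step.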
[00:49:31] Then they removed the classifier, and even though there were only about 9 months between those two papers, and it's now been 4 years since the second one, we're still stuck with the name classifier-free guidance. So it is what it is. Okay. So that's actually really important in practice for getting high-quality outputs. That's CFG; it's used everywhere in diffusion models. It does double the cost of sampling, though, because now you need to hit the model twice on every iteration, which is kind of a problem. Okay. There's this thing on optimal prediction; I think I'll skip that. I mean, it is interesting, but I'm worried about time. But one thing that we sometimes need to do is tweak this t distribution.
[00:50:11] We saw in particular that we were sampling t according to a uniform distribution in a raw rectified flow model. The thing about that is it's going to put uniform emphasis on all noise levels. And intuitively, when you're at full noise, the problem is very easy: the optimal prediction from the model is basically to point towards the mean of the data distribution. Similarly, when you're at zero noise, the optimal prediction is actually to point towards the mean of the noise distribution. So the optimal predictions from the model at full noise and at no noise are actually relatively easy problems; it just needs to learn the means of those two distributions.
[00:50:55] But when you're somewhere in the middle, it's actually really, really hard, right? Because when you're somewhere in the middle and you sample that x_t, there might have been multiple pairs (x, z) that could have given rise to that same x_t. The network basically needs to solve this expectation problem and figure out the optimal direction to predict, one that integrates over all possible x's and z's that might intersect at this point x_t. So those points in the middle are intuitively much harder for the network to solve. But when we sample t uniformly from 0 to 1, we're putting equal importance on all levels of noise, which doesn't really match this intuition. So in practice you'll often sample from different noise schedules.
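The colliding-pairs intuition above can be made concrete with a tiny, purely illustrative example. Assume a dataset of just two points, +1 and −1, and the linear path x_t = (1 − t)·x0 + t·z: two different (x0, z) pairs can land on the same x_t at t = 1/2, so the loss-optimal prediction there is the average of their conflicting velocity targets.

```python
# Two different (clean sample, noise sample) pairs that collide at the
# same midpoint x_t under linear interpolation x_t = (1 - t)*x0 + t*z.
t = 0.5
pairs = [(1.0, -1.0), (-1.0, 1.0)]            # (x0, z)
xts = [(1 - t) * x0 + t * z for x0, z in pairs]
assert xts[0] == xts[1] == 0.0                # both pairs give x_t = 0

targets = [z - x0 for x0, z in pairs]         # per-pair velocity targets
optimal_v = sum(targets) / len(targets)       # best single answer: the mean
print(targets, optimal_v)                     # [-2.0, 2.0] 0.0
```

The two targets point in opposite directions, so the network's best prediction at that x_t is their expectation, which is exactly the hard averaging problem mid-noise points create.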
[00:51:37] One very popular one is called logit-normal sampling, which looks kind of like a Gaussian: it puts relatively little weight on zero and one, with a lot more weight in the middle. Another thing you'll sometimes see are these so-called shifted noise schedules, which are asymmetric and shift more towards one direction or the other. Those are important as we scale to high-resolution data. The intuition is that when you have a very high-resolution image, there can be very strong correlations across neighboring pixels; when you have a low-resolution image, the correlations across neighboring pixels tend to be smaller. So depending on how strong the correlations in your data are, you may actually need different amounts of noise to properly destroy information in a nice way.
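A logit-normal t is just a Gaussian sample pushed through the sigmoid, which is one common way this schedule is implemented. A sketch, with the mean and standard deviation as tunable assumptions (shifting the mean gives an asymmetric, "shifted" schedule):

```python
import numpy as np

def sample_t_logit_normal(n, mean=0.0, std=1.0, rng=None):
    # Logit-normal sampling: draw a Gaussian, then squash through the
    # sigmoid so t lands in (0, 1), with most mass in the middle and
    # little weight near the endpoints. A nonzero `mean` shifts the
    # schedule toward higher or lower noise levels.
    rng = rng or np.random.default_rng()
    u = rng.normal(mean, std, size=n)
    return 1.0 / (1.0 + np.exp(-u))

t = sample_t_logit_normal(10_000, rng=np.random.default_rng(0))
```

Compared to uniform sampling, most of the drawn t values concentrate around the hard mid-noise region rather than the easy endpoints.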
[00:52:17] So these things don't naively scale to different resolutions, right? And that's actually a big problem with these diffusion models: they're a beautiful formulation, but it's hard to get them to work naively on high-resolution data. So that leads to... you know, I said diffusion models are the most popular form of generative modeling. That was a little bit of a lie, because what's actually most popular are these so-called latent diffusion models, a variant that actually gets used everywhere. So here it's going to be a multi-stage procedure. What we're going to do first is train an encoder network and a decoder network.
[00:52:48] The encoder network is going to map from our image into some latent space, which I've colored in purple. Ideally that latent is going to spatially downsample the image by a factor of D, as well as convert from three channels up into C channels. A pretty common setting is to get 8x8 spatial downsampling and to increase to 16 channels; that's what some of the most common encoder-decoders do. These encoder-decoders tend to be CNNs with attention, but some more recent papers have explored ViTs for these. Then what we do is train a diffusion model not on the raw pixel space of our images, but instead on the latent space which is discovered by this encoder-decoder model. So then the training procedure for the diffusion model looks like this.
[00:53:30] We're going to sample an image, pass it through the encoder that we learned in the first stage to get a latent, then add noise to the latent and train the diffusion model to denoise the noised latent. And really importantly, you freeze the encoder. You do not propagate gradients back into the encoder; we're only using it to extract these latents, and then training a diffusion model directly on the latent space which is learned by the encoder. Then at inference time, once you're all done training, we'll sample a random latent, pass it through the diffusion model many, many times to remove all the noise, and get a clean sample. But that clean sample is now a clean sample in latent space. So then we need to run the decoder to convert that clean latent into a clean image. And this is actually the most common form of diffusion model these days.
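One latent-diffusion training step can be sketched as below. Everything here is a toy stand-in: a real pipeline would use a pretrained VAE encoder/decoder and a learned velocity network, while this version fakes the 8x spatial downsampling so the data flow (encode with a frozen encoder, noise the latent, form the regression target, decode at the end) is runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a frozen VAE: subsampling for "encode",
# nearest-neighbor upsampling for "decode". No gradients ever flow here.
encode = lambda img: img[::8, ::8]
decode = lambda lat: np.repeat(np.repeat(lat, 8, axis=0), 8, axis=1)

def latent_training_step(image, t):
    latent = encode(image)                  # frozen encoder, no backprop
    noise = rng.standard_normal(latent.shape)
    noised = (1 - t) * latent + t * noise   # noise the LATENT, not pixels
    target_v = noise - latent               # velocity regression target
    return noised, target_v

image = rng.standard_normal((64, 64))       # toy single-channel "image"
noised, target_v = latent_training_step(image, t=0.5)
# At inference, the fully denoised latent goes back through decode(...)
reconstruction = decode(encode(image))
```

The diffusion model itself never sees pixels: it is trained and sampled entirely in the 8x-smaller latent space, and the decoder is only applied once at the very end.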
[00:54:14] So you might be asking: okay, we've got this encoder-decoder; how do we train an encoder-decoder? Any ideas? Have we seen encoder-decoders? How about a variational autoencoder? So in practice, whenever you're training these latent diffusion models, this encoder-decoder tends to be a variational autoencoder. But we just said there was a big problem with variational autoencoders: they give you blurry outputs, right? And if this encoder-decoder is going to give you blurry outputs, the quality of the reconstructions you get out of the decoder is going to bottleneck the quality of the generations you get out of the downstream diffusion model. So if your encoder-decoder is giving you blurry, ugly reconstructions, that's not going to fly; that's not going to get us good clean samples.
[00:54:59] So, anyone have an idea for cleaning up the sample quality of a VAE? Put something after the decoder; in particular, we can make it a GAN. So what we tend to do is actually train this encoder that encodes from an image into latent space, a decoder that goes from latent space back to images, a discriminator that tries to tell the fake images from the real images, and then a diffusion model that generates these things in latent space. So this is basically why we had to walk through all of these different formulations of generative models, in order for you to understand the modern pipeline. Basically, is the state of the art in generative modeling a VAE? Is it a GAN? Is it diffusion? It's all of them, right? It's all of them.
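Schematically, the autoencoder stage optimizes a reconstruction-plus-KL (VAE) objective together with an adversarial term from the discriminator. The softplus/logistic forms and the `adv_weight` below are common choices, not necessarily the exact ones used in any particular latent-diffusion codebase:

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + e^x).
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def autoencoder_loss(recon_err, kl, d_fake_logit, adv_weight=0.5):
    # VAE part (reconstruction + KL) plus a non-saturating GAN term that
    # rewards the decoder for producing images the discriminator calls real.
    return recon_err + kl + adv_weight * softplus(-d_fake_logit)

def discriminator_loss(d_real_logit, d_fake_logit):
    # Standard logistic discriminator loss: real -> high logit, fake -> low.
    return softplus(-d_real_logit) + softplus(d_fake_logit)
```

Once this autoencoder is trained, the diffusion model is trained purely in its latent space, as in the previous step.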
[00:55:42] The modern generative modeling pipeline involves training a VAE and a GAN and a diffusion model. I'm sorry, it's a mess. Okay, so then you might ask: what do the neural networks actually look like under the hood here? Thankfully there is some sanity here over the last couple of years. It turns out that relatively straightforward transformers can actually be applied to these diffusion models, and they work really well. These are typically called diffusion transformers, or DiTs, but basically these are just standard transformer blocks that don't really have much special sauce in them. There are a couple of questions; the main question you need to solve on the architecture side is: how do you inject the conditioning information?
[00:56:23] In particular, the diffusion model now needs to take three things as input. It needs to take your noisy image. It needs to take your timestep t. It also needs to take your conditioning signal, which might be your text or something like that. And then you have a couple of different mechanisms for injecting that conditioning signal into your transformer blocks. The first is to predict a scale and shift that are used to modulate some of the intermediate activations of your diffusion block, and that's typically the way that we inject the timestep information into diffusion models. Another thing you can do is exploit the fact that transformers are just models of sequences, so you can jam everything into the sequence.
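Here is a toy version of that scale-and-shift ("adaLN"-style) modulation just described. The projection matrix `W`, bias `b`, and the sinusoidal timestep embedding are illustrative choices, not a specific model's parameters:

```python
import numpy as np

def timestep_embedding(t, dim=8):
    # Sinusoidal embedding of the scalar diffusion timestep (common choice).
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    return np.concatenate([np.cos(t * freqs), np.sin(t * freqs)])

def modulate(x, t, W, b):
    # Map the timestep embedding to a per-channel (scale, shift) pair,
    # then apply it to layer-normalized activations inside the block.
    scale, shift = np.split(timestep_embedding(t) @ W + b, 2)
    x_norm = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
    return x_norm * (1.0 + scale) + shift

rng = np.random.default_rng(0)
channels = 4
W = rng.standard_normal((8, 2 * channels)) * 0.01  # hypothetical learned weights
b = np.zeros(2 * channels)
x = rng.standard_normal((16, channels))            # 16 tokens, 4 channels
out = modulate(x, t=0.3, W=W, b=b)                 # same shape as x
```

In a real DiT, `W` and `b` are learned, and each transformer block gets its own modulation parameters.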
[00:56:57] You can jam the timestep into the sequence, you can jam your text into the sequence, you can jam whatever you want into the sequence, and have the transformer just model that sequence of data altogether. And you can do that either via cross attention or joint attention, and different models do both. Typically, in modern DiTs, you inject the timestep through this scale-shift mechanism, and you inject the text or other conditioning signal through sequence concatenation, usually cross attention but sometimes joint attention as well. So then how can you actually apply this to different problems? One task that people care about a lot is the task of text-to-image generation. Here we're going to input a text prompt. This is one that I wrote yesterday.
[00:57:36] "A professional documentary photograph of a monkey shaking hands with a tiger in front of the Eiffel Tower. Monkey is wearing a hat made out of bananas. Tiger is standing on two legs and wearing a suit." And this is a real sample; it's crazy that this stuff works now. But I'm sure you've all seen these kinds of things before. The way that this works is you'll take your text prompt and pass it through, usually, a pre-trained text encoder. So actually, I lied: there are more models you have to train, in addition to an encoder and a decoder and a VAE and a discriminator. You also need to train a language model, secretly, to get these things to work. So you'll typically pick up a pre-trained text encoder, usually T5, CLIP, something like that, to give text embeddings, and usually the text encoder will be frozen.
[00:58:12] Then you pass your text embeddings together with your noisy latents into your diffusion transformer, which also gets your diffusion timestep. That outputs clean latents, and this thing will run iteratively, and the result goes through your VAE decoder to give you your final image. And just to put some numbers on this to make it concrete: one pretty powerful open-source model right now is called FLUX.1-dev. They use the T5 and CLIP encoders. Their encoder uses 8x downsampling. They train a 12-billion-parameter transformer model on this, and that transformer has an additional layer of downsampling on top of the VAE, which is kind of messy, so it ends up having a sequence length of 1024 image tokens. Another task that people care about a lot is text-to-video.
[00:58:55] So we can input a text prompt and then output the pixels of a video that follows that text prompt. And the pipeline basically looks the same. You're going to input text through your pre-trained text encoder and get noisy latents. Importantly, the only difference is that your latents now have an extra dimension to accommodate time. So in addition to the two spatial dimensions, H and W, you'll also have a time dimension in your latent, and that will give you clean latents. Your decoder is now typically going to be a spatio-temporal autoencoder, so it downsamples both spatially and temporally, and it will take your latents and upsample them into pixels, which gives you a video. And this is actually a generated video from Meta's Movie Gen paper that came out last year.
[00:59:39] And that's putting some particular numbers on this thing. The key takeaway of these video generation models is that they get very expensive to train due to the sequence length, right? Because if you want to generate high-resolution, high-frame-rate video, it just ends up with a lot of tokens. We said that with a fairly state-of-the-art text-to-image diffusion model, the transformer ended up working on a sequence of 1024 image tokens. For this text-to-video diffusion model, even though the overall architecture looks pretty similar, the biggest difference is in the sequence length: now they actually need to process 76,000 video tokens to create this high-resolution video with a lot of frames.
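The token counts quoted here follow from simple arithmetic. The downsampling factors below are assumptions chosen to land near the lecture's numbers (an 8x VAE plus a 2x patchify for images; an assumed spatio-temporal autoencoder for video, where 73,728 is in the ballpark of the 76,000 quoted), not any model's published configuration:

```python
def image_tokens(h, w, vae_down=8, patch=2):
    # Each token covers a (vae_down * patch)-pixel square of the image.
    d = vae_down * patch
    return (h // d) * (w // d)

def video_tokens(frames, h, w, t_down=4, s_down=16):
    # A spatio-temporal autoencoder compresses time by t_down, space by s_down.
    return (frames // t_down) * (h // s_down) * (w // s_down)

print(image_tokens(512, 512))        # 32 * 32 = 1024 image tokens
print(video_tokens(128, 768, 768))   # 32 * 48 * 48 = 73728 video tokens
```

The point is the scaling: adding a time dimension multiplies the token count, which is exactly where the training cost of video diffusion comes from.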
[01:00:19] So that's where the expense happens in these video diffusion models: actually processing these really long sequences. I think the last year has pretty much been the era of video diffusion models. It seems like almost every week for the past year there's been a new, interesting video diffusion model coming out. And these have been a mix of open-source models; models that have technical reports, so they give you some details about the model architecture and the training; and purely industrial models, where they don't tell you anything, but they'll take your credit card number and let you generate samples.
So, um, I I'm not going to go through all [01:00:53] um, I I'm not going to go through all these in all of these all of these uh, [01:00:55] these in all of these all of these uh, one by one, but I just wanted to give [01:00:56] one by one, but I just wanted to give the sense of like this has been a really [01:00:58] the sense of like this has been a really hot topic the last like really the past [01:01:00] hot topic the last like really the past 18 months. Um and in particular this uh [01:01:03] 18 months. Um and in particular this uh there was this really influential blog [01:01:04] there was this really influential blog post from OpenAI called Sora that came [01:01:06] post from OpenAI called Sora that came out in March 2024 which was not the [01:01:09] out in March 2024 which was not the first um diffusion model on videos but [01:01:11] first um diffusion model on videos but it was the first one that gave really [01:01:12] it was the first one that gave really really really good results. Um and they [01:01:14] really really good results. Um and they kind of adopted this modern sort of [01:01:16] kind of adopted this modern sort of diffusion transformer plus rectified [01:01:18] diffusion transformer plus rectified flow. Um actually I don't know if they [01:01:20] flow. Um actually I don't know if they were using rectified flow in Sora. I [01:01:21] were using rectified flow in Sora. 
[01:01:23] I don't know if they said. But they were one of the first to really scale up these diffusion transformers and get this thing to work really well, and that was kind of the four-minute-mile moment in video diffusion models: all the other big companies took notice and quickly tried to replicate Sora. So, as I said, it's felt like for the past year and a half that almost every week there's been a brand new state-of-the-art video diffusion model. And today is no exception, because an hour and a half ago, at 11:00 a.m. this morning, Google announced Veo 3, which is almost certainly the best generative model of video out there right now. I literally read the blog post while I was in the car on the way here, but it seems cool. Here are some samples from Veo 3.
[01:02:06] So these are actually generated videos from a text prompt in Google's new model. Kind of crazy. Also, this model models sound jointly, so it can output audio along with the video frames. This is another generated one. You can tell it what you want to happen in text: it'll fly over here and, like, look crazy. Okay, so I thought that would just be fun, to incorporate new stuff.
[01:02:34] So one big problem with diffusion is that sampling is really slow, right? We said that sampling is this iterative procedure, and these models can be really big: models with tens of billions of parameters, potentially operating on sequence lengths of tens of thousands or more. So these things get really slow at inference time, because even with rectified flow you need tens of iterations of the model at inference time. The solution is a category of algorithms called distillation, which we don't have time to get into; I just wanted to put a couple of references here to make you aware that this exists as a set of techniques.
[01:03:05] Distillation algorithms are basically ways that you can take a diffusion model that normally would take, you know, 30, 50, 100 iterations at inference time to get good samples, and then modify the model in some way such that you can take many, many fewer steps at inference and still get good samples. They tend to sacrifice sample quality, so the whole trick in distillation methods is trying to maintain the sample quality as well as you can while still taking fewer steps at inference time. Some distillation methods let you get all the way down to single-step sampling, which is really cool, although they tend to take quite a hit on generation quality when you do that.
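As a cartoon of what distillation is doing: a teacher that needs many Euler steps to integrate the learned velocity field produces targets that a student is trained to match in a single call. This is a simplified, progressive/consistency-style sketch of the idea (the target-generation half only), not a specific published method:

```python
import numpy as np

def teacher_rollout(v_teacher, x_t, t, n_steps):
    # Integrate the teacher's velocity field from time t down to 0
    # with n_steps Euler steps; the result is the student's target.
    x, dt = x_t, t / n_steps
    for i in range(n_steps):
        x = x - dt * v_teacher(x, t - i * dt)
    return x

def distillation_loss(v_teacher, student_sample, x_t, t, n_steps=32):
    target = teacher_rollout(v_teacher, x_t, t, n_steps)  # many slow steps
    pred = student_sample(x_t, t)                         # one fast step
    return np.mean((pred - target) ** 2)
```

With the rectified-flow convention v = z - x, stepping x ← x - dt·v moves from noise back toward data; the student learns to jump there directly.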
[01:03:39] I'm not going to go through these, but I just put some references here to different papers on distillation if you want to take a look. This is a really active and evolving area of research: if you look at these references, they're from 2024 and 2025. This is stuff that people are working on right now: how do we get better distillation, and how do we make diffusion models more efficient at inference time? Another thing: I mentioned that diffusion has this black hole of math that you can get sucked into, and we intentionally sidestepped that by walking very intuitively through rectified flow models, giving you a geometric intuition for the problem without really diving into any math that proves anything.
[01:04:20] So I wanted to give you just a brief sense of what some of these formalisms are, but we're not going to be able to go through them in detail. Here's a restatement of the rectified flow objective. We said that during training we're going to sample our x's and our z's according to our data distribution and our noise distribution. We're going to sample t according to some distribution p_t that we choose, either uniform, logit-normal, shifted, something like that. And then we'll set x_t equal to the linear interpolation between x and z. Now we've written this a little bit differently on this slide: we're writing down a ground-truth velocity v_gt that we want the network to predict, which is z minus x.
[01:04:58] We compute a predicted v from the network by passing it our noisy x_t and our t, then compute an L2 loss between v_gt and the predicted v from the network. So when I said that there are a lot of different formalisms, a lot of different flavors of diffusion, what a lot of these come down to is different functional hyperparameters in this general setup. In more generalized flavors of diffusion, you might vary what this p_t distribution is. Usually you don't vary the noise distribution; this is almost always Gaussian, at least for continuous models. But what you will vary is how you compute that noisy x_t, and in general that will be some linear combination of x and z, where the linear combination weights will in general be some function of t.
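The general setup just described can be written as one training step with pluggable pieces. The function names `a`, `b`, `c`, `d` for the four "slots" are our labels for clarity, not the lecture's notation, and the placeholder model stands in for a real network:

```python
import numpy as np

rng = np.random.default_rng(0)

def general_diffusion_loss(x, model, sample_t, a, b, c, d):
    """One generic training step: feed the model x_t = a(t) x + b(t) z
    and regress it onto the target c(t) x + d(t) z with an L2 loss.
    Different diffusion formulations fill these four slots differently."""
    z = rng.standard_normal(x.shape)   # noise distribution: almost always Gaussian
    t = sample_t()                     # t ~ p_t, a design choice
    x_t = a(t) * x + b(t) * z
    target = c(t) * x + d(t) * z
    return np.mean((model(x_t, t) - target) ** 2)

# Rectified flow fills the slots with the simplest possible choices:
# a(t) = 1 - t, b(t) = t, and a constant-coefficient target z - x.
rf_loss = general_diffusion_loss(
    x=rng.standard_normal(16),
    model=lambda x_t, t: np.zeros_like(x_t),  # placeholder network
    sample_t=rng.uniform,                     # p_t uniform here
    a=lambda t: 1 - t, b=lambda t: t,
    c=lambda t: -1.0,  d=lambda t: 1.0,
)
```

Swapping in other choices for the four slots recovers the other formulations discussed next.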
[01:05:44] But what exactly that function is depends on the diffusion formulation. What also varies is the ground-truth target that we ask the model to predict. It's always going to be some linear combination of our data sample x and our latent z, and again the linear combination weights might be functions of t in some formulations. Then we give the model that noisy x_t and the t, get a prediction, and always compute an L2 loss between the two; I mean, not always, but usually. And then what varies is basically these functional forms: what are the functions that we slot into these four different spots in this setup? In the case of rectified flow, it's fairly simple.
[01:06:28] These all take really simple forms, and c_t and d_t are actually just constants. There's another flavor of this called variance preserving, where you collapse these two into one scalar hyperparameter called sigma of t. Now you have these linear combinations in this particular way, and you choose this because if x and z are independent and have unit variance, then your output is also guaranteed to have unit variance. So that collapses these two functional hyperparameters into just one noise schedule, and then you still need to choose that somehow. Alongside variance preserving there's also variance exploding, another one where you set a_t equal to 1 and b_t equal to, again, some sigma of t, and you need to choose that somehow.
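The variance-preserving choice can be checked numerically. A small sketch (toy code of my own, assuming the combination x_t = sqrt(1 - sigma(t)^2)·x + sigma(t)·z): if x and z are independent with unit variance, then Var(x_t) = (1 - sigma^2) + sigma^2 = 1 for any sigma(t).

```python
import math
import random

def vp_sample(x, z, sigma_t):
    # Variance-preserving combination of data x and noise z.
    return math.sqrt(1.0 - sigma_t ** 2) * x + sigma_t * z

# Empirical check: draw many unit-variance x and z, confirm Var(x_t) stays near 1
# no matter what sigma_t is.
random.seed(1)
sigma_t = 0.6
samples = [vp_sample(random.gauss(0, 1), random.gauss(0, 1), sigma_t)
           for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The same check with a different sigma_t gives the same unit variance, which is exactly why this parameterization reduces two functional hyperparameters to one noise schedule.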
[01:07:08] There are a lot of different targets that people will choose. Sometimes they'll ask the network to predict the clean data. Sometimes they'll ask the model to predict the noise that was added. Sometimes they'll ask the network to predict some linear combination of the two. In the case of rectified flow, you are just predicting that velocity vector that points from data directly to noise. But in different flavors of diffusion, all of these can change. Then you might be wondering: choosing hyperparameters is bad enough, and now we need to choose hyperparameters which are themselves functions of t. This is crazy. You're never going to set these intuitively, so you have to be guided by some kind of math.
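One way to see that these target choices (clean data, noise, or velocity) carry the same information: under the rectified-flow interpolation x_t = (1-t)·x + t·z with velocity v = z - x, any one prediction can be converted to the others given x_t and t. A sketch (my own helper name, not from the lecture):

```python
def targets_from_v(x_t, t, v):
    """Recover the clean-data and noise estimates from a velocity prediction,
    assuming x_t = (1-t)*x + t*z and v = z - x."""
    x_hat = x_t - t * v            # (1-t)x + tz - t(z - x) = x
    z_hat = x_t + (1.0 - t) * v    # (1-t)x + tz + (1-t)(z - x) = z
    return x_hat, z_hat
```

So the choice of target changes the loss weighting across t more than the information the network must learn.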
Um and there's basically three different [01:07:45] there's basically three different mathematical formalisms that people um [01:07:47] mathematical formalisms that people um you that people think about when [01:07:49] you that people think about when training diffusion models that again we [01:07:50] training diffusion models that again we will not walk through in practice. I [01:07:51] will not walk through in practice. I just want to get you the to know the [01:07:53] just want to get you the to know the existence of. Um the first is that [01:07:55] existence of. Um the first is that diffusion is a latent variable model, [01:07:56] diffusion is a latent variable model, right? that we're going to we have our [01:07:58] right? that we're going to we have our clean data samples X0 but then [01:08:00] clean data samples X0 but then associated to every clean data sample. [01:08:02] associated to every clean data sample. There exists some sequence of corrupted [01:08:03] There exists some sequence of corrupted or noisy samples like that that [01:08:06] or noisy samples like that that correspond to that clean sample and we [01:08:08] correspond to that clean sample and we can't observe them. We don't know what [01:08:09] can't observe them. We don't know what they are but we need to figure them out [01:08:10] they are but we need to figure them out somehow. So that's a latent variable [01:08:12] somehow. So that's a latent variable model that ends up looking a lot like a [01:08:14] model that ends up looking a lot like a variational autoenccoder. Remember in a [01:08:15] variational autoenccoder. Remember in a variational autoenccoder we had a Z and [01:08:17] variational autoenccoder we had a Z and an X. We didn't observe the Z. We wanted [01:08:19] an X. We didn't observe the Z. We wanted to train this thing somehow. Um then it [01:08:21] to train this thing somehow. 
Um then it turns out you can turn a very similar ma [01:08:23] turns out you can turn a very similar ma use a very similar mathematical trick um [01:08:25] use a very similar mathematical trick um as we did in variational autoenccoders [01:08:27] as we did in variational autoenccoders and maximize some variational lower [01:08:28] and maximize some variational lower bound of the likelihood of the data and [01:08:30] bound of the likelihood of the data and that gives rise to this um latent [01:08:32] that gives rise to this um latent variable model interpretation of [01:08:33] variable model interpretation of diffusion. [01:08:35] diffusion. Another another a totally different [01:08:36] Another another a totally different interpretation of diffusion is that it [01:08:37] interpretation of diffusion is that it models something called the score [01:08:38] models something called the score function. Um so given a data given a [01:08:41] function. Um so given a data given a data data given a distribution p data of [01:08:44] data data given a distribution p data of x um there's this nice thing called the [01:08:47] x um there's this nice thing called the score function of the distribution which [01:08:48] score function of the distribution which is the derivative with respect to x of [01:08:50] is the derivative with respect to x of the log of p data of x and intuitively [01:08:52] the log of p data of x and intuitively this given a distribution the score [01:08:54] this given a distribution the score function is a vector field that points [01:08:56] function is a vector field that points towards areas of high of high [01:08:59] towards areas of high of high probability density. So um you know for [01:09:01] probability density. 
[01:09:03] So for any point in the data space, the score function is going to be a vector that points you towards areas of high data density. And now another interpretation of diffusion is that diffusion is learning the score function of the data distribution, and in fact learning a set of score functions corresponding to different levels of noise on the data distribution. So there's an interpretation of diffusion which is that it's trying to learn a family of score functions corresponding to a family of noised distributions that corrupt the true data distribution with increasing amounts of known noise. That's a totally different mathematical formalism that gives rise to a very similar-looking algorithm at the end. And then the third one, which has come onto the scene a little more recently, is this notion of diffusion as solving stochastic differential equations.
[01:09:44] And I've got to admit, I don't fully understand this one myself, so don't ask me too many questions. But the idea is that you want to write down some differential equation, some infinitesimal way to transport samples from a noise distribution into samples from a data distribution. Then at inference, the neural network is basically learning some kind of numeric integrator for this stochastic differential equation that we can write down. And this opens up a whole new way of thinking about it, right? Because under the lens of stochastic differential equations, we get access to whole different categories of methods to sample from these things at inference time.
[01:10:24] And from this perspective, the kind of naive gradient-descent-type approach that we saw in rectified flow basically corresponds to a forward Euler type of integrator on top of a stochastic differential equation. Under this interpretation, you can imagine using all kinds of more complicated integrators to maybe do a better job of marching along this score function. So again, these are deep waters; there are papers that go into great detail on all these things. And a blog post that I really liked is this one by Sander Dieleman, "Perspectives on Diffusion," which gives eight different perspectives on ways to think about or view diffusion models. This is an excellent post.
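Forward Euler, the simplest of the integrators mentioned above, is just repeated small steps along a velocity field. A toy sketch (my own code; in diffusion sampling the function f would be the network's predicted velocity, here it is a known ODE so the scheme itself is visible):

```python
def euler_integrate(f, x0, t0, t1, n_steps):
    """Integrate dx/dt = f(x, t) from t0 to t1 with forward Euler steps."""
    x, t = x0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        x = x + dt * f(x, t)   # one Euler step: follow the local velocity
        t = t + dt
    return x
```

On dx/dt = -x starting from x = 1, a thousand steps land close to the exact solution exp(-1); higher-order integrators would get closer with far fewer steps, which is exactly the appeal of the fancier samplers.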
[01:11:00] I would actually highly recommend everything he's written about diffusion models; all his blog posts are amazing. Autoregressive models actually come back too: we can do the same thing with an encoder-decoder and put an autoregressive model on there as well. So, just sneaking this in at the end: in addition to diffusion models, the other modern recipe for generative modeling is to train an autoregressive model on discrete latents that are computed by a discrete variational autoencoder. That's why we covered the four generative models that we did, GANs, VAEs, autoregressive models, and diffusion, because it turns out they all get used in modern machine learning pipelines. So that's basically the summary of today. Today we did a whirlwind tour of two different categories of generative models.
[01:11:44] We talked about generative adversarial networks as well as diffusion models. And we saw their modern full pipeline instantiated in latent diffusion models, which is a nice way to wrap up this generative modeling section, because all the generative models that we saw basically come back and come together to form these big modern pipelines. So thanks, and next time we'll talk about vision and language.

================================================================================ LECTURE 015 ================================================================================ Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 15: 3D Vision Source: https://www.youtube.com/watch?v=7lxrKDKtykM --- Transcript

[00:00:05] I'm really happy to announce our next guest speaker for the course, Professor Jiajun Wu. Jiajun is an assistant professor here at Stanford in the Department of Computer Science, and he's a faculty member of the Stanford Vision and Learning Lab.
[00:00:25] His research focuses on scene understanding, with an emphasis on multimodal perception, robotics and embodied AI, visual generation and reasoning, and 3D understanding, which is the topic of today's lecture. So I'll now turn it over to Jiajun to begin today's lecture.

[00:00:42] Okay. Yeah. So I'm Jiajun. I'm an assistant professor here, and a few years ago I used to co-teach this class. I heard this year is the 10th-year anniversary, right? So we have guest speakers from different places. Okay. So today we're going to talk about 3D vision. It might be kind of different from a lot of things you learned before, because in the past few weeks we talked about convolutional neural networks and transformers, and maybe vision-language models and generative models as well, right? Okay. Yeah.
[00:01:14] So here for 3D, I'm going to first introduce a little bit about what the 3D representations are. This is pretty distant from all the deep learning stuff, but then we're going to talk about how deep learning, or AI, has changed 3D vision and how they can be integrated in different ways, and we'll look into a few different applications around 3D generation, reconstruction, and things like that. Okay. So let's begin by looking at the possible ways to represent objects in 3D, because in 2D it's so straightforward. I just have pixels, right? I load a PNG file or a JPEG file, and it's like 200 by 200 pixels. But how can we represent 3D objects? I think that's the first thing you want to look into. And 3D objects can be diverse. They can be at different scales.
[00:02:02] They can be huge: large buildings and trees, complex structures. And if you zoom in, you can also see all the fine details. So what are the best 3D representations to represent all these different types of 3D objects, at different scales, with different features? Unlike images, where everyone just uses pixels (200 by 200, 500 by 500), 3D objects also have geometry, textures, and materials, but let's just start by looking at geometry. Even just for 3D object geometry, there are so many different ways to represent it. We can basically categorize them into two categories. One is called explicit representations, where, in some sense, you are directly, I would say explicitly, representing part of the objects.
[00:02:49] This includes things like point clouds, where you have a cloud of 3D points, or a polygon mesh, or subdivisions, which we're going to talk about, and others. And there's a different category of object shape representations which are often called implicit. We're going to talk about them as well; I'm going to explain them in a little more detail later, including level sets, algebraic surfaces, and distance functions. They basically represent 3D objects, or their geometries, as functions, which is not as intuitive as "oh, it's just a collection of points," but as we'll see later, implicit representations also have their own advantages and weaknesses.
[00:03:26] So every choice has its suitable tasks and types of geometry; in particular, in the context of deep learning, each may also have its own strengths and weaknesses when you want to apply a deep learning method on top of it. So how do we choose a representation? We have to store them: pixels are easy to store because an image is just a matrix, but 3D point clouds are more irregular, and especially if you use an implicit representation, representing an object as a function, how would you store that in a computer? How does it support creating new shapes, especially when, say, the input is a picture or a language description? And different types of operations: if you have a 3D object, how can you edit it, simplify it, smooth it, filter it,
[00:04:11] repair it, right? You have to do a lot more. For images, sometimes you want to do that too: you want to edit them using language, you want to edit them using strokes. So how can you edit or perform any type of operation on 3D objects? And rendering: how can you turn 3D objects into 2D pixels? In some sense, 3D vision is inverting that process, right? How can you go from 2D images to reconstruct the 3D objects? So how does a representation support all these different things, including animations, especially if you are modeling, say, 3D humans or animals and you want to animate them? All these factors need to be considered, and something that sort of connects all of these is their integration with different deep learning methods for, say, shape editing, rendering, inverse rendering, and animation as well.
[00:04:58] So very quickly, we can go through some of these representations, like point clouds. A point cloud is probably the simplest representation: it only has 3D points. It doesn't have connectivity, so it doesn't capture how these points are connected. Instead of having an N-by-N matrix of the pixel values of all the pixels in a picture, you now have a 3-by-N matrix, where three is the x, y, z coordinates of the individual points and N is the number of points. Sometimes you can represent the surface normals of the points as well, so that you have not only where each point is in 3D space but also which direction it is facing.
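The 3-by-N layout can be sketched in a few lines. This is a toy example of my own (not from the lecture): sampling N points on a unit sphere, where the surface normal at each point conveniently equals the normalized position, so the cloud comes with orientations for free.

```python
import math
import random

def sample_sphere_point_cloud(n, seed=0):
    """Return a point cloud on the unit sphere as a 3-by-N structure, plus normals."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        # Normalizing a 3-D Gaussian gives a direction uniform on the sphere.
        gx, gy, gz = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        r = math.sqrt(gx * gx + gy * gy + gz * gz)
        xs.append(gx / r); ys.append(gy / r); zs.append(gz / r)
    points = [xs, ys, zs]                  # 3 rows (x, y, z), N columns
    normals = [row[:] for row in points]   # on a unit sphere, normal == position
    return points, normals
```

Note there is no connectivity anywhere in this structure: just coordinates, and optionally a normal per point.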
[00:05:38] So you have the surface normals, which give you a bit more information, and sometimes people call these surfels: points with orientations. And why do you need surface normals? Because if you want to render them, you want to see how the object looks, and that means you often have to specify a lighting source, right? Where is the lighting coming from? To make the rendering look realistic, you have to consider how lighting coming from a certain direction is going to interact with the point, and this is where the surface normals are used to help make the rendering look realistic, like you can see here. So how can you get points? A benefit of the point cloud is that it is often the raw format you will get from a lot of 3D sensors.
[00:06:22] This includes these kinds of depth sensors and some 3D scanners, and nowadays you can even use your iPhone; ARKit and similar software let you scan 3D objects. But the raw output of those sensors is still basically 3D point clouds. Of course, after that you have to process them and fuse them to make, say, objects with textures. Since they often come from scanners, they can potentially be very noisy, and you want to fuse them, merge them, repair them, and in this part you have to consider how these different captures can be registered to give you the shared point cloud. And they're very flexible: because you can move points here and there, you can use them to represent basically any type of object geometry; you're not
[00:07:10] constrained by the topology or things like that. It's also useful for large datasets, because sometimes you have to consider a very diverse set of objects. But because the points are, in some sense, already sampled, if you're representing objects and your points are sampled in an uneven way, in the sense that you have a lot of points on, say, the head of the rabbit but very few points on the tail of the rabbit, then it will actually be hard to draw samples from these under-sampled regions.
[00:07:40] So sometimes when people consider sampling points, you have to design algorithms to make sure you sample them roughly evenly across different parts of the objects. There are other limitations too: it's not obvious how we can directly perform some very useful operations, like simplification or subdivision, on these objects. It doesn't directly allow you to do smooth rendering. There's no topological information. So for example here, if I give you a collection of points, you can't even tell if this is a torus or one of these other ring-like shapes, because it doesn't tell you how the points are connected. So it's kind of partial information about what the object is if you just have the point cloud.
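The even-sampling algorithms mentioned above are not specified in the lecture; one common choice is greedy farthest point sampling, sketched here on a made-up two-cluster cloud (a dense "head" and a sparse "tail").

```python
import numpy as np

def farthest_point_sampling(pts, k, seed=0):
    """Greedy farthest point sampling: pick k points that cover the
    cloud roughly evenly, regardless of the input sampling density.

    pts: (n, 3) array of points; returns indices of the k chosen points.
    """
    n = pts.shape[0]
    rng = np.random.default_rng(seed)
    chosen = [rng.integers(n)]
    # dist[i] = distance from point i to its nearest already-chosen point
    dist = np.linalg.norm(pts - pts[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))  # farthest from the current set
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(pts - pts[nxt], axis=1))
    return np.array(chosen)

# A cloud that is 10x denser on the "head" than on the "tail":
rng = np.random.default_rng(1)
head = rng.normal([0, 0, 0], 0.1, size=(1000, 3))
tail = rng.normal([5, 0, 0], 0.1, size=(100, 3))
cloud = np.vstack([head, tail])
idx = farthest_point_sampling(cloud, 16)
# FPS reaches the sparse tail even though the head dominates the input.
print((idx >= 1000).sum() >= 1)  # True
```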
[00:08:22] So naturally people will say, okay, how can I actually capture more information, so that I can distinguish between these two different objects? That naturally leads to polygonal meshes. A mesh represents the object still as a collection of points, but also how these points are connected. So now you have not only the points but also the faces, the surfaces. This is arguably the most widely used representation for 3D objects in all these graphics engines, in computer games and so on; basically it is all represented as polygon meshes.
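A minimal mesh is just the two arrays described here: vertex positions plus faces that index them. The tetrahedron below is a made-up toy example; the connectivity is exactly what the point cloud lacks, and from it you can recover edges and check Euler's formula.

```python
import numpy as np

# A minimal triangle mesh: vertex positions plus faces indexing them.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
faces = np.array([[0, 1, 2],   # each row lists 3 vertex indices
                  [0, 1, 3],
                  [0, 2, 3],
                  [1, 2, 3]])

# From the faces we can recover the edges -- the connectivity a raw
# point cloud does not have.
edges = set()
for f in faces:
    for a, b in [(f[0], f[1]), (f[1], f[2]), (f[2], f[0])]:
        edges.add((min(a, b), max(a, b)))
print(len(vertices), len(faces), len(edges))  # 4 4 6
# Euler's formula for a closed genus-0 mesh: V - E + F = 2
print(len(vertices) - len(edges) + len(faces))  # 2
```

This is why a mesh can distinguish a torus from other ring-like point sets: the connectivity (and hence V − E + F) changes with the topology, while the bare points may not.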
[00:08:54] But you can see that representing the faces is more complex, because, especially if you're looking at raw meshes, every face may have a different number of points: some have three points, some have four, some have five. How can you represent them, especially given their irregularity? How can you integrate them with neural networks? Especially in the early stage, when people started with convolutional neural networks, those always assume a fixed resolution, but here you have a variable dimension for this raw information. How does that integrate with deep learning? That's been a big challenge, and that's why deep learning with 3D vision started kind of late: people were thinking about how we can adapt all these deep learning methods to deal with all these
[00:09:35] complex representations for objects, which are not as unified as images. But meshes are really widely used, and they can be very complex meshes that capture all the details. For example, you have scanners, you get points, then you fuse them and apply some algorithm, and you can get a very detailed mesh. This one has about 56 million triangles and 28 million vertices to represent a sculpture. And you can have even larger ones: say, Google Earth has trillions of triangles trying to represent basically all the buildings on Earth. The nice thing about meshes is that they support a lot of operations, like subdivision: I want more details, so how can I use more faces to capture more details of the shape? And you can do simplification as well. Sometimes you want to process things very fast.
[00:10:27] So I don't need that many faces; I just want to simplify the mesh, and there are existing algorithms that allow you to do that as well. And regularization: if you get an irregular mesh, sometimes you want to regularize it so that every face is a triangle, always connecting three vertices, and the faces have roughly the same size, so that it's easier for processing and has good properties that support further processing by different graphics algorithms. For meshes, people have developed these algorithms as well, so you can ensure that points in different regions are roughly evenly sampled, so it won't be the case that, say, the head of the rabbit is much more densely sampled than the tail, and these kinds of things.
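The subdivision operation mentioned above can be sketched in its simplest form: midpoint subdivision, which splits every triangle into four by inserting a vertex at each edge midpoint. (Smooth schemes such as Loop subdivision also reposition vertices; this sketch only refines connectivity.)

```python
import numpy as np

def midpoint_subdivide(vertices, faces):
    """One round of midpoint subdivision: split every triangle into 4
    by inserting a new vertex at each edge midpoint."""
    verts = [np.asarray(v, dtype=float) for v in vertices]
    midpoint_of = {}
    def mid(a, b):
        key = (min(a, b), max(a, b))
        if key not in midpoint_of:         # create each edge midpoint once
            midpoint_of[key] = len(verts)
            verts.append((verts[a] + verts[b]) / 2.0)
        return midpoint_of[key]
    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
        new_faces += [[a, ab, ca], [b, bc, ab], [c, ca, bc], [ab, bc, ca]]
    return np.array(verts), np.array(new_faces)

# Subdividing a tetrahedron (4 vertices, 6 edges, 4 faces):
vertices = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])
v2, f2 = midpoint_subdivide(vertices, faces)
print(len(v2), len(f2))  # 10 16 -- one new vertex per edge, 4x the faces
```

Simplification is the inverse direction (e.g. edge-collapse decimation); existing libraries implement both.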
[00:11:12] Okay, so this is one type of shape representation, and there are other types. For example, parametric representations, because objects are not just totally irregular. Points and meshes are very general, but sometimes you lose a lot of information: if you look at, say, your chairs or tables, you have all these straight lines, right? So how can you represent these straight lines? When people design objects, they often use some of these parametric representations, where you can represent shapes as a function. If I want to represent a surface or a curve, the underlying degree of freedom is actually lower.
[00:11:52] Often if I have a curve, there's only one underlying degree of freedom. That's why I can represent a curve using a function f(x): just vary x and get a value of y. So you can use all these different types of functions in 2D, but also, more often, in 3D, to map a certain number of variables, the underlying intrinsic dimensionality of the object, which is often, say, two or even one, into 3D space. This allows you to represent a 3D object in a parametric representation using basically a set of functions. You can do that for curves, say for circles: if you want to represent a circle, one way is to sample a number of points, or you can even connect them like a mesh using lines.
[00:12:39] Another way is to represent the curve, the circle, as a function: basically there's a sine function and a cosine function, and you just vary one variable, t, which you can think of as the degrees or the angle, and it maps to all the points on the circle. So now you can use a function as a parametric representation for a curve in 2D, and of course you can do that in 3D as well. If you want to represent a sphere, all you need is two degrees of freedom, u and v, and then you go through these functions to map them to every point in 3D space on the sphere. People have designed, I'm not going to detail it here, more complex parametric representations like Bézier curves and Bézier surfaces, which allow you to represent these pretty flexible and smooth surfaces in 3D using basically a few
[00:13:27] control points. So you basically use these Bézier basis functions to capture the underlying lower dimensionality of these surfaces, and then you can map that underlying low dimensionality into these flexible shapes. They also allow you to do things like subdivision, so you can get more detail into the surfaces and make them more fine-grained, and things like that. Okay, so that would be the second category of shape representations: you can represent 3D objects in a non-parametric way, like a collection of unordered points, or with their connections as meshes, or you can represent them in a parametric way, where you have a function, and by varying a few parameters that are the underlying true degrees of freedom of the object geometry, you can map them into more complex shapes.
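The control-point idea can be sketched with de Casteljau's algorithm, which evaluates a Bézier curve by repeated linear interpolation of the control points. The four 3D control points below are made up for illustration.

```python
import numpy as np

def bezier(control_points, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] using
    de Casteljau's algorithm: repeatedly interpolate adjacent control
    points until one point remains. Works for any number of control
    points and any dimension (2D or 3D)."""
    pts = np.asarray(control_points, dtype=float)
    while len(pts) > 1:
        pts = (1.0 - t) * pts[:-1] + t * pts[1:]
    return pts[0]

# Four made-up 3D control points -> a smooth cubic curve.
ctrl = [[0, 0, 0], [1, 2, 0], [3, 2, 1], [4, 0, 1]]
print(bezier(ctrl, 0.0))  # [0. 0. 0.]  (curve starts at the first control point)
print(bezier(ctrl, 1.0))  # [4. 0. 1.]  (and ends at the last one)
print(bezier(ctrl, 0.5))  # [2.  1.5 0.5] -- a point on the smooth interior
```

A Bézier surface extends the same idea to a grid of control points with two parameters (u, v) instead of one t.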
[00:14:16] So basically, everything here, as I said: if you remember, at the very beginning we said there are two types of ways to represent object geometry, one is explicit and the other is implicit, and all of these fall into the category of being quite explicit. I have points, and the points are just directly points on the object; and the surfaces, or the parametric curves as well, map directly to the points on the object. So these explicit representations have a lot of benefits. First, you map all the points directly, so you can get all these points. In general, say I have a Bézier surface representation: I can sample two values, u and v, in this underlying low-dimensional space, and then, going through that function, I
[00:15:04] map it to a point in 3D space. So I directly get a point on the surface in 3D space; all points are given, in some sense, directly. So it's very easy for us to sample points. Let's say I have this torus, represented using this function f, and now my question is: can you just sample some points on the surface of the object for me? This is so easy, because I will just randomly sample some u, v values, let them go through this function, and it will compute and give me 3D points which are guaranteed to be on the surface of the object. So sampling is much easier. What is hard about these explicit representations?
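The sampling step just described can be sketched directly. The lecture doesn't give the torus's formula, so the standard parameterization is assumed here, with made-up radii R and r; every randomly sampled (u, v) pair is guaranteed to land on the surface.

```python
import numpy as np

def torus_point(u, v, R=2.0, r=0.5):
    """Standard parametric torus (assumed example surface): major radius
    R, tube radius r; (u, v) in [0, 2*pi)^2 map onto the surface."""
    x = (R + r * np.cos(v)) * np.cos(u)
    y = (R + r * np.cos(v)) * np.sin(u)
    z = r * np.sin(v)
    return np.stack([x, y, z], axis=-1)

# "Sample some points on the surface for me": draw random (u, v) and
# push them through the function -- every output lies on the torus.
rng = np.random.default_rng(0)
u, v = rng.uniform(0.0, 2.0 * np.pi, size=(2, 500))
pts = torus_point(u, v)  # (500, 3) surface points

# Sanity check with the implicit torus equation:
# (sqrt(x^2 + y^2) - R)^2 + z^2 = r^2 for every sampled point.
lhs = (np.hypot(pts[:, 0], pts[:, 1]) - 2.0) ** 2 + pts[:, 2] ** 2
print(np.allclose(lhs, 0.5 ** 2))  # True
```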
[00:15:48] The hard thing is that it's very hard, in some sense, to test whether a point is inside or outside the object. Similarly, if I represent a sphere as this function, it's easy for me to sample points on the sphere, but it is hard if I have a query: say I have this point (3/4, 1/2, 1/4) in 3D space, is it inside the object or is it outside the object? You know, I think we can maybe... actually, I'm not even sure about that. So it is actually kind of hard to test whether a certain point is inside or outside the object.
[00:16:28] So you can see that all these representations have their own strengths and weaknesses. For explicit representations, it's actually pretty easy to sample points, which is very useful, because sometimes you want to convert them into, say, a collection of points, and then apply whatever point-based neural network on it. But it's hard to test if a certain point is inside or outside the object, which may cause some issues. For example, if you want to use neural rendering methods: nowadays a lot of these neural rendering methods require a lot of these kinds of queries about whether a point is inside or outside the object, what the geometry or density of the object is at a particular point, what the material, or the radiance or color, of the object is at a particular point.
[00:17:07] So explicit representations are not very supportive of this; it's not easy to run these operations on explicit representations. So naturally people thought, okay, maybe we can come up with a different way to represent geometry, and here I say implicit representations of geometry, but as you will see later, a lot of these neural rendering or deep learning methods just extend these implicit representations to not only geometry but also the colors and appearance of objects in 3D. The idea of these implicit representations is that I want to classify these points: I assume that if the points are on the object, on the surface of the object, then they satisfy some certain relationship.
[00:17:47] For example, for a unit sphere, what are the points on the sphere? The constraint they satisfy is that the square of x, the square of y, and the square of z, when you sum them up, equal one. So this is the constraint satisfied by all the points on the sphere. More generally, you can write it down as: the constraint is some function f(x, y, z) = 0. In this case the function would be x^2 + y^2 + z^2 - 1, so that would be the function here. But more generally, even for complex shapes, sometimes these functions can be so complex that you don't even have a closed form. So how can I represent f? I just write it as a neural network.
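The sphere constraint just written down can be coded in one line; checking it on the query point (3/4, 1/2, 1/4) from a moment ago also previews why the sign of f is useful (for this f, interior points give negative values, exterior points positive).

```python
import numpy as np

def f_sphere(p):
    """Implicit unit sphere: f(x, y, z) = x^2 + y^2 + z^2 - 1.
    f == 0 on the surface, f < 0 inside, f > 0 outside."""
    p = np.asarray(p, dtype=float)
    return float((p ** 2).sum()) - 1.0

# The inside/outside query from the lecture: the point (3/4, 1/2, 1/4).
print(f_sphere([0.75, 0.5, 0.25]))  # -0.125, i.e. -1/8 -> negative, so inside
print(f_sphere([2.0, 0.0, 0.0]))    # 3.0 -> positive, so outside
```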
[00:18:31] My hope is that a neural network will be able to represent it. But in general, the idea is that you have some function, or some constraint, that the points on a certain object will satisfy, and this is the way you represent the object. This is called an implicit representation, which started with geometry, but as I said, it is now used in all these different ways: representing textures, materials, appearance, and all these things.
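Writing f as a neural network just means replacing the closed-form expression with a learned function of (x, y, z). Here is a minimal, untrained sketch of that interface (in DeepSDF-style work the weights would be trained so that f is zero on the surface; the architecture and sizes below are assumptions, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny hypothetical MLP f(x, y, z) -> scalar, standing in for a
# learned implicit function. Weights are random, i.e. untrained.
W1, b1 = rng.normal(size=(64, 3)), np.zeros(64)
W2, b2 = rng.normal(size=(1, 64)), np.zeros(1)

def f_neural(p):
    """One hidden ReLU layer; returns one scalar per 3D query point."""
    p = np.asarray(p, dtype=float)
    h = np.maximum(0.0, W1 @ p + b1)
    return float(W2 @ h + b2)

# The interface is the whole point: any 3D query in, one value out,
# exactly like the closed-form implicit function.
print(f_neural([0.75, 0.5, 0.25]))
```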
[00:18:54] So the good thing about implicit representations... oh sorry, let's start with the bad thing. The bad thing about implicit representations is that it's now actually much harder to sample points, right? I tell you: okay, this is the constraint that, let's say, this torus satisfies. For every x, y, and z, if I put it into this function and the output is zero, then yes, it must be on the surface of this object. But then how can I get a few of these (x, y, z) tuples? That would be very hard, because you're required to solve this function. Maybe this function is not too hard to solve; maybe you can still solve it using some high school math. But when the function gets really complex, for arbitrary shapes, it becomes much harder to solve these functions.
[00:19:34] So it's not easy to actually sample points on the surface of the object if you are representing it implicitly. But the benefit, the strength of that, is that it's now actually pretty easy to test whether a point is inside or outside the object. If I want to do that test, I just have a query, and this is so easy: is it inside or outside? I just send it into that function and get a value: is it below zero or above zero? Because I assume the object is represented by this function, and all the surface points on the object satisfy the function equals zero, anything whose output value is lower than zero, that is, negative, must be inside the object; and if the output value is positive, then the point must be outside the object. So now it becomes much easier to
test whether a certain point is inside or outside the object, although it becomes much harder to sample a number of points on the surface of the object. So you can see there's a clear trade-off between these implicit and explicit representations. Here, again, we are talking about geometry, but this distinction, the contrast between explicit and implicit representations, is, I think, very important and fundamental; it is behind the development of deep neural networks when they are applied to 3D data in general, as we'll see later. Okay. So we're at 25 minutes; I promise I'll spend no more than another five minutes, and then we're going to talk about deep learning. So before we talk about how deep learning can be applied to 3D representations in general, a little bit more on implicit representations: some other features of implicit representations.
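The trade-off described above (easy membership test, hard surface sampling) can be made concrete. Testing a point is one function evaluation, while sampling a surface point means solving f = 0; the bisection below is one generic strategy for that, my illustration rather than anything from the lecture:

```python
import random

def f(x, y, z):
    """Unit sphere as a stand-in for an arbitrary implicit shape."""
    return x**2 + y**2 + z**2 - 1.0

def is_inside(p):
    """Easy direction: one evaluation answers inside vs. outside."""
    return f(*p) < 0.0

def sample_surface_point(n_steps=60):
    """Hard direction: find a root of f by bisecting along the segment
    between a known inside point and a random outside point."""
    lo = (0.0, 0.0, 0.0)                                   # f(lo) < 0
    hi = tuple(random.uniform(-2.0, 2.0) for _ in range(3))
    while f(*hi) <= 0.0:                                   # retry until outside
        hi = tuple(random.uniform(-2.0, 2.0) for _ in range(3))
    for _ in range(n_steps):
        mid = tuple((a + b) / 2.0 for a, b in zip(lo, hi))
        if f(*mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return mid

print(is_inside((0.5, 0.0, 0.0)))    # True
p = sample_surface_point()
print(abs(f(*p)) < 1e-9)             # True: numerically on the surface
```

For the sphere this is overkill (high-school math suffices, as the lecture says), but the bisection keeps working even when f has no closed form.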
[00:21:06] The good thing about them is that it's easy to compose them, right? Sometimes you feel like: if I have to represent everything with a function, that seems great if I have a closed form, but it also seems very constrained, because for every closed form I can write out, the geometry looks very, very regular. So if I want to represent the shape of a cow, how would I represent that? What would be the function I could write for the shape of a cow? It's just not obvious. But the nice thing about implicit representations is that you don't have to write everything in one shot, because it's so easy to compose them, right? You can actually perform logical operations on these implicit functions. Let's say you have two objects and you want to find their union, or intersection, or difference: again, they're just values, right?
[00:21:46] So you put (x, y, z) into this function, you get a value. You put (x, y, z) into that function, you get a value. You can just do arithmetic operations on top of these values, and that allows you to compute the union, intersection, or difference between these objects, and eventually you can compose them to develop pretty complex shapes. This actually supports a lot of industrial design: when people are doing manufacturing and have to fabricate some complex shape, a lot of these designs are done with CAD models, computer-aided designs, which compose these implicit functions using simple logical operations. And you can also do things beyond just logical operations; you can even add things up, especially if you have
a distance function, where every point's positive or negative value actually has meaning, because it indicates how far you are from the surface of the object. So you can even add them up, and this allows you to smoothly blend shapes, right? You can see that here: if I have a distance function and I just want to represent a vertical line, okay, this is here; then anything below zero is to the left of the line, and anything positive is to the right of the line. And then you have another line, represented using a different function. So what happens if you add them up? If you add them up, it naturally becomes an interpolation between these two shapes, right? This is an example of doing things in 1D, but you can imagine doing similar things in 3D, in a sense.
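Both ideas fit in a few lines. With the sign convention "negative means inside", union and intersection of two implicit functions are commonly taken as pointwise min and max, and difference as max(f, −g); blending really is just adding values. The shapes and names here are illustrative, not from the slides:

```python
def sphere(cx, cy, cz, r):
    """Implicit function of a sphere centered at (cx, cy, cz)."""
    return lambda x, y, z: (x - cx)**2 + (y - cy)**2 + (z - cz)**2 - r**2

# Logical composition of implicit functions (negative means inside):
def union(f, g):        return lambda x, y, z: min(f(x, y, z), g(x, y, z))
def intersection(f, g): return lambda x, y, z: max(f(x, y, z), g(x, y, z))
def difference(f, g):   return lambda x, y, z: max(f(x, y, z), -g(x, y, z))

a = sphere(0.0, 0.0, 0.0, 1.0)
b = sphere(1.0, 0.0, 0.0, 1.0)
print(union(a, b)(1.5, 0.0, 0.0) < 0)          # True: inside b, so inside the union
print(intersection(a, b)(1.5, 0.0, 0.0) < 0)   # False: not inside both

# Additive blending of two 1D distance functions (two vertical lines):
d1 = lambda x: x - 1.0         # zero set at x = 1
d2 = lambda x: x - 3.0         # zero set at x = 3
blend = lambda x: d1(x) + d2(x)
print(blend(2.0))              # 0.0: the blended "line" sits in between
```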
[00:23:13] Okay, now you can actually blend these different shapes, and these distance functions can be arbitrarily composed, allowing you to create actually pretty complex worlds like this. And this is not easy, but you can construct really complex worlds, with all the details, just by composing these different functions; it's not trivial, but they are actually very expressive if you're very good at it. Okay. So we said: we have parametric representations that are explicit, which directly give you points on a 3D surface, or we can have parametric representations like these functions, but they're implicit, right?
[00:23:53] With those, you can only try to verify whether a point is inside or outside an object, but you can also compose them to build more complex shapes. Is it possible for us to also have representations that are implicit and nonparametric, point-cloud style, but where you can still query functions? Sometimes we actually do have things like that, and this eventually leads to methods like level-set methods. So implicit surfaces are very nice because, as we said, it's easy to merge them and it's easy to split them, but sometimes, as we said, it's hard to describe complex shapes in closed form, right? You have a cow: how would you represent it? Okay, you can compose them.
[00:24:26] But you know, if every time I have to query whether a certain point is inside the cow, I need hundreds of functions and have to perform all these and/or, plus/minus operations, then it takes a long time. So what if I just pre-query, right? I have a 3D space, and I just sample, let's say, a 100 by 100 by 100 grid. So now I have a million points pre-sampled, and for these one million points I just precompute whether they're inside the object or outside the object, and what the distance of each of these one million points is to the surface, for complex shapes. So you can precompute them and then store all the values in a matrix. This is shown in 2D for visualization, but in practice it's in 3D, right? So you have a 3D matrix that stores all these precomputed values of the distance functions.
[00:25:12] So now, in some sense, you still have an implicit representation, but because you have pre-queried it, you have turned it into a nonparametric representation. And even if you just look at this matrix in 2D, you can still find where the boundaries are. So where are the boundaries? They're basically where you have two adjacent values, one positive and one negative, right? That means there must be some point in between that satisfies the function f(x) = 0, which means the point must be on the surface. Right? So in that sense you're turning a parametric representation, which is implicit, into a nonparametric representation by pre-querying a lot of these points using the functions, and this actually gives you more explicit control, because you can now visualize them. You can say: I
have this matrix, and I can visualize it based on the values. And this is used a lot in things like CT and MRI and all this medical data. And a related thing is, people may say: okay, what if I don't care about all these distance values? I can pre-query what's going on at all these points and compute all the values, let's say plus five, minus five, but all I care about is whether a point is inside the object or outside the object, right? So if it's positive, I'll just treat it as one; if it's negative, which means it's inside the object, treat it as zero.
[00:26:32] Let's say you binarize them: then this gives you a final representation, which is arguably the easiest to understand. This is called voxels, right? So you pre-query where the implicit function is, and you have this kind of densely sampled grid; but now, instead of storing the distance functions, how far the points are from the surface, by going through the functions and getting plus five or minus five, you just binarize it. You only care about whether a certain point is inside the object or outside the object. Then you have a voxel representation, which is again like a 3D matrix, maybe 100 by 100 by 100, but for every point you have gone through the function and queried whether it's inside or outside the object, so you have a one or a zero, and you can represent objects in a binarized way. So this gives you the final representation I'm going to talk about for objects in 3D.
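The whole pipeline just described (pre-query a grid, find the boundary from sign changes, binarize into voxels) is short to sketch; a 32³ grid and a unit sphere stand in here for the 100³ grid and an arbitrary complex shape:

```python
import numpy as np

n = 32
xs = np.linspace(-1.5, 1.5, n)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")

# Pre-query the implicit function at every grid point and store the
# values in a 3D matrix (a discretized signed-distance-like field).
F = X**2 + Y**2 + Z**2 - 1.0

# The surface lies between adjacent samples of opposite sign:
crossings_x = np.sign(F[:-1]) != np.sign(F[1:])
print(crossings_x.any())            # True: the boundary was located

# Binarize: keep only inside (1) vs. outside (0) -> a voxel grid.
voxels = (F < 0).astype(np.uint8)
print(voxels.shape)                 # (32, 32, 32)
print(voxels.mean())                # occupied fraction, roughly
                                    # sphere volume / cube volume
```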
[00:27:19] So I have introduced voxels in a kind of complex way, but from a different perspective, people may say this is actually very easy to understand, because in some sense voxels have a lot of analogy to pixels: pixels are like 2D matrices, and now you have 3D matrices, and voxels are basically just a 3D matrix, right? Although you can see that they have connections with all the other ways we can represent shapes, the reason I'm introducing it this way has to do with what happened when deep learning came in. So first, when did deep learning start? Deep learning has been around for a long time, but the modern deep learning era started around 2010, when Geoff Hinton and colleagues started doing it for speech recognition, and then in 2012 there was AlexNet, which ran on ImageNet.
[00:28:03] So you've learned all of these, and they're all in 2D. Okay. Now people say: okay, what if I want to do this in 3D, right? This is a very natural thought. So I want to go from 2D convolutional networks (in 2012 there were no transformers, right?): how can I apply a 2D convolutional network to 3D data? And everyone knows we have all these different 3D representations. But which one to begin with, right? And it turns out that the people who started doing deep learning on 3D data were the computer vision people, not the graphics people. They were like: I've been working with pixels, and maybe the easiest thing I can do is just to scale up; instead of working on 2D matrices, I just make it work on 3D matrices. So that would be the simplest thing I can do.
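In a modern framework that really is close to a one-line change (e.g. swapping a 2D convolution layer such as PyTorch's `nn.Conv2d` for its `nn.Conv3d` counterpart). Written out by hand, the volumetric convolution is the same sliding-window sum with one extra loop; a naive sketch, not an efficient implementation:

```python
import numpy as np

def conv3d(volume, kernel):
    """Naive 'valid' volumetric convolution: the direct 3D analogue of
    sliding a 2D filter over pixels, applied to a voxel grid."""
    D, H, W = volume.shape
    d, h, w = kernel.shape
    out = np.zeros((D - d + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+d, j:j+h, k:k+w] * kernel)
    return out

vox = np.random.rand(16, 16, 16)        # a toy voxel grid
kernel = np.ones((3, 3, 3)) / 27.0      # 3x3x3 averaging filter
out = conv3d(vox, kernel)
print(out.shape)                        # (14, 14, 14)
```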
[00:28:43] Instead of having a 2D convolution in the network, I have a volumetric convolution in the network. Then which of these representations allows you, or supports, a volumetric convolution? It turned out to be this voxel representation. This is basically the easiest thing you can imagine, right? But the graphics people did not agree with that, because the graphics people were like: oh, this voxel representation is really bad, because it's very slow to compute, as we talked about; we have to pre-sample all these values, and you can look at the quality, it's so bad compared with meshes or point clouds. So people were like: why do you even want to start with that?
But the reason people started doing deep learning on 3D data with voxels is, I think, that it's just so easy to draw an analogy between pixels and voxels, and you only have to change one kind of code: instead of doing a 2D convolution, you now do a 3D convolution, right? So that's, in some sense, how things got started. Okay, but before I talk about the different methods for 3D data, another aspect that's very important is the data for 3D. Yeah, so beyond methods, datasets are also very important; ImageNet really prompted AlexNet and things like that. So for 3D, similarly, we have to collect a lot of data as well. So pre-deep learning, the common, popular dataset people often used is this thing called the Princeton Shape Benchmark, which has 1,800 models in 180 categories.
[00:29:57] So you can see they actually have quite a lot of categories, 180 categories, but there are only 1,800 models, which means there are basically 10 models per category, which is so small. But back then it was considered pretty large, and people felt like: oh, this is already enough, because we can't really make any of these things work well on them anyway, and there was very little machine learning there. So prior to 2014, all these datasets were more or less small. They may have a certain number of models, even up to 9,000 or 10,000, but they're also divided into so many different classes, so each class only has like 10 models, or fewer than 100, I would say. So after that, people started by saying: okay, if we have ImageNet, we should also have the 3D datasets for shapes. So this is behind the efforts
So this is uh behind efforts of a few concurrent work but really I [00:30:42] of a few concurrent work but really I think eventually they sort of [00:30:43] think eventually they sort of consolidated into this thing called [00:30:44] consolidated into this thing called shapenet which is a lot of them are [00:30:46] shapenet which is a lot of them are actually led by Stanford you know [00:30:48] actually led by Stanford you know there's Leo Gibbus and Sylvio Sarasi um [00:30:52] there's Leo Gibbus and Sylvio Sarasi um so they led this kind of large data sets [00:30:54] so they led this kind of large data sets called shapenet which has three million [00:30:56] called shapenet which has three million models and so but in practice just like [00:30:59] models and so but in practice just like image net you have this large image and [00:31:00] image net you have this large image and there's a smaller data set that people [00:31:02] there's a smaller data set that people often use. So shapeet similarly you have [00:31:04] often use. 
[00:31:05] For ShapeNet, similarly, you have the ShapeNetCore dataset, which is what people typically use: basically 50,000 models in 55 categories. You can see that for every category you have 1,000 models on average, but in practice it's not that balanced; for chairs you actually have a lot more. That's why people said: now I finally have thousands of models of chairs, I can train some deep networks on them. Before, with just 10 models, you couldn't do anything. So that is how things started. [00:31:28] And there were a few years where a lot of these advances, and all the results, were presented only on chairs and cars, because those are the largest categories in ShapeNet. People felt that was great, but it's not enough, so we should move to something even bigger. [00:31:44] So in the past few years, and this is work at AI2, the Allen Institute, in Seattle, what they did is they
collected [00:31:49] much larger datasets called Objaverse and Objaverse-XL, where you have roughly 1 million or 10 million models of different 3D assets. You can see they have many more categories, and the models on average also have higher quality, with textures. [00:32:05] So those are synthetic datasets, but there are also real datasets being produced, including some from 3D scans: you just take 3D scanners. Back in 2016 people were already working on this; there is a dataset called, I think, the Redwood dataset or something, where you have about 10,000 scans of real-world objects. [00:32:24] And more recently, people have been building larger datasets where they also encourage people (I think this is a collaboration between Meta and Oxford) to capture data for them.
[00:32:38] They also pay people to capture data for them. So people just use an iPhone: you have an object, you put it on a table, you take a 360-degree video around the object, and then you get a dollar or something like that. This is the first version: they have 19,000 videos of objects. [00:32:53] Now, these are real objects, right? Capturing real objects is much harder; Objaverse and all the things I talked about before were synthetic objects, but these are real objects. [00:33:01] And then, because of a lot of the development in 3D vision algorithms, you can actually take these 360-degree videos and try to reconstruct the 3D objects. So now you have paired data: the videos or images of the objects, as well as their 3D geometries and textures. [00:33:27] This is their first version.
[00:33:30] I think they have a more recent version, V2 or maybe even V3 by now, which is supposed to be a little larger, but it's still kind of hard to scale up. Think about it: right now you have something like 19,000 videos, or basically 19,000 objects, and I think they're scaling it up, but I don't think it's over 100,000. [00:33:45] So basically you can think of it this way: for real objects, you have on the order of 100,000 models. But if you look at the dataset sizes for images, like LAION-5B or whatever, that's 5 billion images, and Google and OpenAI must have much larger datasets. So there's still a huge gap between the number of data points you can have for 2D images or videos and what you can have for 3D objects. [00:34:07] So I think that's a big challenge for how we can move forward with 3D vision, and people have different ideas. But still, this is much larger than what
[00:34:14] we had before; at least it's now possible to more or less train some deep learning models on these datasets. [00:34:22] And quickly: there are also other datasets that people have built around parts. This is also from Stanford, where they tried to annotate object parts and their correspondences and hierarchies. And there's this dataset called PartNet, where they wanted to annotate not only the parts and their semantics but also how they may move, a little bit of mobility information for the different parts; a laptop, for example, you can open and close. [00:34:47] And there are also datasets for 3D scenes, so not only objects and parts but also rooms. So there are things like the ScanNet datasets, where people actually just go inside your home, or go inside
[00:35:01] our office as well: they come with a 3D scanner, they scan the place, and then they add some annotations. [00:35:11] More recently, you can even do that with your iPhone. But still, these kinds of datasets are much smaller. The first version of ScanNet has 1,500-plus scans, I think, and the second version is roughly the same size, maybe 2,000 or 3,000 rooms. So the amount of data you have for 3D scenes in particular is even much smaller than the amount you have for 3D objects. [00:35:34] So I think it's not obvious how we can go beyond that constraint, because if you have to scan everything yourself, you're always bounded by how much time you have and how many people you have. [00:35:45] Um, anyway, there are attempts being made in
trying to collect data. [00:35:52] Okay, and finally: if we want to apply deep learning to 3D vision, what are the tasks we care about? There is generative modeling: just as Justin said you can generate 2D images or videos, you can also generate 3D shapes, and you can generate 3D scenes. You can make them conditional, and the condition can be language or an image: you have an input image, and how can you reconstruct the 3D object? You have to learn shape priors. [00:36:19] You have to do shape generation and completion: sometimes you have a partial object and you want to repair it, you want to fix it. So there is geometry data processing as well. [00:36:29] Other tasks include discriminative models: for example, you have a 3D shape, and how can you classify which category of object it belongs to? Is it a chair or a table?
[00:36:38] And a lot of these are now actually done by rendering into pixels, right? Because you have very good image recognition models, like GPT or something, you can just take a 3D object, render it into a picture, upload the picture to GPT, and it can do it for you. So that's in some sense one way of solving these discriminative problems. [00:36:56] But there are also more specific problems that are not so easy to solve: for example, you have a particular type of cell and you have its 3D scans, and how can you classify the cell? In these more specialized domains, where you don't have that much data, how can you solve these discriminative problems? [00:37:12] And then there's joint modeling of 2D and 3D data, which is becoming more and more important, because for 2D data we have so much more: we have so many images and videos, and we have
very good foundation [00:37:20] models that were trained on them. So how can we leverage the priors in our 2D foundation models, like what an image looks like, how to make an image look realistic, how to make a video look realistic? How can we use that information to help our 3D reconstructions be more realistic? [00:37:34] Right. So: joint modeling of 2D and 3D data, because there are so many large-scale 2D datasets and very good pretrained models. There have also been a lot of advances in neural rendering, or differentiable rendering, methods that basically connect the 3D world and the 2D world: you have a 3D model, you can render it into 2D, and the rendering process can be made differentiable, or can even be approximated with neural networks. [00:37:54] Then you can connect all these data in different modalities through differentiable neural networks, which allows you
to bridge the priors you have in 2D data, or in 2D foundation models, into the 3D world. [00:38:06] Yeah. And then sometimes you even want to do joint multimodal modeling beyond visual data, including textual data, and sometimes you have other data too. Say, in robotics, you often have tactile data; how do you fuse that in as well? And for autonomous driving, you may have lidar data or depth data; how can you fuse those as well? So we want to use deep learning on 3D data to solve all these different problems. [00:38:28] Now, we spent all this time talking about representations, so where do we begin? As I suggested, the people who initially did this were computer vision people, and what do they do? They work on pixels, they work on images.
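As an aside, the occupancy-to-image direction of the differentiable rendering mentioned a moment ago can be sketched in a few lines. This is a toy example of my own, not any particular paper's renderer: a soft occupancy grid is "rendered" to a silhouette by compositing along each ray, and because every pixel is a smooth function of every voxel on its ray, a 2D image loss can push gradients back into the 3D representation.

```python
import numpy as np

def soft_silhouette(vox):
    """Differentiably 'render' a soft occupancy grid (D, H, W) to a
    silhouette image (H, W): a pixel is covered unless every voxel
    along its ray is empty."""
    return 1.0 - np.prod(1.0 - vox, axis=0)

def d_sil_d_voxel(vox, d, i, j):
    """Analytic d silhouette[i, j] / d vox[d, i, j]: the product of the
    transmittances (1 - occupancy) of the other voxels on the ray."""
    return np.prod(np.delete(1.0 - vox[:, i, j], d))

rng = np.random.default_rng(0)
vox = rng.uniform(0.0, 1.0, size=(4, 8, 8))   # 4 voxels deep, 8x8 image
sil = soft_silhouette(vox)

# Finite-difference check of the gradient: this differentiability is
# what lets an image-space loss update a 3D model.
eps = 1e-6
bumped = vox.copy()
bumped[1, 3, 3] += eps
fd = (soft_silhouette(bumped)[3, 3] - sil[3, 3]) / eps
print(abs(fd - d_sil_d_voxel(vox, 1, 3, 3)))   # tiny: analytic matches numeric
```

Real systems composite color and opacity along camera rays rather than straight grid columns, but the gradient-through-rendering principle is the same.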
[00:38:38] So naturally they said: why don't we start with voxels? But even before that, there is an older idea, the very first idea that people tried in applying deep learning to 3D vision, and in some sense it's now coming back. The idea is: let's not even worry about voxels. Say you have a 3D shape, whether it's a mesh, a voxel grid, whatever, and I want to learn to recognize what the object is; the object here is a chair. But if the input is 3D data, how can we process it before we have 3D deep learning methods? [00:39:06] What if I just render it into images, since I have very good image models? I take the 3D object, put cameras at different places, and render images of the object from different views, and now this becomes a 2D problem.
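A minimal sketch of that render-to-views recipe, with made-up pieces: `featurize` stands in for a shared, pretrained 2D backbone (here just a random linear map plus ReLU), twelve random arrays stand in for the renderings, and the ten "classes" are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W_feat = rng.normal(size=(32 * 32, 64))   # stand-in 2D backbone weights
W_cls = rng.normal(size=(64, 10))         # stand-in 10-way classifier head

def featurize(view):
    """One rendered view (32, 32) -> a 64-d feature (shared weights)."""
    return np.maximum(view.reshape(-1) @ W_feat, 0.0)

def classify_shape(views):
    """Featurize every view with the SAME network, max-pool across
    views into one object descriptor, then classify the descriptor."""
    pooled = np.stack([featurize(v) for v in views]).max(axis=0)
    return int(np.argmax(pooled @ W_cls))

views = [rng.uniform(size=(32, 32)) for _ in range(12)]  # fake renders
label = classify_shape(views)
# Max-pooling makes the prediction independent of view order.
print(label == classify_shape(views[::-1]))   # True
```

The pooling step is the design choice that turns a bag of per-view predictions into a single shape-level one, and it is what makes the pipeline indifferent to how many views you render or in what order.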
[00:39:22] I would just apply a convolutional neural network to each of these views, and I have some way of fusing them, right? Some pooling or whatever, and then I just do image classification. So this becomes an image classification problem, the only difference being that now you have multiple views. [00:39:38] This was one of the very first ideas people applied to 3D vision: just use 2D networks. And why would you want to use 2D networks? Because back then they were pretrained on ImageNet, and they were very good. ImageNet is much larger than 3D datasets, so any model pretrained on ImageNet had very good performance. So the easiest way to solve your 3D recognition problem was to first render into 2D. [00:39:59] Later, people sort of moved away from this, because people said: oh, we have more 3D
data, [00:40:05] so we should try to come up with 3D-native methods. And people also came up with ideas for connecting 3D and 2D through things like neural rendering. But now I feel this trend is coming back, because all these image and video models are getting so good. Many of you may have seen Veo 3 or whatever that was released yesterday, right? If they're so good, maybe we should rely a bit more on the image and video foundation models again, because they are trained on a thousand times, or tens of thousands of times, or maybe even a million times more data than 3D data. So [00:40:34] how can we incorporate that?
[00:40:38] But anyway, coming back: this was in some sense the very first method. People tried to apply deep learning to 3D data just by converting it into 2D, and it does very well on shape classification: you have shapes, you want to classify them into different categories, and these methods have very good performance. [00:40:55] And you can leverage a lot of the literature on 2D image models. But the issue is that you need some projection, and sometimes the input can be very noisy. People said: what if my input is too noisy? The point clouds or whatever are just not very good, and if I render them, they look kind of bad. [00:41:11] So is it possible to come up with more 3D-native methods? Later, people tried a number of 3D-native methods that apply deep learning directly on 3D data. As I said, the
easiest way [00:41:25] to do this is just to apply a 3D convolutional network to the voxels. So this is actually a deep belief network, which is a generative network, but still, you have 3D convolutional features. This is from 2015, by Princeton, and you can see that they learned a generative model that can actually synthesize 3D shapes, in the form of 3D voxels, at relatively low resolution. [00:41:50] But this is 10 years ago now, so back then this was considered pretty impressive. [00:41:56] And you can do all these conditional generations, conditioned on semantic labels: beds and desks and tables. You can synthesize these different shapes, and because this is a generative network, you can also use it for classification. So you can do 3D shape classification as well.
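To make "3D convolution on voxels" concrete, here is a hand-rolled valid-mode 3D convolution applied to an occupancy grid. This is a toy sketch, not the network from the paper being discussed; real models learn banks of such filters and stack many layers, ending in a classifier or a generative head.

```python
import numpy as np

def conv3d(vox, kernel):
    """Valid-mode 3D convolution (cross-correlation, as deep learning
    frameworks define it) of an occupancy grid with a single filter."""
    kd, kh, kw = kernel.shape
    D, H, W = vox.shape
    out = np.zeros((D - kd + 1, H - kh + 1, W - kw + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[z, y, x] = np.sum(vox[z:z+kd, y:y+kh, x:x+kw] * kernel)
    return out

# A 30^3 grid containing a solid 10^3 cube.
vox = np.zeros((30, 30, 30))
vox[10:20, 10:20, 10:20] = 1.0

# An averaging filter: responds with ~1.0 where its 3x3x3 window sits
# fully inside the cube, and 0.0 in empty space.
fmap = conv3d(vox, np.ones((3, 3, 3)) / 27.0)
print(fmap.shape)   # (28, 28, 28)
```

The cubic growth of the loops (and of the grid itself) is exactly the cost complaint about voxels that comes up a few paragraphs later.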
[00:42:16] And later, something that we actually did is: what if we just applied GANs in this generative setting? You can use GANs to generate 2D pixels; there's no reason you cannot use GANs to generate 3D voxels. So we did this very simple thing, which is to apply a GAN to 3D voxels, and it actually gives you pretty good generation of 3D objects. This was eight or nine years ago. Yeah. Okay. [00:42:39] And later, with Jun-Yan from CMU, we also did an extension: you can use GANs not only to generate 3D shapes, but you can also render them, project them onto 2D surfaces, so that you get the depth map of the 3D objects you generated, and then you can use a CycleGAN to convert this depth map into a color image. [00:43:03] Now you can have adversarial losses not only on the 3D shapes but also on the 2D pictures, right?
You want the 3D shapes to look realistic, so that they are indistinguishable from the 3D object data you have. You also want the 2D images to look realistic, so that they are indistinguishable from images of real cars. So then you can do 3D generation as well as 2D generation at the same time. [00:43:23] And because you have different latent vectors for the shape, for the viewpoint, and for the texture, you also have some level of controllability: you can change the viewpoint, you can change the textures, you can do interpolation, and you can transfer the texture of one car onto the shape of another car. This was 2018. [00:43:45] So people tried applying deep networks, neural networks, generative networks, to 3D voxels instead of 2D pixels. [00:43:52] And can we do a little better with voxels?
Because one thing people have complained about with voxels is that they're just really slow, right? You have to pre-sample them, and there is a lot of wasted effort, because many of the sample points are just empty space, or they're inside the object and give you no information. So naturally people thought: okay, can we actually make this better? So there are improvements to voxels, like octrees. With octrees you still have explicit representations; well, in some sense you can argue it's an implicit representation, but a nonparametric implicit representation.
But instead of representing every point in space at a uniform scale, you let the voxels be of different sizes. I divide the space into different regions, and I spend a lot more effort when I'm really close to the surface of an object, representing it at a much finer scale; and in empty space, or inside objects, where I really don't care too much about what's going on, I can have huge voxels. So you can recursively partition the space, with different sizes of voxels in different regions, and this allows you to really scale up. So you can see this compared with directly using voxels; this is 2019-ish.
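To make the adaptive-subdivision idea concrete, here is a minimal sketch (not any particular paper's implementation): a cube is split only when the surface might pass through it, tested against a hand-written signed-distance function for a sphere, so fine cells cluster near the surface while empty and interior regions stay coarse. The `sphere_sdf` shape and the split test are illustrative assumptions.

```python
import numpy as np

def sphere_sdf(center, radius=0.75):
    # Signed distance from a point to a sphere of the given radius at the origin.
    return np.linalg.norm(center) - radius

def build_octree(center, size, max_depth, sdf=sphere_sdf, depth=0):
    """Recursively subdivide a cube; return leaf cells as (center, size) pairs.

    A cell is split only if the surface might pass through it, i.e. the
    distance from the cell center to the surface is smaller than the cell's
    half-diagonal. Cells fully inside or outside the shape stay coarse.
    """
    half_diag = (np.sqrt(3) / 2) * size
    if depth == max_depth or abs(sdf(center)) > half_diag:
        return [(center, size)]
    leaves = []
    for dx in (-0.25, 0.25):
        for dy in (-0.25, 0.25):
            for dz in (-0.25, 0.25):
                child = center + size * np.array([dx, dy, dz])
                leaves += build_octree(child, size / 2, max_depth, sdf, depth + 1)
    return leaves

leaves = build_octree(np.zeros(3), size=2.0, max_depth=4)
uniform = (2 ** 4) ** 3  # a uniform 16^3 grid at the same finest resolution
print(len(leaves), "adaptive cells vs", uniform, "uniform cells")
```

The leaf count stays far below the uniform grid's cell count even though the finest cells have the same size, which is exactly the memory win described in the lecture.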
People said: okay, octrees are great, because they let me go beyond the resolution I can fit into GPU memory, right? You can do 64 x 64 x 64 with voxels, but with octrees you can do 256. And you can even use that for generation as well: you can generate objects that still look like voxels, but at a higher resolution, because you're more efficient at representing the space. So these were the very early attempts at applying deep learning to 3D: people said, okay, why don't we just try voxels?
[00:45:41] Then this is the moment where people got a bit more interested, where the graphics people felt: you're just doing all this wrong, right? Why do you want to use these pretty inefficient, ugly-looking representations like voxels or octrees?
Now we have all these good representations, point clouds, meshes, splines; why are we not using them? But as we said, the challenge is that the points are just scattered here and there, right? How can you even apply convolution to points? It's just not obvious. But people started to look into it, and naturally moved toward developing new deep learning methods that directly work not only on 3D data in general, but on these different types of 3D representations, like point clouds. And here I think PointNet is an important work, also from Stanford, from Leo's team. What's going on is that they developed a new type of deep network that works directly with 3D point clouds; it's called PointNet.
The idea is that for points, you have to be permutation invariant, right? Because say I have point 1 here and point 2 there, and then a different input where point 1 is there and point 2 is here. Your network should be invariant to these two inputs: no matter whether I name this one point 1 and that one point 2, or the other way around, your output should be the same, because the points are unordered. There's no guaranteed ordering in the sense that the top-left is (1, 1) and the bottom-right is (100, 100). So if the points are unordered, you have to be permutation invariant. How can we do that?
[00:47:22] And second, you also have to be invariant to sampling.
Sometimes you sample, say, ten points on the head of the bunny, the rabbit, and five points on its tail; and sometimes you sample ten points on the tail and only five points on the head. How can you also be invariant to that? Because there's no guarantee on how you sample the points. So there are a few issues here and there. But the one idea they used, and I think it's probably the most important point, is also so simple: I just apply a symmetric function to the embeddings of the points.
So basically, for all the points, I first compute some embeddings, just like you would compute embeddings for different regions or different windows of an image; I compute the features for each point, and then I just have to fuse them. But because I want to be permutation invariant, I use a symmetric function: for example, it can just be a max function, where I take the maximum in each dimension, or it can be a sum function, where I just add them up. So that's what's going on; it's so simple. You have a number of points, you compute embeddings for them, and you just aggregate them, say by taking the max in each dimension, or by summing them up, or something like that.
And then you have this aggregated embedding for all the points; you go through maybe a few fully connected layers, and then you use it to classify: are these points really representing a chair, or a table?
[00:48:55] So that's basically what's going on, and it turned out to be quite powerful. And of course there have been a lot of improvements on top of that. People have come up with new methods that improve on PointNet, there is PointNet++, and people have also tried things like graph neural networks, because you can easily translate points into nodes of a graph, with the proximity, whether two points are close to each other, giving the edges connecting these points.
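The symmetric-function trick above can be sketched in a few lines of NumPy. This is a toy stand-in for PointNet, not the real architecture: the per-point "embedding" here is a single random linear layer plus ReLU rather than a learned MLP, and the point is only to show that max-pooling makes the global feature independent of point order.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 16))  # toy per-point "embedding": one linear layer

def global_feature(points):
    """points: (N, 3) array. Embed each point independently, then max-pool.

    The max over points is a symmetric function, so any permutation of the
    input rows yields exactly the same global feature vector.
    """
    embeddings = np.maximum(points @ W, 0.0)  # per-point features (N, 16), ReLU
    return embeddings.max(axis=0)             # symmetric aggregation -> (16,)

pts = rng.standard_normal((100, 3))
f1 = global_feature(pts)
f2 = global_feature(pts[rng.permutation(100)])  # same points, shuffled order
print(np.allclose(f1, f2))  # True: permutation invariant
```

Swapping `.max(axis=0)` for `.sum(axis=0)` gives the sum-pooling variant mentioned in the lecture; both are symmetric in the points.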
So there have been graph neural networks and all these other methods developed for point cloud processing, but the original idea in the PointNet paper is so simple, and it turned out to be very powerful.
[00:49:37] Something else you want to consider is how you measure error. For pixels it's easy: I have an output image and a ground-truth image, and I just compute the difference between the two, an L2 loss or whatever. For points, how would you compare the output point cloud and the ground-truth point cloud? Especially if you care about a generation task; if you're doing classification, that's fine, you have an input point cloud, the output is chair, table, whatever, and a cross-entropy loss is all you need. But if you're doing a generation task and your output is voxels, that's also easy, right?
You just do a cross-entropy loss over the, you know, 100 x 100 x 100 voxel grid. But if your output is 100 points, how would you compare the output point cloud with the ground-truth point cloud? You also have to design distance metrics. So there are two common distance metrics that people use. One is called the Chamfer distance, and it is easy to understand: you have two sets of points, and for each point in each set you basically find its nearest neighbor in the other set. So you have a collection of red points and a collection of blue points; for each red point, you find its nearest neighbor in the blue set, and for each blue point, you find its nearest neighbor in the red set. And you want to minimize the distance of each point to its nearest neighbor in the other set.
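A minimal NumPy version of the Chamfer distance as just described, nearest neighbor in each direction, averaged over both sets. Conventions vary across papers and libraries (sum vs. mean, squared vs. unsquared distances); this sketch uses squared distances and means.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    For every point in a, take the squared distance to its nearest neighbor
    in b, and vice versa; average both directions and add them.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(1)
cloud = rng.standard_normal((50, 3))
print(chamfer_distance(cloud, cloud))        # 0.0: identical clouds
print(chamfer_distance(cloud, cloud + 0.5))  # positive for a shifted copy
```

Note the two sets need not have the same number of points, which is one reason Chamfer is the more commonly used of the two losses.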
And the second loss function that people may use is called the Earth Mover's Distance. Here you do a bipartite matching between the two sets of points, so you have a one-to-one pairing between the points, and you want to minimize the distance over all these pairs. So these are the two common metrics people use when comparing point clouds, and they can be made differentiable, which means you can compute gradients and use them to optimize your neural network, so that it hopefully outputs better point clouds, if you care about a point cloud generation problem.
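For very small point sets, the Earth Mover's Distance can be written down literally as the best one-to-one matching, found by brute force over permutations. This is only an illustration of the definition: brute force is factorial in the number of points, and practical implementations solve the assignment problem with the Hungarian algorithm (e.g. SciPy's `linear_sum_assignment`) or use approximations.

```python
import numpy as np
from itertools import permutations

def emd_bruteforce(a, b):
    """Earth Mover's Distance between equal-sized point sets a, b of shape (N, d).

    Search all one-to-one matchings (bipartite matchings) and return the
    minimum total pair distance. Only feasible for very small N.
    """
    n = len(a)
    best = np.inf
    for perm in permutations(range(n)):
        cost = sum(np.linalg.norm(a[i] - b[j]) for i, j in enumerate(perm))
        best = min(best, cost)
    return best

a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]])  # same points, reordered
print(emd_bruteforce(a, b))  # 0.0: a perfect one-to-one matching exists
```

Unlike Chamfer, EMD requires the two sets to have the same size, since every point must be matched exactly once.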
[00:51:18] So we have moved from voxels to point clouds, and people said: okay, this is great, now I can process points and output points. But we also have other beautiful representations, like splines; they're very good at capturing the surfaces of objects. If you use any kind of neural network to generate voxels or point clouds, the results always look very ugly, right? They don't have smooth surfaces and things like that. So how can we have a neural network that can output, or understand, objects while also representing these beautiful surfaces?
[00:51:48] So people went a bit further and thought about how to integrate neural networks with things like splines, or functions like that. And a notable example here is a thing called AtlasNet.
So what's going on here is that they use deep learning, but instead of directly outputting a set of 3D points, they learn a transformation function. I have a latent shape representation, and, if you remember, with these parametric representations of object shapes you're basically transforming, say, a 2D space of u and v into 3D, like a sphere. For simple things like a sphere it's easy; you can write the function down, sines and cosines and whatever. But for complex objects it is very hard to write such a function, and often there is no closed form.
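For the sphere case mentioned here, the parametric map f(u, v) → (x, y, z) really can be written in closed form, and this is exactly the function signature an AtlasNet-style MLP would learn for shapes where no such formula exists. A quick NumPy sketch of the closed-form case:

```python
import numpy as np

def sphere_chart(u, v, radius=1.0):
    """Closed-form parametric map from (u, v) in [0, 1]^2 onto a sphere.

    u sweeps the azimuth, v the polar angle. An AtlasNet-style MLP learns a
    function with this same signature when no formula can be written down.
    """
    theta = 2.0 * np.pi * u  # azimuth
    phi = np.pi * v          # polar angle
    return np.stack([radius * np.sin(phi) * np.cos(theta),
                     radius * np.sin(phi) * np.sin(theta),
                     radius * np.cos(phi)], axis=-1)

# Sample the 2D parameter space and push it through the map: every output
# point lands exactly on the unit sphere's surface.
u, v = np.random.default_rng(2).random((2, 1000))
points = sphere_chart(u, v)
print(np.allclose(np.linalg.norm(points, axis=-1), 1.0))  # True
```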
So the idea here is: okay, if there's no closed form, then why don't we just use a neural network to represent it? So here you can see this neural network, implemented as an MLP, just learns that function f: you take the two values u and v as the input to f, the network performs the computation of f on u and v, and it outputs a point in 3D space. So it's basically learning how to transform this 2D space into the 3D one. And it might be too hard to represent the entire object with a single transformation, so people thought: okay, we can use a couple of small neural networks.
So think about it as having a piece of paper that you can fold in different ways, multiple times, and all these pieces get put together to form the final shape you care about. So here you can see the differences between these three representations. You have an input image, and if you reconstruct it using voxels, you can see it's doing something, but you're really bounded by the limited resolution of the voxels. For point clouds, you're no longer bounded by the resolution, and they give maybe a bit more detail, but points are really unordered.
You cannot really get smooth surfaces out of point clouds. And for this thing called AtlasNet, which is basically learning transformed pieces, you can see they actually have smoother surfaces. So you use a neural network to represent how to map a parametric representation from a lower-dimensional space to a higher-dimensional space, you learn multiple of these mappings, and when they're combined, that gives you the final output geometry, conditioned on the 2D images.
[00:54:26] Okay. So finally, in some sense we can put it this way: what is a deep network doing when it does image classification? It's basically learning a very complex function that maps input images, in the form of pixels, into a final category label: is it a cat, or a dog, or a person, or whatever. That function is really complex, and the output space is
really small. The output space is like 1,000 dimensions, right? Is it a cat or a dog; you have 1,000-way classification. The output space is so small, while the input space is much larger, because you have, you know, 500 by 500 pixels, so that's 250,000 or something. The input space is much larger, the output space is really small, and the function is really hard to write. Is it possible for me to write down some formula so that I can classify the input image, by computing some specific values, and output whether this is a cat or a dog?
I cannot do that; the function is so hard to write, there's no closed form, and that's why you need a deep network: the input space is large, the output space is small. So if you really think about deep networks that way, and think about what they are doing, then you realize that a lot of the things we have been doing with deep networks on 3D shapes don't seem to map that well onto that equation. So what are the representations that map onto it best, the optimal representations that really fit into this paradigm?
[00:55:47] And if we think more carefully, around 2019 people realized: in some sense a deep network is an implicit function, so why don't we just use it to represent an implicit function for an object's 3D geometry? So instead of representing it with
voxels, where you just take pixels, scale up to 3D, and apply 3D convolution, note that fundamentally a voxel is really about whether a location is inside or outside the object. So instead of pre-querying the space to get the voxels and applying convolution on top of them, what if I just directly use the deep network to perform that query for me, so I don't have to run 3D convolution or anything? I just query a point in 3D space, and the deep network tells me; the output can just be one-dimensional, inside or outside, whether that point is inside the 3D shape or outside it.
[00:56:39] So finally, I think people took that leap, going from explicit representations, point clouds or splines, to implicit representations, but not
[00:56:49] directly working on voxels; instead, think of it as a level set, or some implicit function, that a deep network is used to represent. That's the final step: going from the atlas idea, where you learn a transformation from a 2D space into the 3D space, to directly doing implicit queries with the deep network. So that brings us to deep implicit functions, which is kind of interesting, because around that time, in 2019, there were something like four papers doing almost exactly the same thing. They all argued: before, we had been using voxels and point clouds and meshes, each with their own strengths and weaknesses, but really the right thing to do is just send the query into the deep network. What the deep network should do is take the input, let's say an xyz
[00:57:35] coordinate, and output whether that point is inside or outside the object. That is, in some sense, one of the final steps, and that's the idea that was proposed in 2019. Even right now, in 2025, a lot of people are still using this same idea: I'll just use a deep network to tell me whether a point is inside or outside the object. And you can go a little beyond a binary inside/outside classification, because maybe you care about a bit more: what is the signed distance function, how far is the point from the surface of the object; or what is the density value at that point; or, later, what is the color, what are the radiance values at that point. But starting from 2019, people really started to apply deep networks in a way similar to classification: take points in 3D space, and use the network as an implicit function to query some properties of those points. [00:58:29] And people have tried to use this to represent a collection of implicit functions: not just learning how patches deform into 3D space to give you different pieces of paper in 3D, but really representing implicit parts of objects with small neural networks, which can then compose into complex shapes. And if you can represent objects in 3D using implicit functions, you can do that not only for geometry: you can query not only whether a point is inside or outside the object, or how far the point is from the surface; you can also query the radiance, the color of the object.
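The query interface described here, send an xyz coordinate in and get inside/outside (or a signed distance) out, can be sketched in a few lines. In the sketch below an analytic unit-sphere signed distance function stands in for the trained network so the numbers are checkable; in the 2019-era methods the lecture refers to, this function would be a learned MLP. The names (`sdf_sphere`, `occupancy`) are illustrative, not from any particular paper.

```python
import numpy as np

def sdf_sphere(points, radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points, axis=-1) - radius

def occupancy(points, sdf=sdf_sphere):
    """Binary inside/outside query: the one-dimensional output the lecture mentions."""
    return (sdf(points) < 0.0).astype(np.float32)

# Query a few 3D points, exactly as you would query the deep network.
pts = np.array([[0.0, 0.0, 0.0],    # center: inside
                [2.0, 0.0, 0.0],    # outside
                [0.5, 0.5, 0.5]])   # inside (|p| ~ 0.87 < 1)
print(occupancy(pts))        # -> [1. 0. 1.]
print(sdf_sphere(pts))       # distances to the surface
```

Swapping `sdf_sphere` for a trained network changes nothing about the calling code, which is exactly the appeal of the implicit formulation.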
[00:59:07] And then, actually about one or two years later, people came up with this thing called NeRF. The difference here is that now you use the deep network to query not only the signed distance function or the density of the object, but also the radiance. So here you can see what's going on: you query NeRF with an xyz coordinate in 3D space, and in addition, because you're trying to model appearance as well, you also query the viewing direction, right? The camera viewing direction. And the output of the neural network is not just one or zero, inside or outside; it is the density value together with the color values, the radiance. Now, if you directly train implicit functions on 3D shapes, you require 3D supervision, right?
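A minimal sketch of the NeRF-style query just described: a position plus a viewing direction goes in, a density and an RGB radiance come out. The tiny random-weight MLP below only illustrates the input/output interface; the real NeRF model additionally uses positional encodings and a much deeper network, and all names and sizes here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Untrained toy weights: 6-D input (position + view direction) -> 4-D output.
W1 = rng.normal(0, 0.1, (6, 32))   # hidden layer
W2 = rng.normal(0, 0.1, (32, 4))   # -> [sigma, r, g, b]

def query_field(xyz, view_dir):
    """Query the radiance field at a 3D point from a given viewing direction."""
    x = np.concatenate([xyz, view_dir], axis=-1)
    h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
    out = h @ W2
    sigma = np.log1p(np.exp(out[..., 0]))       # softplus keeps density >= 0
    rgb = 1.0 / (1.0 + np.exp(-out[..., 1:]))   # sigmoid keeps colors in [0, 1]
    return sigma, rgb

sigma, rgb = query_field(np.array([0.1, 0.2, 0.3]),
                         np.array([0.0, 0.0, 1.0]))
print(sigma, rgb)   # a nonnegative density and a 3-vector color
```

The view direction in the input is what lets the model produce view-dependent effects like specular highlights.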
[00:59:55] So if you have a collection of 3D objects, you can use them as supervision: you have ground truth about whether a point is inside or outside a 3D object. But here you want to train on 2D images, and that's what's going on in NeRF. They put this together with a neural volume rendering function, and they made this volume rendering function differentiable, in the sense that you have a rendering model where you can query all these different points in 3D space, get their colors and also their densities and appearances, and then compute how much light is blocked along the way, right? So this is basically volume rendering, as in computer graphics.
[01:00:34] There were only very minimal changes made, because you can see, directly from the volume rendering equations, that everything here, although it is an approximation, is differentiable. So if the neural network gives you the density, which you can basically think of as the opacity of a point in 3D space, and also gives you the color, then you can compute how much light has been blocked by the points sampled ahead of that point along the ray, and also how much light any particular point contributes to what I'm going to see along this ray. So now you have a few things: a neural network representing implicit functions for colors, or radiance, and for densities, and [01:01:18] these volume rendering equations, which are made differentiable so that you can learn directly from 2D images. So these are the two things that changed. One: I no longer have to train on 3D shapes; I can train on 2D images through these volume rendering equations. And two: instead of looking only at the geometry or density of objects in 3D, I also look at their radiance, their appearance in 3D. These two changes lead to the big jump from implicit functions, DeepSDF, and all those earlier methods, to NeRF. So a lot of people feel like NeRF has been great and seemed to come out of nowhere. That's really not the case: if you look at the articles the authors later wrote themselves, they were very much inspired by all these advances in deep implicit functions.
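The two ingredients above, a field you can query and a differentiable rendering equation, can be written out concretely. Below is the standard discretized volume compositing sum used by NeRF-style methods: per-sample opacity alpha_i = 1 - exp(-sigma_i * delta_i), transmittance T_i = prod_{j<i}(1 - alpha_j), and pixel color sum_i T_i * alpha_i * c_i. The sample values are made up for illustration, and `composite` is a hypothetical helper name.

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """Composite colors along one ray; every step is differentiable."""
    alphas = 1.0 - np.exp(-sigmas * deltas)          # per-segment opacity
    trans = np.cumprod(1.0 - alphas)                 # light surviving past sample i
    trans = np.concatenate([[1.0], trans[:-1]])      # nothing blocks the first sample
    weights = trans * alphas                         # contribution of each sample
    return weights @ colors                          # expected ray color

sigmas = np.array([0.0, 5.0, 50.0])     # empty space, semi-dense, near-solid
colors = np.array([[1.0, 0.0, 0.0],     # red (never seen: zero density)
                   [0.0, 1.0, 0.0],     # green
                   [0.0, 0.0, 1.0]])    # blue
deltas = np.array([0.1, 0.1, 0.1])      # spacing between samples along the ray
print(composite(sigmas, colors, deltas))
```

Because the output is a smooth function of the densities and colors, the photometric loss against a 2D image backpropagates straight into the network that produced them, which is the whole trick that lets NeRF dispense with 3D supervision.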
[01:02:09] Those earlier methods focused only on geometry; NeRF does both geometry and appearance, and it learns from 2D images instead of from 3D shapes. So, yeah, here are some results of NeRF; you may have seen these many times. [01:02:24] Okay, so if you remember, we said that in the past we had been working on things like generating 3D shapes and then also generating their 2D appearances. At the very beginning we used voxels as the representation. But now, as we said, NeRF is great, and if we have implicit representations there's no need to represent things as voxels: what if we just replace that with a radiance field, right? So we did that as well.
[01:02:53] So we have a neural network that captures the implicit radiance field and densities, but it is a generative neural network, and you can still apply the same GAN framework, so that you can render objects in 3D as well as their 2D pictures. And you get the same controllability: you can change the camera viewpoint, or change the object identity while keeping the viewpoint; you can do all these things as before, but now, with NeRF, you can learn directly from images. [01:03:19] So you don't have to restrict yourself to categories like cars or chairs, where you have a lot of 3D data, because you can learn directly from images. And you can see that the output becomes much more realistic. This is the work we did called pi-GAN, with Eric Chan as first author, mostly with people from Gordon's group. [01:03:42] Okay. And finally: NeRF is great, but NeRF has this issue that you have to sample a lot of points in 3D. You're no longer pre-sampling them and applying a volumetric convolution, but still, just as with a level set, you have to sample all the points and query the network all the time. You can learn from 2D, you can do all these great things, but because you still have to do all this sampling, it's very slow. So people thought about it a bit
[01:04:12] more, again drawing on the graphics community: we have these good ideas about points and meshes, and the nice thing about them is that they are free in space and very efficient. So is it possible to integrate them? Can I have implicit representations, but without a fixed sampling grid, without sampling everywhere all the time, since that takes so much time? Maybe I really should put the two together. You can argue that NeRF parameterizes the scene very densely: you have to sample points densely in 3D, and a lot of samples are wasted, just like in voxels, where many cells represent empty space. You don't want that. In NeRF, a lot of the queries go into empty space, where the network may just give you a density of zero or something like that, but it is still taking a lot of time. So how can we address that? [01:04:57] What if I just sample more sparsely? I still have the implicit representation, but instead of sampling empty space all the time, I only sample at places where I know there is stuff. But how can I know that? What if I had a point representation? This is the idea behind this thing called Gaussian splats, which you may have heard of. It still has the same implicit functions, querying for densities, appearance, and so on, but instead of querying a neural network all the time, I have a point representation: these 3D Gaussian blobs in 3D space, which you can sometimes think of as a point cloud, except the points are not single points; they're like blobs,
[01:05:41] like little regions. And because you know where these blobs are, when you send out a ray from your camera into 3D space and sample points, you don't have to sample everywhere. You just look at where the blobs are, and based on the radii of these different Gaussians, you sample only at regions where you know there is some stuff. So this makes rendering much more efficient. [01:06:04] And here are some reconstruction results using 3D Gaussian splats.
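The culling idea described here, sample only where a blob says there is stuff, can be sketched as follows. This toy version only does the ray/blob proximity test; real 3D Gaussian splatting stores anisotropic covariances, opacities, and colors and rasterizes the blobs, and every name and number below is illustrative.

```python
import numpy as np

def blobs_near_ray(origin, direction, centers, scales, k=3.0):
    """Return indices of Gaussians whose center lies within k*scale of the ray."""
    d = direction / np.linalg.norm(direction)
    rel = centers - origin                    # vectors from ray origin to blob centers
    t = rel @ d                               # projection of each center along the ray
    closest = origin + np.outer(t, d)         # closest point on the ray to each center
    dist = np.linalg.norm(centers - closest, axis=1)
    return np.where((t > 0) & (dist < k * scales))[0]

centers = np.array([[0.0, 0.0, 5.0],    # on the ray
                    [0.0, 3.0, 5.0],    # off to the side
                    [0.0, 0.1, 9.0]])   # near the ray, farther away
scales = np.array([0.2, 0.2, 0.2])
hit = blobs_near_ray(np.zeros(3), np.array([0.0, 0.0, 1.0]), centers, scales)
print(hit)   # -> [0 2]: only the blobs the ray actually passes through
```

Only the returned blobs get composited, which is why the explicit point representation sidesteps the dense per-ray network queries that make NeRF slow.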
[01:06:14] And you can see that, in terms of quality, they are comparable to NeRFs. These are different metrics, PSNR and SSIM, for rendering quality; the y-axis doesn't start from zero, so the plot is a little misleading, but basically you can see these numbers are really close. So in terms of rendering quality, Gaussian splats and NeRFs are similar, at least as first proposed, but Gaussian splats are just much more efficient, right? This is FPS, frames per second: you can render about 150 pictures per second, while for NeRF it takes maybe 20 seconds to render a single picture. So this thing is roughly 1,000 times faster, at least that's what they argued, because you no longer waste all your computing power querying your network about points that are simply in empty space. [01:07:11] Okay. So that is basically how deep learning has been integrated with 3D data across all these different representations: how they got started, how they have evolved, and their connections with all these different shape representations. And one thing we didn't talk about, which I'll take just two minutes to quickly cover: there have also been interesting ideas about object geometry concerning not only the element-level geometry, the specific details of the parts, but also the structure, because often, for example, chairs are symmetric, right? We talked a little bit about this with parametric surfaces, where you can
[01:07:47] parameterize part of a surface using something like a sphere, with closed-form equations, and that gives you a little bit of symmetry. But there have also been more systematic studies of the regularities and structures within object geometry, including repetitions and symmetries, and people have come up with different representations for these as well. In some sense, you can argue that point clouds, meshes, and implicit functions really represent geometric detail, maybe for the individual parts; none of them directly captures regularities like symmetry or repetition. So how can we capture that? A few other attempts people have explored, mostly in the graphics community: represent an object basically as a collection of simple geometric parts, [01:08:34] like a part set. And there have been methods that apply deep learning to this, using deep networks to represent different parts of an object with simple geometric primitives and then composing them, or using implicit functions and composing those, as we talked about before. But there have also been attempts to do a bit more: not just representing an object as a collection of parts without considering their relationships, but also modeling the relationships between the parts. This is even more the case for scenes: say, a bed is usually next to the wall, chairs are usually next to tables, and so on. So you not only want to represent them as an unrelated collection of parts or objects; you also want to capture their relationships, in hierarchies.
you know when you are constructing when you're building uh [01:09:19] constructing when you're building uh you're doing some constructions you're [01:09:20] you're doing some constructions you're architecture uh you're architecting [01:09:22] architecture uh you're architecting design your building um then you of [01:09:25] design your building um then you of course you know you're not like just [01:09:27] course you know you're not like just like representing objects or their [01:09:28] like representing objects or their relationships you have to consider [01:09:30] relationships you have to consider hierarchies what you build first uh [01:09:32] hierarchies what you build first uh there's a classroom and the classroom [01:09:33] there's a classroom and the classroom has you know there's some tables and [01:09:35] has you know there's some tables and chairs in it and chairs has parts [01:09:36] chairs in it and chairs has parts there's basically like a kind of level [01:09:38] there's basically like a kind of level hierarchy and how this can be used and [01:09:40] hierarchy and how this can be used and integrated with neuronet networks as [01:09:42] integrated with neuronet networks as well as you know you have not only [01:09:43] well as you know you have not only hierarchy but also you can you can [01:09:44] hierarchy but also you can you can compose hierarchies and relationships [01:09:46] compose hierarchies and relationships Right? So you have a hierarchal graph [01:09:48] Right? 
[01:09:48] So you have a hierarchical graph where, say for chairs, you have different levels of hierarchy for bases, for seats, for backs, and the bases may have different legs. But the legs themselves are also related, right? The left leg of the chair and the right leg of the chair are supposed to be symmetric; they should have identical shapes. There are constraints on where these legs are: they have to be really aligned, otherwise the chair is going to fall. [01:10:10] So there are all these constraints that are pretty useful, and the question is how we could represent them. People have come up with all these different representations, and for each of them there are also a lot of neural network, deep learning, methods designed to learn, to capture, and to generate objects that satisfy all these constraints.
[01:10:28] For example, you can see this is a kind of hierarchical graph encoder and decoder that tries to represent and generate 3D chairs that satisfy all these constraints while maintaining their hierarchies. Right? I think this is also from the ODUS group, from 2019. [01:10:43] And sometimes we can even represent shapes using some form of program, right, because there are repetitions and for loops, and we can ask how this can be incorporated into neural networks that generate programs which synthesize object shapes and synthesize the relations between these object parts.
[01:11:00] And that's also an important topic. Most recently, let me end by saying that I think there has been a new trend just in the past year or two: deep networks, large language models, are doing so well, and they understand things so well, that people are exploring whether it's possible to just use large language models like GPT to output these programs, because they understand the semantics, what the chair should be like, what constraints the chair should satisfy. So is it possible to use a large language model to output the programs, but then maybe use implicit functions or whatever to capture the specific geometric details of the parts of the objects, like the chairs? [01:11:38] So there's a kind of new emerging trend of research happening right now. Okay, I think that's all I have. Thank you.

================================================================================ LECTURE 016
================================================================================ Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 16: Vision and Language Source: https://www.youtube.com/watch?v=mQOK0Mfyrkk --- Transcript [00:00:05] Thank you everyone for coming. We have another guest lecture, and today we have Ranjay Krishna. Ranjay Krishna is an assistant professor at the school of computer science and engineering at the University of Washington, and he co-directs the RAIVN lab. He has taught previous iterations of CS231N, in 2020 and 2021, and his research lies at the intersection of computer vision, natural language processing, robotics, and human-computer interaction. In today's lecture he will discuss multimodal foundation models. Ranjay, the floor is yours. [00:00:39] Thank you. It's great to be back. The first time I ever taught this course here at Stanford, it was 2020, and we had about three weeks where we had to take all the material and move it online.
[00:00:51] Every year after that has been much easier to teach. It's great to be back. So today we're going to talk about multimodal foundation models. [00:01:00] A lot of the lectures in this class so far have really been focused on building individual models for individual tasks. These usually follow a few steps that you've seen over and over again in lectures. You collect a dataset, usually a training set as well as a test set. Then you train a very specialized model for that purpose; that could be an image classification model or an image captioning model, like the ones you've seen in your assignments. And then you finally evaluate those models on your test set.
[00:01:27] Now, what's been different in the field in the last couple of years is this shift away from these individual models toward building foundation models. The way to think about foundation models is that you pre-train a model on a wide variety of skills, a wide variety of different tasks, and then later adapt it to individual tasks depending on your needs. [00:01:53] For example, one very common foundation model that you all probably use in some form or another is GPT. GPT was trained on a lot of Common Crawl data from the internet, and then you take that model and fine-tune it for different purposes: for math problems, or symbolic reasoning, or trivia questions. All of these are individual tasks that this model can quickly adapt to.
[00:02:15] Now, what's nice about foundation models is that they let you do that update step, the adaptation to new tasks, with very minimal data. Meaning, you don't need to collect a large amount of training data; you can usually get away with very little. Oftentimes you can even get away with collecting no training data at all. [00:02:33] And when you think about foundation models, there are many different classes of foundation models you might care about. In language, you've got ELMo and BERT, which really started this entire revolution, and we now have GPT and T5 and variants of these models. These are things we're not going to talk about in this class, since we're mostly going to be talking about multimodal models.
[00:02:54] What we will talk about is how you build these same kinds of foundation models for image classification, and we'll go into examples like CLIP and CoCa today. We'll also talk about how you combine language models, which you might have seen already in class, with these vision foundation models to enable all kinds of new multimodal foundation models that can solve a wide variety of tasks. [00:03:16] And of course we can do a lot more than just solve tasks in language. We'll talk about how you can build models that output not just text but also masks, or images that you might want to generate. And then finally, we'll talk about this idea of chaining, where you take a bunch of foundation models and combine them to do all kinds of new things together. [00:03:37] Now, when we talk about foundation models, there are many different ways to classify them.
[00:03:41] It's hard, because the definition is often disagreed upon, but what you typically see in a foundation model is that it's robust and general across many different tasks. You can apply the same model to all different use cases, and I'll show you a ton of use cases today. [00:03:58] Something else that's common in a lot of these foundation models is that they have large numbers of parameters and large amounts of training data, and usually they're trained with some sort of self-supervised objective. [00:04:11] So of course we're not going to talk about the language stuff; what we will talk about are the ones in green today. And so let's get started with image classification.
[00:04:17] So how do we actually go about building a foundation model that can solve image classification for any dataset you might care about? Now, if you remember from a few lectures ago, we were talking about self-supervised learning, and one of the methods you saw was SimCLR, where you have this contrastive objective that pushes apart dissimilar images and pulls closer representations of the same image that has been transformed in some way or the other. [00:04:47] You can think of this idea as pulling together similar concepts: different augmentations of a cat should result in representations that are similar to one another, but it should push away representations of other categories, like dogs, for example.
[00:05:04] Now, the hope with training with these self-supervised learning objectives is that the representations become general enough, right? So that when you see something new, maybe a sketch of a cat or a sketch of a dog, it still embeds those in the space such that it's easy to classify exactly what those concepts are. [00:05:21] Moving on to multimodal, we can take these same ideas, the same objective, and start thinking about what would happen if we added text to that representation space. For example, if we could also embed a representation of the text "a cute fluffy cat" and have it be close to the cat representations, that would be great, because now we can query things in both images as well as text.
[00:05:43] And similarly, if we can also embed the phrase "my favorite dog is a golden retriever," ideally that representation would lie closer to golden retrievers than to other kinds of dogs. So that's the general idea behind adapting the self-supervised learning objectives we've been talking about in class so far to incorporate text and other multimodal inputs. [00:06:05] In SimCLR, if you remember, the main objective was to pull together transformations of the same image. So the cat should be closest to its other cat augmentation; the green arrow right there indicates two things that should be pulled together, and it should be further away from all the other augmentations. Any other image of a dog or a monkey, you want those representations to be far away. [00:06:29] Now we can use that same idea and think about training a CLIP model.
[00:06:34] In CLIP, they still have that same image encoder that you have on the left-hand side, but on the right-hand side you now have a text encoder, and this text encoder embeds descriptions of those individual images. So your dog image will now hopefully learn that it should be closer to the representation of the text "my favorite dog is a golden retriever" and far away from all the other representations. [00:07:00] And because this is the same formulation that you've seen with SimCLR, you train a model like this just by collecting a lot of image-text pairs. Once you have those pairs, feed them into the model in a mini-batch, and apply the contrastive objective that we used for SimCLR, but now across images and text.
[00:07:20] So we're pulling together, here in the numerator, the representations of similar things, and pulling apart, in the denominator, the representations of everything else. Now, of course, we want each image to be closest to its corresponding text and far away from all the other text. But we also want the inverse to be true, so we have a second objective that says every text should be closest to its image and further away from all the other images. Right? So it's a complementary, symmetric loss between the two different modalities that you're feeding into this learning objective. [00:07:55] Okay? Now, what's really nice about a CLIP-like model is that it can be trained with just associations of images and text, and there's a ton of this data on the internet.
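That symmetric objective is compact enough to sketch. Below is a minimal NumPy illustration of the idea (my own sketch, not OpenAI's implementation): the diagonal of the batch similarity matrix holds the matching pairs, and we average a cross-entropy over rows (image-to-text) with one over columns (text-to-image).

```python
import numpy as np

def softmax_rows(x):
    """Numerically stable softmax over each row."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss for a mini-batch of N matching
    image/text embedding pairs (row i of img_emb pairs with row i of txt_emb)."""
    # Normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (N, N); diagonal = matching pairs
    diag = np.arange(len(logits))
    # Each image should "pick" its own text (rows) ...
    loss_img = -np.log(softmax_rows(logits)[diag, diag])
    # ... and each text should pick its own image (columns).
    loss_txt = -np.log(softmax_rows(logits.T)[diag, diag])
    return (loss_img.mean() + loss_txt.mean()) / 2
```

An aligned batch scores a much lower loss than a shuffled one, which is exactly the pressure that pulls matching pairs together and pushes everything else apart.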
[00:08:06] So you have a lot of data of corresponding images and text that you can pull from the internet; you can download it and train this model at a very, very large scale. And this is exactly what OpenAI did a couple of years ago, in 2021, when they released their CLIP model. [00:08:21] They collected a lot of that data and trained using this contrastive objective, with all of the image-text pairs they found on the internet. Once they were done training, you follow the same two-step pipeline that you saw in the self-supervised learning class: in step one you do the pre-training, and in step two you take that image encoder and adapt it to a new task.
[00:08:44] So once you have this pre-trained image encoder, you take its weights and tag on an additional linear layer on top to adapt it to an image classification task or a detection task, or you can put in something like a decoder and even decode out semantic segmentation maps. Right? So a ton of different tasks become possible just by initializing your model from this pre-trained objective. [00:09:07] What was really exciting when this paper came out is that adding this one linear classifier on top of the CLIP encoder led to really large improvements in performance.
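That adaptation step, freezing the encoder and fitting a single linear layer on its output features, is small enough to sketch in plain NumPy. Here `feats` stands in for the frozen CLIP image-encoder outputs (hypothetical data; the probe itself is just softmax regression):

```python
import numpy as np

def train_linear_probe(feats, labels, n_classes, lr=0.5, steps=300):
    """Fit one linear layer (softmax regression) on frozen features.
    feats: (N, D) encoder outputs; labels: (N,) integer class ids."""
    n, d = feats.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / n          # softmax cross-entropy gradient
        W -= lr * (feats.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_predict(feats, W, b):
    """Class prediction from frozen features plus the learned layer."""
    return (feats @ W + b).argmax(axis=1)
```

Only `W` and `b` are updated; the encoder that produced `feats` never changes, which is what makes this adaptation so cheap.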
[00:09:19] So here in this graph I'm showing you average performance across many different image classification datasets, and the CLIP models, the ones in red, are all the way at the top. You can see that as you train on more and more images, you end up getting better and better performance. [00:09:35] It was very exciting, because it seemed to indicate that there's this really nice pre-training objective that we've been able to unlock, and there's an abundance of image-text data on the internet, which means we can train these models to be very, very large and very, very performant. [00:09:50] Of course, that's not the end of the story. Ideally, we don't want to have to adapt these features for something new; we would want to use a CLIP model out of the box. [00:10:02] In language models, for example, you usually train a model to autocomplete, and this autocompletion
And this autocomp completion kind of works like this. You have a [00:10:07] kind of works like this. You have a phrase that says I love and then your [00:10:10] phrase that says I love and then your model sort of fills in the next word. [00:10:12] model sort of fills in the next word. For example, cake. And then you train [00:10:14] For example, cake. And then you train with this pre-training objective. And [00:10:16] with this pre-training objective. And what you want to do during the second [00:10:17] what you want to do during the second stage is to basically take that same [00:10:20] stage is to basically take that same model and adapt it to a new task. For [00:10:22] model and adapt it to a new task. For language models, you never have to [00:10:24] language models, you never have to retrain that model. You never have to [00:10:26] retrain that model. You never have to retrain it on a new downstream task. [00:10:28] retrain it on a new downstream task. Every task is a language task. And so [00:10:30] Every task is a language task. And so every task can be treated as this sort [00:10:32] every task can be treated as this sort of autocomplete uh process. But with [00:10:35] of autocomplete uh process. But with clip the problem is there is no [00:10:37] clip the problem is there is no autocomplete process. Right? So we've [00:10:39] autocomplete process. Right? So we've trained this model on this contrastive [00:10:41] trained this model on this contrastive objective. But to adapt it to a new [00:10:43] objective. But to adapt it to a new task, we still need training data and we [00:10:46] task, we still need training data and we still need uh a linear layer on top that [00:10:49] still need uh a linear layer on top that we need to train to adapt it to new [00:10:50] we need to train to adapt it to new tasks. So a lot of people started [00:10:52] tasks. 
So a lot of people started [00:10:52] thinking about what we can do to adapt this model so it can be used directly out of the box. And there's this clever trick that people came up with, and this clever trick is basically using the text encoder as a way of guiding the model to generalize to any downstream classification task. [00:11:14] And it works like this. Let's say you want to classify what an image is using a CLIP model, but you don't want to retrain this model or adapt it for any downstream task. What you can do is [00:11:24] take the text encoder, pass a word through that text encoder to create a text vector, and use nearest neighbors to figure out the right classification. [00:11:37] So the way this works is: you take all the categories in your new data set. So for example, let's say your new data set contains the categories plane, dog, and bird.
You're going to [00:11:45] embed all of them in the text space to get a vector for plane, a vector for dog, and a vector for bird. And now, when a new image comes in, all you have to do is embed that image using the image encoder and then find the closest neighbor. [00:12:01] So in this case, you should find that this image has the highest similarity with the correct class, in this case the dog vector. And you can see the dog vector does have the highest similarity score, and so because of that, you can now classify that image as a dog. Okay. [00:12:18] Now, you can think of this entire process as essentially building a one-nearest-neighbor algorithm, right?
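[Editor's note] The one-nearest-neighbor procedure just described can be sketched in a few lines of numpy. The 2-D vectors below are made-up stand-ins for the embeddings a real CLIP image/text encoder would produce; only the similarity-and-argmax logic is the point here.

```python
import numpy as np

def zero_shot_classify(image_vec, text_vecs, labels):
    """Classify an image embedding by nearest neighbor (cosine similarity)
    against one text embedding per class name."""
    # Normalize so that a dot product equals cosine similarity,
    # as CLIP does before comparing embeddings.
    img = image_vec / np.linalg.norm(image_vec)
    txt = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    sims = txt @ img                       # (k,) similarity scores
    return labels[int(np.argmax(sims))], sims

# Toy stand-in embeddings (a real system would use CLIP's encoders):
labels = ["plane", "dog", "bird"]
text_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
image_vec = np.array([0.1, 0.9])           # closest to the "dog" vector
pred, sims = zero_shot_classify(image_vec, text_vecs, labels)
print(pred)  # dog
```

Note that nothing here is trained: classification is pure retrieval against the embedded class names.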
So you [00:12:25] have a bunch of centers, or embeddings, that you've generated in the text space, and now you can use them as your class category labels, and you're doing one-nearest-neighbor to find the optimal classification for any new image that comes in. [00:12:40] Now, of course, a single word might not be sufficient to get a really good word vector. Instead, what you might want to do is use a phrase. And the reason you might want to do this is that a lot of the internet data [00:12:54] usually doesn't have words that occur by themselves; CLIP was trained on phrases that were downloaded from the internet. And so, ideally, you want to pick the right phrase that gives you the best representation. [00:13:07] So instead of just having the categories plane, dog, and bird, you might instead want to embed a vector that represents "a photo of a plane", "a photo of a dog".
And it turns [00:13:16] out that if you make this one small change, you suddenly get a large boost on ImageNet, where you see an improvement of about 1.3%. [00:13:24] Of course, picking that right phrase is also something that's very difficult to do. And so what people typically do is they don't just pick a single phrase; they pick many different phrases. So "a photo of a dog", "a drawing of a dog", or a bunch of different ideas for different phrases. [00:13:38] And you want to create many different vectors for all of those different phrases you might think of. And at the end, what you do is just take the mean vector representation across all of your phrases for each category, and use that as your mean dog vector, your mean plane vector, and your mean bird vector, right?
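[Editor's note] The prompt-ensembling trick above (mean embedding over many templated phrases per class) can be sketched as follows. The templates are illustrative examples, not the actual set OpenAI used, and `embed_text` is a deterministic toy stand-in for CLIP's text encoder.

```python
import numpy as np

# Hypothetical prompt templates; real systems use many more.
TEMPLATES = ["a photo of a {}", "a drawing of a {}", "a close-up of a {}"]

def embed_text(phrase):
    """Toy stand-in for CLIP's text encoder: a deterministic
    unit vector derived from the phrase's bytes."""
    rng = np.random.default_rng(sum(phrase.encode()))
    v = rng.normal(size=8)
    return v / np.linalg.norm(v)

def class_vector(name):
    """Mean of the embeddings of every templated phrase for a class,
    re-normalized -- the prompt-ensembling trick from the lecture."""
    vecs = np.stack([embed_text(t.format(name)) for t in TEMPLATES])
    mean = vecs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# One ensembled "center" per category, ready for nearest-neighbor lookup.
centers = {c: class_vector(c) for c in ["plane", "dog", "bird"]}
```

The resulting per-class centers slot directly into the one-nearest-neighbor classification described earlier.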
And then [00:13:58] you're back to where you started, and you can do your same sort of one-nearest-neighbor algorithm on this. [In response to a question:] It probably has been trained on ImageNet; this is, I think, a point to show that you can adapt it to a new task. But I will show you other examples of data sets that it has definitely not been trained on, and it does adapt to those as well. [00:14:17] [On what the encoder outputs:] You get a single vector out; it depends on the architecture you're using. If you're using a ResNet, you take the final vector representation. If your encoder is, let's say, a ViT or a transformer, then you usually take the CLS token of your transformer. [00:14:36] Okay, so that's sort of it for CLIP.
You could basically adapt this for a wide variety of new image classification tasks. And to your question right now: of course, it's not that big a deal that it performs just as well on ImageNet, but it is still exciting that it does well on ImageNet at all. [00:14:54] What's more interesting, I think, is when you look at other data sets, data sets that were collected after CLIP came out. So a data set like ObjectNet, which contains objects that people took photos of in very weird places. So they put a banana on the ground and took a photo of it, or they took a banana that was really rotten and took a photo of it. [00:15:12] So, things that are just not common. And on this data set, if you train on ImageNet, you don't do very well, because ImageNet, again, contains most of these categories in their most typical form.
But if you take the CLIP model, [00:15:26] it performs just as well. And that was really, really exciting for many people, because this ability to generalize to a completely new data set that it hasn't seen before, one that's even out of domain to some degree, was really great. [00:15:38] So why do you think this is? Why do you think CLIP generalizes so much better than training on ImageNet? To paraphrase your response, because I think it's the right response: the text that you download from the internet contains a lot more than the category labels. It contains a lot more structural information; it contains information about shape, about the colors of things, and all of that adds to the representations. [00:16:03] And so these models are able to adapt a lot better to something that maybe is slightly out of distribution, or an object that looks slightly different, because it does have all of these other things it's looking
for as well. [00:16:12] And so that additional supervision really helps quite a lot. The other reason it helps quite a lot is the scale of data. ImageNet is only about 1.3 million images or so, whereas the internet contains, at this point, billions of image-text pairs that we can download very easily. [00:16:29] And so these models have just seen so much more data that this adaptation becomes a lot easier. And so people started doing these experiments on a wide variety of generalization tasks. So they showed that you can generalize these models not just to natural images but also to sketches, and that you can also do this on adversarial data sets as well. [00:16:50] And performance across the board seemed to indicate that these models are just really, really good and robust across many different applications.
And then here [00:16:59] I'm showing you the difference between zero-shot and linear probe. And you can see that, of course, linear probe, when you add that additional linear classifier, train it, and adapt the model a little bit, does improve performance on the majority of the data sets, the ones in green. But it's not always the case: in some cases, CLIP zero-shot just performs really well out of the box. [00:17:20] And so it seemed to indicate that we had finally unlocked this capability of being able to adapt image encoders for a wide variety of different downstream tasks. And this is why, I think, a lot of people talk about CLIP as the first sort of foundation model for images. [00:17:35] So let's talk about what makes CLIP work so well. Of course, there are no real labels as such with CLIP; we're just downloading any sort of text associated with images.
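[Editor's note] The linear-probe adaptation compared above freezes the encoder and trains only a single linear classifier on its features. A minimal sketch on toy, made-up "frozen" features, using ridge regression for the linear layer (a real probe would use actual CLIP image embeddings and typically a logistic classifier):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for frozen encoder features: two well-separated classes.
feats = np.vstack([rng.normal(0, 1, (50, 4)) + [3, 0, 0, 0],
                   rng.normal(0, 1, (50, 4)) + [0, 3, 0, 0]])
labels = np.array([0] * 50 + [1] * 50)

# Linear probe = one linear layer on top of the frozen features,
# fit here in closed form by ridge regression on +/-1 targets.
X = np.hstack([feats, np.ones((100, 1))])     # append a bias column
y = np.where(labels == 0, -1.0, 1.0)
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(5), X.T @ y)

preds = (X @ w > 0).astype(int)
accuracy = (preds == labels).mean()
print(accuracy)
```

The encoder's weights never change; only `w` is learned, which is why linear probing is so cheap compared to fine-tuning.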
What makes CLIP work so well [00:17:45] is, as I was saying: when it was first trained, the parameters were just gigantic. They scaled up the model, and they changed the architecture from a ResNet to a ViT, and so you had this transformer architecture with 307 million parameters that was used to train this model. [00:18:04] And the second thing that helped was the amount of data. So instead of just 1.2 million images from ImageNet, you suddenly had about 400 million image-text pairs from the internet that they downloaded and used for training. [00:18:17] So that scale, both in terms of model size and the amount of data, helped improve performance quite a lot.
So immediately after CLIP came out, [00:18:26] people started experimenting with this objective, and there are many different variants of CLIP that have come out over the years. But one in particular that's really stood out came out in 2022. It's called CoCa. [00:18:39] And CoCa took the CLIP model; here you can see it's the same sort of objective. You've got the image being encoded on one side, you've got the text being encoded on the other side, and then you have that contrastive loss between the two. But they added one additional thing: they added a decoder as well, which took the image features from the image encoder, fed them in through cross-attention, and captioned the image. [00:19:02] And it turns out this captioning process also helps the model learn quite a lot of rich information.
So the general motivation here is that [00:19:09] it's not sufficient to just be able to say this is an image of a cat versus a dog; describing that image in text requires a lot more information to be learned by the model. And so the hypothesis is that it's a stronger learning objective, and because of that, it learns better features. [00:19:27] And we found that to be true overall. When you compare CoCa to CLIP, its performance improves quite a lot across all the different ImageNet variants, and overall there's something like a 10% boost in performance across all of the data sets. [00:19:41] And I think this was the first time where these sorts of foundation models actually beat all of the models that we had trained with supervised learning.
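[Editor's note] The CoCa objective described above is a weighted sum of the CLIP-style contrastive loss and a captioning (next-token cross-entropy) loss. Below is a numpy sketch of that combined objective; the embeddings, decoder logits, and the two loss weights are all made up for illustration and are not CoCa's actual values.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings:
    matching image/text pairs sit on the diagonal."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return (loss_i + loss_t) / 2

def caption_loss(token_logits, token_ids):
    """Next-token cross-entropy from the captioning decoder."""
    logp = log_softmax(token_logits, axis=-1)
    return -logp[np.arange(len(token_ids)), token_ids].mean()

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
token_logits = rng.normal(size=(5, 100))       # decoder logits per position
token_ids = rng.integers(0, 100, 5)            # ground-truth caption tokens

# CoCa-style total loss: both signals, with illustrative weights.
total = 1.0 * contrastive_loss(img, txt) + 2.0 * caption_loss(token_logits, token_ids)
```

The hypothesis in the lecture is exactly this: the captioning term forces the image features to carry enough detail to generate the text, not just to match it.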
So at this point, [00:19:50] we had many different models that people were putting out onto online leaderboards, and on those leaderboards, across the years, you can see the trend going upward as models perform better and better. And this is, I think, the turning point where people abandoned supervised learning objectives for image encoders [00:20:06] and instead focused solely on pre-training objectives using these sorts of self-supervised learning methods on internet data. [00:20:16] Okay, so let's talk about some advantages of CLIP. CLIP's got a lot of really fun things that you can do with it. It's super easy to train, right? Because it's just a simple contrastive learning objective. It's also really fast in terms of inference.
You can embed your entire data set [00:20:29] into some representation, and then all you have to do to classify is just do retrieval on that embedded data set. So you can retrieve things very easily with CLIP's representations, which makes it really useful not just for classification tasks but also for search and retrieval tasks as well. [00:20:48] Another thing that people really liked about CLIP is that it's open-vocabulary: you can feed in any text description, and it should be able to retrieve the right images for you. And so that also allows for its applicability across many different domains. [00:21:02] And of course, we're going to talk about this later: CLIP is really amenable to being chained with other models, and this idea of chaining started becoming really popular. But hold off on that; we'll talk about that in a few minutes.
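[Editor's note] The search-and-retrieval use above amounts to: embed the image collection once, then rank images by similarity to an embedded text query. A toy sketch with random stand-in embeddings (a real system would use CLIP's encoders, and a vector index for large collections):

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Pretend these are CLIP image embeddings for a small collection,
# computed once and stored.
image_index = normalize(rng.normal(size=(1000, 16)))

def search(query_vec, k=5):
    """Return indices of the k images most similar to a text query vector."""
    sims = image_index @ normalize(query_vec)   # cosine similarities
    return np.argsort(-sims)[:k]                # top-k, best first

query = rng.normal(size=16)                     # stand-in text embedding
top = search(query, k=5)
```

Because classification, search, and retrieval all reduce to this same similarity lookup, the expensive encoding work is done once per image.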
Of course, I'm telling you [00:21:15] all the good things; turns out there's a lot of bad as well. CLIP, unfortunately, cannot distinguish between these two images: you have an image of a mug in grass, and you have some grass in a mug, and CLIP just does not know the difference between these two things. [00:21:31] Okay. The reason it doesn't know is that CLIP's learning objective really depends on its batch size. If your batch size is not large enough, then all of the other batch elements are unlikely to provide any useful supervision for the model. [00:21:49] If you're always comparing a cat versus a truck, you're not really going to learn a representation for a cat. Instead, what you get is some sort of representation that's kind of okay at some high level. But if you increase the batch size, you're more likely to encounter other animals that are similar to the cat.
And then you learn a much [00:22:06] better representation. And then, of course, if you increase your batch size to, let's say, 32,000, and you train across many, many GPUs, then suddenly you start learning really good representations. You can actually start identifying a Welsh corgi versus another corgi. [00:22:20] And this is only possible when you have gigantic batch sizes, because it requires you to have other negative examples in your batch that are close enough, that are sort of hard negatives, that force the model to learn. Right? So that's very important for getting these models to work well. [00:22:37] But unfortunately, regardless of how much people have tried, increasing this batch size doesn't guarantee that the model will learn a good representation for things. And so you're sort of at the mercy of the randomness of your training data.
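[Editor's note] The batch-size dependence comes from the shape of the contrastive loss: in a batch of B pairs, each image's own caption is the positive and the other B-1 captions are its negatives, so small or easy batches give almost no signal. The toy sketch below (made-up 3-D embeddings) shows that a batch containing a hard negative (a tiger, close to the cat) produces a much larger loss, i.e. more learning pressure, than one with only an easy negative (a truck).

```python
import numpy as np

def info_nce(imgs, txts, t=0.07):
    """Image-to-text InfoNCE: each image's caption is the positive,
    the other captions in the batch are the negatives."""
    imgs = imgs / np.linalg.norm(imgs, axis=1, keepdims=True)
    txts = txts / np.linalg.norm(txts, axis=1, keepdims=True)
    logits = imgs @ txts.T / t                  # (B, B); positives on diagonal
    logits -= logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

# Toy embeddings (made up for illustration):
cat_img,   cat_txt   = np.array([1.0, 0.0, 0.0]), np.array([0.95, 0.05, 0.0])
truck_img, truck_txt = np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.95, 0.05])
tiger_img, tiger_txt = np.array([0.9, 0.1, 0.0]), np.array([0.85, 0.15, 0.0])

easy = info_nce(np.stack([cat_img, truck_img]), np.stack([cat_txt, truck_txt]))
hard = info_nce(np.stack([cat_img, tiger_img]), np.stack([cat_txt, tiger_txt]))
print(easy < hard)  # True: the hard negative yields a larger loss
```

Larger batches raise the odds that such hard negatives appear by chance, which is the lecture's point about why 32,000-sized batches help.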
So increasing the batch size [00:22:52] does help with some amount of fine-grained concepts, but of course it's still limited, and training with a batch size of 32,000 is just too large for most labs to even consider doing. [00:23:06] People have identified this error across many different benchmarks and sort of identified that CLIP just doesn't have this notion of compositionality. So this idea of the mug in the grass versus the grass in the mug: it's really about composing different concepts, like the mug and the grass and the relationship between them, and all of those individual components are not composed well in your CLIP representations. [00:23:31] And there have been a ton of different benchmarks, like Winoground or CREPE or ARO, and a lot of these benchmarks have actually come from my lab.
They just [00:23:40] keep finding, over and over again, that CLIP has a ton of limitations and that there's a ton of things it's just unable to do. Now, of course, in reaction, the community immediately started thinking about how to handcraft batches so that they contain those hard negatives. You know, if I have one type of corgi, I should hopefully have another type of corgi in there, so your model really is forced to learn good representations. [00:24:04] And so this idea of training with hard negatives became really popular in the community, for a whole year, until we released a follow-up paper showing that if you train with hard negatives, you actually end up unlearning a lot of things about semantics.
[00:24:20] For whatever reason, and this is something that we still don't theoretically understand, we actually end up with much worse generalization performance across different environments and different kinds of datasets. So there's still a lot of work to be done in terms of figuring out the right way of constructing your dataset, and the right way of constructing your batches and training signal. We're still really far away from that, but regardless, people are still very excited about CLIP in general, because it does give you some amount of supervision regardless. Of course, image-level captions are again not enough. Ideally, what we want is more than just that, right? We want to be able to identify not just that there's a person crossing the street, but that the person is in this location, the car is here, the street is here.
[00:25:06] All of that information, that grounding information, is completely missing in CLIP. And so ideally, you'd want your dataset to also contain this kind of information, and your model to be able to reason about it as well. The final thing that's a big disadvantage for CLIP is that regardless of how big your dataset is, even if you collect upwards of, let's say, 5 billion images, it's still not going to be enough to capture all the important things that you might care about. And so there's been a lot of effort in data filtering: how do you filter the internet to find the best training data for training these CLIP models? I won't go into that today, but there are all of these mechanisms that people are exploring, and that's now become the frontier of what today's research looks like in this field. Okay.
[00:25:52] So that's the first branch of foundation models we talked about: generalizing classification to a whole host of tasks. Now let's talk about vision-and-language models. There's a new class of foundation models which has become popular in the last two and a half years, and we often refer to them as multimodal language models. I'll start off this discussion by focusing on LLaVA, which is arguably one of the first multimodal language models that became very, very popular. The motivation here is that language models do this next-token prediction, this autocomplete process, and that process is really useful for adapting to a lot of new tasks. So can we start thinking about image models doing the same thing?
[00:26:41] Can we, given an image, also start doing different kinds of reasoning similar to this autoregressive process? That gave rise to this class of models called visual language models, or multimodal models. But of course, just to be historically correct, this idea wasn't completely new in 2022. In 2019, ViLBERT actually introduced this idea: there's a paper called ViLBERT from 2019 that took these image models and language models and put them all together to accomplish generalization across different tasks. But those earlier systems were trained pre-transformers and also mostly used LSTMs instead.
[00:27:23] And so the rebirth of all of this is what's happening right now with LLaVA, where a lot of these models switched over to a better architecture, switched over to a better set of objectives, and now aren't just training on individual tasks, but are training on a foundation of a variety of different tasks using some sort of pre-training objective from the internet. So how does this work? How do you think about LLaVA? To talk about LLaVA, let's take a step back and think about the transformer model, or self-attention in particular. When we think about language models, what they're doing is attending over the past. You have a sequence of words coming in, for example "cats are so", and then your model, to generate the next word, will attend over that historical context and generate what it thinks the next word should be.
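"Attending over the past" can be sketched in a few lines. This is a toy single-head dot-product attention with no scaling, projections, or batching, purely to show the causal restriction that position t only sees positions 0..t:

```python
import math

def causal_attention(queries, keys, values):
    """Single-head attention where position t may only attend to
    positions <= t -- i.e. the model attends over the past."""
    out = []
    for t, q in enumerate(queries):
        scores = [sum(qi * ki for qi, ki in zip(q, keys[s]))
                  for s in range(t + 1)]          # only past positions
        m = max(scores)
        w = [math.exp(s - m) for s in scores]     # stable softmax
        z = sum(w)
        w = [x / z for x in w]
        dim = len(values[0])
        out.append([sum(w[s] * values[s][d] for s in range(len(w)))
                    for d in range(dim)])
    return out

# With identical queries/keys/values, the first position can only see
# itself, so its output equals values[0] exactly.
v = [[1.0, 0.0], [0.0, 1.0]]
out = causal_attention(v, v, v)
assert out[0] == [1.0, 0.0]
```

A real language model adds learned projections, multiple heads, and a vocabulary head on top, but the causal structure is exactly this.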
[00:28:13] So it might think that the phrase should be "cats are so cute." And here's another way of representing that same objective: you've got the input text coming in at the bottom, and your model will generate the next word, which is "cute." So when we think about vision-language models, what people usually are referring to is adding in additional context by grounding the conversation we're having in some image that we care about. We might tokenize our image somehow and feed those tokens into our language model along with the historical context of "cats are so", and then use that to autocomplete the rest of the description. So that's the basic idea behind LLaVA: feed in these image tokens along with the words being generated to continuously generate more words about that image.
[00:29:04] So of course a question comes in, which is: how do you define these tokens? What should these tokens be in the first place? LLaVA's solution was to use the CLIP image encoder. They took the CLIP model, took the image encoder, and basically extracted tokens from that encoder. The first thing you might think about doing is just using the CLS token. So here you've got, let me see if my mouse works here, oh, it does, okay. You've got the image coming in over here, and it gets split into patches. Each patch turns into a representation that's fed into your transformer architecture in CLIP. It goes through a bunch of layers of processing, and then at the end, you get a token for each of the patches along with a representation for the CLS token. And so far, we've only been considering the CLS token.
[00:29:54] We've only been doing things with the CLS token for any sort of classification task, but there are all of these other tokens in there as well. Now, the problem with these other tokens is that they're never supervised, right? The CLS token is supervised with the contrastive objective against text, but the other tokens are never used for any purpose, so they might not actually contain any useful information. And empirically, people have shown that these features are not very useful. But what they have shown is that if you go one more layer back, to the penultimate layer in your CLIP encoder, those features are actually very useful. These features are used to generate the final CLIP embedding in the final layer, and they contain a lot of spatial information about where objects are in your entire image.
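In shape terms, the recipe is: take the per-patch hidden states from the second-to-last transformer block and drop the CLS token. The sketch below uses a stand-in for the ViT forward pass with made-up dimensions, not real CLIP (with Hugging Face's CLIPVisionModel, the analogous move is calling the model with output_hidden_states=True and selecting hidden_states[-2]):

```python
def fake_vit_hidden_states(num_layers=12, num_patches=49, dim=768):
    """Stand-in for a ViT forward pass: one hidden state per layer,
    each of shape (1 + num_patches, dim), CLS token first. Each
    position is filled with its layer index so we can check which
    layer was selected."""
    return [[[float(layer)] * dim for _ in range(1 + num_patches)]
            for layer in range(num_layers)]

def patch_tokens_for_llm(hidden_states):
    """Select the penultimate layer and drop the CLS token, keeping
    only the per-patch features that carry spatial information."""
    penultimate = hidden_states[-2]
    return penultimate[1:]  # drop CLS at index 0

hs = fake_vit_hidden_states()
tokens = patch_tokens_for_llm(hs)
assert len(tokens) == 49       # one token per patch, CLS removed
assert tokens[0][0] == 10.0    # layer 10 = penultimate of layers 0..11
```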
[00:30:43] And so this is what people typically use when combining a CLIP encoder with a transformer-based LLM. Okay, so this is what the entire LLaVA architecture looks like. You feed an image through your pre-trained CLIP encoder and extract a bunch of features from it. You take those features and pass them through a linear layer that you need to train. What this linear layer will learn to do is convert your CLIP representations into something that the LLM can understand and make sense of. Okay, and once you have these tokens, you basically pass all of your tokens to your language model, and it can now generate some conversations about that image itself. So LLaVA was one of the very first popular models out there. And following up, Google quickly released Flamingo.
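The trainable piece described here is small: a linear projection from the vision-feature width into the LLM's embedding width, after which the projected image tokens and the text embeddings form one input sequence. A minimal sketch with made-up toy dimensions (none of the sizes below are the real LLaVA ones):

```python
def linear(x, weight, bias):
    """y = W x + b for a single vector x."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def build_llm_input(patch_feats, text_embeds, weight, bias):
    """Project each vision token into the LLM embedding space, then
    prepend the projected image tokens to the text embeddings."""
    image_tokens = [linear(f, weight, bias) for f in patch_feats]
    return image_tokens + text_embeds

# Toy dims: vision dim 3 -> LLM dim 2, two patches, two text tokens.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
b = [0.0, 0.0]
patches = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
text = [[0.1, 0.1], [0.2, 0.2]]
seq = build_llm_input(patches, text, W, b)
assert len(seq) == 4          # 2 image tokens + 2 text tokens
assert seq[0] == [1.0, 5.0]   # first patch projected through W
```

During training, only W and b (the projector) need gradients at first; the CLIP encoder and the LLM can stay frozen.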
[00:31:36] And Flamingo followed very much the entire LLaVA setup of combining vision encoder features with a large language model. But the place where they innovated is in how you do the fusing of those different features. In LLaVA, you had the features coming in through a linear layer and fed in as part of the input. In Flamingo, what they did instead was basically take all the features coming out of your vision encoder and feed them into every layer of your LLM. Okay? So they had to make some changes to the LLM architecture itself, and this is how they made those changes. Here's an example of what Flamingo's training data looks like. You've got images that are encoded: you've got this dog and you've got this cat. They both get embedded, and they're both going to be fed into every single layer of your LLM.
[00:32:26] And down here, you've got data where every example starts with an image and describes that image, then the next image and a description of the next image, and so on and so forth. They're fed in as input to your LLM, and your output is going to be to autocomplete the text for that last image. Okay, so you've got one image followed by a description of the dog, then a second image, and you start the description, and your model will be trained on autocompleting that description for the second image. So what did they do? What did they change in the model itself? They added this gated cross-attention ("xattn") module to every single layer of your LLM. And they made one other change: they also added this Perceiver Resampler right here, which basically samples and downsamples your image representations.
[00:33:15] So they're smaller in dimension, and there's a fixed number of tokens for every single layer. Let me go into some details of what these look like. This is the full architecture overall. Most of the components are frozen: all the language model weights are frozen, and all the vision model parts are frozen. The only parts that are trained are these Perceiver Resampler components and the cross-attention layers that are added into every single layer of your LLM. So let's talk about what this cross-attention module looks like; this is me zooming into that cross-attention module.
[00:33:51] So in every single LLM layer, right before the LLM layer, you have this cross-attention component, and its purpose is to look at the image features and decide which parts of them it wants to keep around and which it thinks will be useful for the language model to know about. They designed it as a set of components you've already seen so far. You attend over the image features using a cross-attention layer, and following that cross-attention, they added a tanh nonlinear activation; this is basically deciding which parts of these components to keep around and which parts of the image to forget. Then it goes through a fully connected layer, where it adapts those representations a little bit, and then again a tanh nonlinearity to decide once more which parts it should keep and which parts it shouldn't.
[00:34:38] Once it goes through those two components, each with a residual connection across it, it then goes to your normal language model processing and continues to generate the word it needs to. Okay, so these additional layers are added just as a way for the language model to incorporate and attend over the vision features at every single layer. The actual modification itself, if you're interested in what this looks like in code, is just about two or three lines, where they added this cross-attention layer and then this tanh nonlinearity in between, and that's really about it. So in terms of code, it's a very minimal change, although for the model, it's a gigantic change, because now it can choose what parts of the image to attend to at every single layer of its processing.
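Those "two or three lines" can be sketched at the level of the residual stream. One real detail from the Flamingo paper is that the gates are scalars initialized to 0, so tanh(gate) = 0 and the block starts out as the identity, i.e. the model initially behaves exactly like the frozen LM. Everything else below is a simplified stand-in: the attention and feed-forward sub-layers are reduced to placeholder functions, not the real modules.

```python
import math

def gated_xattn_dense(x, vision_feats, attn, ffw,
                      gate_attn=0.0, gate_ffw=0.0):
    """Flamingo-style gated block: each sub-layer's output is scaled
    by tanh(gate) and added residually. With gates at 0 the block is
    the identity, so training can start from the frozen LM."""
    x = [xi + math.tanh(gate_attn) * ai
         for xi, ai in zip(x, attn(x, vision_feats))]
    x = [xi + math.tanh(gate_ffw) * fi
         for xi, fi in zip(x, ffw(x))]
    return x

# Placeholder sub-layers (the real ones are cross-attention and an MLP).
attn = lambda x, v: [sum(v) for _ in x]
ffw = lambda x: [2.0 * xi for xi in x]

h = [1.0, -1.0]
assert gated_xattn_dense(h, [0.5, 0.5], attn, ffw) == h   # gates closed
out = gated_xattn_dense(h, [0.5, 0.5], attn, ffw, 1.0, 1.0)
assert out != h   # open gates let vision information flow in
```

The gate-at-zero initialization is the design choice that makes the "gigantic change" safe: the pretrained LM's behavior is untouched at step 0, and the gates open gradually during training.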
[00:35:25] So you give the model a lot of ability to decide when and how to attend over the vision features. Okay, so Flamingo was very, very exciting, but training it was very difficult, and they had this really ingenious way of training it that allowed their models to adapt to many different tasks. The way they trained it was through this concatenation of a bunch of different images together. You didn't just have one image and one description. You had a description at the beginning that says "here are some cute pictures of my pets", end of sentence, beginning of image, then a description of that first image, end of that first component, then the second image, and a description of the second image. Okay, so you had the training set up so that it looks like a long sequence of image-text, image-text interleaved data.
[00:36:19] And of course, when describing any single image, you don't want the model to look at the entire context; you want it to look only at that one particular image. And so they created a masking scheme where, when generating, the model only looks at that particular image's features and not the other ones. Meaning that when you're generating the description "my puppy is sitting in the grass," you're only looking at the features that correspond to the puppy while generating those words. Similarly, when generating the description for the cat, you're only looking at the cat image and not the other image. So there is this distinction where they created this handcrafted masking scheme to make sure your descriptions are always following and looking at only that particular image.
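The masking rule can be stated precisely: each text token may cross-attend only to the most recent image that precedes it in the interleaved sequence. A sketch, assuming a toy list representation where images are marked with an "<img>" token (this is not Flamingo's actual data format):

```python
def xattn_image_mask(sequence):
    """For an interleaved sequence like ["<img>", "my", "puppy",
    "<img>", "my", "cat"], return for each text position the index
    (in order of appearance) of the only image it may cross-attend
    to, or None if no image precedes it."""
    mask = {}
    current = None      # most recent image seen so far
    img_count = 0
    for pos, tok in enumerate(sequence):
        if tok == "<img>":
            current = img_count
            img_count += 1
        else:
            mask[pos] = current
    return mask

seq = ["<img>", "my", "puppy", "<img>", "my", "cat"]
m = xattn_image_mask(seq)
assert m[1] == 0 and m[2] == 0   # "my puppy" sees only image 0
assert m[4] == 1 and m[5] == 1   # "my cat" sees only image 1
```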
[00:37:04] But when trained, the model does get to see the entire context of everything that it's generating. So why is that helpful? Why is this entire process of being able to see all of this stuff together helpful? Well, it's helpful because it allows you to do these kinds of applications. Here are three different applications that Flamingo was able to showcase, and they all center around having multi-turn conversations or dealing with multiple images. In the first case, you've got an image that's fed in, and the Flamingo model describes the image by saying that this is a picture of two teddy bears on the moon. And then what it allows people to do is ask another question. So people can ask, "What are they doing?" And because it's trained using an existing large language model, that large language model's reasoning capabilities are inherited.
[00:37:53] And now it can reason and answer this particular question; it can answer and say the teddy bears are having a conversation. Then a user might ask, "What objects are they using?" And again, Flamingo can say that it looks like a computer, and so on and so forth. So you can enable this multi-turn dialogue about an image simply by doing two things: first, pre-training the language model and then incorporating that language model into Flamingo; and second, allowing your model to see many different images and many different turns throughout its training data, so it can adapt to longer sequences of text. You can also give it multiple images and ask what is common across these images, and now the Flamingo model will look at each of those different components, reason, and say that they're all flamingos.
So you can start doing a lot of these [00:38:38] kinds of really cool applications. People also showed that you can start doing in-context learning. I don't know if this is something you've seen already with language models, but I'm sure you've used in-context learning with GPT, where you tell GPT: here's an example of what I want, give me more things like this. You can do the same thing with Flamingo. You pass in an image and a description, another image and a description, and now, when you pass in a new image, it'll give you a description. Or you can pass in an image with a question and answer, another image with a question and answer, and then, when you pass in a new image and just ask the question, it'll give you the answer. Right? So you're not training it to do these different kinds of tasks; you're providing it with examples of the behavior it should have, and it just generalizes to new kinds of behaviors that you might care about. Similarly, you might care about classification, and you can use Flamingo to do classification as well: you can give it an image and say "this is underground," another and say "this is congress," and then ask, "what is this?" Right? You can even teach it to do OCR and math, where you give it an image and say, "this should correspond to 2 + 1 = 3," and eventually, when you give it a new image, it can autocomplete, extract out 3 * 6, and also give you the output by reasoning through this entire process. Yeah.
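The in-context pattern just described, alternating image/text example pairs followed by a query image, amounts to assembling an interleaved prompt. The `<image>` placeholder and the helper below are illustrative assumptions, not Flamingo's actual tokenization:

```python
# Sketch of assembling a Flamingo-style few-shot prompt.
# "<image>" is a stand-in placeholder token; the real model uses its own
# special tokens and an interleaved image-text tokenizer.

def build_fewshot_prompt(examples, query_image_id):
    """Interleave (image, caption) example pairs, then append the query image.

    `examples` is a list of (image_id, caption) pairs; image_id stands in
    for the actual image tensor fed to the vision encoder.
    """
    parts = []
    images = []
    for image_id, caption in examples:
        images.append(image_id)
        parts.append(f"<image> {caption}")
    images.append(query_image_id)
    parts.append("<image>")  # the model completes the description here
    return images, "\n".join(parts)

images, prompt = build_fewshot_prompt(
    [("img_0", "A photo of a flamingo."),
     ("img_1", "A photo of a pelican.")],
    "img_2",
)
```

Swapping the captions for question/answer pairs gives the VQA variant of the same trick.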
So this would be an example of few-shot learning, where you [00:39:59] give it a few examples of things and then ask what the new thing should be. If I were to throw away all the in-context examples, that would be zero-shot learning. So, no, we're not concatenating the images; technically, we pass the image tokens through the Perceiver Resampler into every single layer of the LLM instead. Only the text is ever concatenated and fed as input to the Flamingo model, and it chooses when to attend to which parts of the image. You give the image to it once. But behind the scenes, of course, this is just the web interface; what they actually do behind the scenes is cache the model, assuming the user will want to continue talking, so the model is cached and ready to accept more tokens. But if they did not cache it, then yes, it would pass in the entire conversation as input. Yeah. Okay. So Flamingo was super cool; they have these really big tables in their paper that you can go check out. What was really cool about it is that there were all of these tasks that were very difficult, where you had to adapt CLIP to do them, but Flamingo was just able to do them zero-shot or few-shot. You started seeing these gigantic improvements across many different benchmarks, and this is when I think the field shifted from reporting on a few classification benchmarks to reporting on any sort of understanding task at all. As long as you can frame it as a question-answering process, you can build benchmarks for a wide variety of skills, and we started seeing that become the norm over the last two years in the computer vision field. Okay.
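The mechanism described above, image tokens entering every LLM layer through cross-attention rather than concatenation, can be sketched in a few lines. The tanh gate initialized at zero follows the Flamingo paper's gated cross-attention idea, so the layer starts out as a no-op that preserves the pretrained LM; all weights here are random stand-ins, not real model parameters:

```python
import numpy as np

# Minimal single-head cross-attention: text tokens (queries) attend over
# resampled visual tokens (keys/values), as inserted into each LM layer.
rng = np.random.default_rng(0)
d = 16                    # hidden size
T, V = 5, 4               # text tokens, resampled visual tokens

text = rng.standard_normal((T, d))    # queries come from the text stream
visual = rng.standard_normal((V, d))  # keys/values come from image tokens

Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

q, k, v = text @ Wq, visual @ Wk, visual @ Wv
attn = softmax(q @ k.T / np.sqrt(d))  # (T, V): each text token over visuals
gate = np.tanh(0.0)                   # gate starts at 0 => no-op at init
out = text + gate * (attn @ v)        # residual; LM behavior preserved at first
```

Because the gate starts at zero, training can gradually blend visual information in without destroying the pretrained language model.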
So this is where we were, I [00:41:34] think, sometime last year. Seeing the success of LLaVA, a lot of companies started investing quite heavily in these models, and so you started seeing a lot of API models: GPT-4o, GPT-4V, Gemini 1.5 Pro, Gemini 1.5 Flash. A lot of these models started being released, and even Anthropic came into the picture with Claude 3 Opus, and now, of course, Claude 4 Opus is out. So you had a lot of these models come out, and they were performing a lot better on a bunch of these benchmarks. Here I'm showing you the average performance across 11 of the more popular visual understanding benchmarks in the field, and there's this gigantic difference, right? LLaVA, the open-source model we talked about, is down here at about 43% accuracy on average, while GPT and all of these other models are performing much, much better, somewhere in the high 70s or 80s. So, a big difference in performance between these two kinds of models. Of course, immediately on seeing this sort of difference, people started distilling GPT and Gemini into distilled variants and trying to release those models. So Alibaba, which is a company in China, released this model called Qwen.
And then there's [00:42:53] InternVL, there's Phi, there's all of these different models that started coming out, and all of them were distilled from GPT, or if not GPT, then Gemini. Now, that led to a big problem in the field, a problem that's become a big part of what my own research agenda has been trying to focus on, which is that we don't actually know, as a research community, how to build really performant vision-language models. The tricks behind how to build them: only the people at OpenAI and on the Gemini teams at Google know how to build these kinds of models. But the open-source community, they're down here. This is where the research community was as of last year. Of course, you can argue these are really nice open models, but they're not really open, because they're distilled. We don't actually know how to reproduce these models, right? We can only produce them if GPT exists; if GPT doesn't exist, we don't know how to create these other models. And so what my own research agenda has been focused on over the last couple of years is figuring out how to close this gap: how do you build really good multimodal language models and disseminate that understanding to the entire community? And so what we've done over the last six months, or about a year now, is create our own class of models that we call Molmo, and I'm showing you Molmo's performance up at the top. What sets Molmo apart from all the other models out there is that it's completely open source, meaning it's open weights, so you can download the model, and it's open data, meaning you can download the training set as well as the evaluation set.
It's also open code, meaning you [00:44:33] can basically train your own Molmo in your own home, assuming you have enough GPUs. You can also add on new evaluations, adapt this model for all kinds of new things, and of course start using it in a wide variety of different contexts. Now, of course, academic benchmarks are not enough, right? Because what we care about at the end of the day is: are people going to use these models? Will people want to use these models over GPT? And so, to make sure we had that evaluation properly done, we released a playground with Molmo and did a gigantic user study where we compared, head to head, outputs from our models versus outputs from all the other models. And our model has essentially the same Elo rating as GPT: it comes in second, with a difference of one Elo point versus GPT-4o.
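For reference, here is a minimal sketch of how Elo ratings are updated from pairwise preferences like these. The K-factor and the 400 scale are the conventional chess values; the study's exact scoring setup may differ:

```python
# Elo-style rating update from a single pairwise comparison.

def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    """Return updated ratings after one head-to-head comparison."""
    e_a = expected_score(r_a, r_b)
    score = 1.0 if a_won else 0.0
    delta = k * (score - e_a)
    return r_a + delta, r_b - delta

# Two models start equal; A wins one comparison.
ra, rb = elo_update(1000.0, 1000.0, a_won=True)
```

Running hundreds of thousands of such updates over shuffled comparisons is what produces the leaderboard-style rankings mentioned above.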
This is that [00:45:23] same graph rotated so that I can show you some examples. This was a gigantic evaluation, by the way: about 870 users that we showed these model outputs to, and about 325,000 pairwise comparisons where we asked people which model's output they prefer. Our Molmo model ranked, again like I said, second, more or less a coin flip between GPT and our model in terms of what people prefer. But it already beat out Gemini 1.5 Pro and Claude 3.5. Now, the big difference is that we are a small research lab, and we're beating out Google's billions of dollars of investment into Gemini, as well as Anthropic's billions of dollars of investment, while already matching GPT. So we were quite excited by this entire process. But we also developed a 7-billion-parameter model that comes in right after those big models.
And that 7-billion [00:46:13] model is really exciting, because you can put it on a single GPU. So you can now have a model capable of doing a wide variety of vision tasks that works on a single GPU, meaning a lot of people can use it and fine-tune it for all kinds of things. We released this model on September 25th, and the community was very excited by it. This was the first time a very performant open multimodal vision-language model was released, and a ton of people started talking and writing articles about all the ways they want to use it. One of the use cases that kept popping up over and over again was the idea of finally using Molmo for robotics applications. I won't talk about robotics today, because you're going to learn about it in the next class, but I do want to give you some examples of things people were excited about with robotics.
A ton of people, [00:47:00] even folks at NVIDIA, started chatting about how you should never bet against open source: regardless of how much model development you do in private, eventually the open-source community will catch up, and we were catching up at that point. And so, seeing our model out, Meta quickly released their Llama 3.2 model in response, and a lot of people did evaluations comparing Molmo versus Meta's Llama model, and again, I'm very happy that we came out on top of Llama as well. So let me show you why Molmo does so well. The trick to getting these models to work very well was to ground the model's decision-making in the pixels themselves. Usually, when you give a model a question like "count how many boats there are," it'll give you some number, and oftentimes it hallucinates. But what sets our model apart is that it actually points to all the things that it's counting. So it generates points on all the boats and then outputs a final number; its decision-making is grounded in the pixels themselves. And this allowed us to train a model that, unlike Meta's Llama, which was trained on about 6 billion image-text pairs, was trained on only 700,000 image-text pairs. The big difference was that we hand-curated those 700,000 image-text pairs, and that was the biggest difference between what we were able to do and what the models these companies were building were doing. So a lot of folks are currently trying to download these image-text pairs from the internet, right? That's been the foundation of how a lot of people train these vision-language models.
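The point-then-count behavior described above could be post-processed like this; the `(x, y)` output format shown is a made-up illustration for the sketch, not Molmo's actual serialization:

```python
import re

# Parse 2-D points out of a model response and count them, so the final
# number is grounded in one point per detected object.

def count_from_points(model_output):
    """Extract '(x, y)' points from a response string and count them."""
    points = [(float(x), float(y))
              for x, y in re.findall(r"\(([\d.]+),\s*([\d.]+)\)", model_output)]
    return points, len(points)

# Hypothetical response for "count how many boats there are":
response = "boats: (12.5, 40.0) (88.0, 41.2) (150.3, 39.7) -> 3"
points, n = count_from_points(response)
```

Checking that the emitted count matches the number of emitted points is one cheap way to catch the hallucination failure mode the lecture mentions.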
You collect a lot of internet [00:48:40] data of images with their associated text. But the problem with internet data is that it's incidental. The text that's often associated with an image describes something subjective, something the uploader felt about the image; it rarely actually talks about the contents of the image itself. Meanwhile, this is what our data looks like: for a single image, we have a dense description of the actual contents of that image, and we have things that people never talk about on the internet. There's a ton of task knowledge about the visual world that we just never speak about. I will never tell you that something is to the left of something else, just because it's unnatural for us to do that. It's so obvious that something is to the left of something; why would you ever communicate that information? So that's the kind of information we started eliciting from people. We started getting people to talk about how things have a particular size, like large, or a shape, like rectangular. We talked about material, like polished and rich, and about positioning across the image, like something spanning the horizontal plane of the image. All of this information is really what makes these models more performant. Here's another example from the dataset: a very simple image of a phone or tablet screen, and we have information here that, again, is completely missing from the internet, things that people would find helpful, like: this is a tablet device, the time is this, the amount of power left in your device is this. This is the kind of information that would help people use these models.
But this is, again, the kind of [00:50:07] information we never talk about on the internet. And so, to get this kind of information, we designed a lot of different questions. We spent two years doing different kinds of elicitation studies to figure out what the right pieces of information missing from the internet are, and how to elicit them as effectively as possible. One thing that was very important is that we had all of our annotators not type descriptions but speak them. Talking automatically breaks a lot of the conventions around Gricean maxims, and so, by getting people to talk, we got them to say things they would never usually type. The model itself didn't look any different from LLaVA. We had the same setup: a CLIP encoding coming in, a connector that was just a linear layer, and then a large language model that would take in all of these tokens and output whatever you care about. So the model itself looked very similar to existing models; the biggest difference was in the data, and in the quality and density of that data. And because of this grounding capability, where the model grounds its decision-making in the image itself, you can get Molmo to do things that you can't use any of the other models to do. Things like: point to the menu, and it actually tells you where that menu item is. Or you can tell it to point to where you can set your search options, and it'll show you, okay, this is where you might want to set those options. Or point to where the mid-size datasets are, and it'll tell you which option you need to move.
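The architecture just described, a vision encoder, a linear connector, and an LLM that consumes the concatenated tokens, reduces to a few matrix operations. Dimensions and weights below are illustrative stand-ins, not the real model's:

```python
import numpy as np

# LLaVA/Molmo-style layout: frozen vision-encoder patch features, a linear
# connector projecting them into the LM's embedding space, and the LM
# consuming [image tokens ; text tokens].
rng = np.random.default_rng(0)
n_patches, d_vision, d_model, n_text = 9, 32, 64, 6

patch_feats = rng.standard_normal((n_patches, d_vision))  # CLIP-style features
W_connector = rng.standard_normal((d_vision, d_model)) / np.sqrt(d_vision)
text_embeds = rng.standard_normal((n_text, d_model))      # tokenized question

image_tokens = patch_feats @ W_connector                  # the linear connector
lm_input = np.concatenate([image_tokens, text_embeds], axis=0)
```

The point of the sketch is how little architecture there is: the lecture's claim is that the data, not the connector, is what changed.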
Um, and I already showed you that you [00:51:37] Um, and I already showed you that you can point to count, but you can also [00:51:39] can point to count, but you can also point to do really fine grain things [00:51:41] point to do really fine grain things like being able to sort of ask what is [00:51:43] like being able to sort of ask what is the route number on this bus. MoMA [00:51:46] the route number on this bus. MoMA doesn't just simply give you an answer. [00:51:47] doesn't just simply give you an answer. It actually points to where in the [00:51:49] It actually points to where in the image, in this case, there is this area [00:51:51] image, in this case, there is this area right here that contains the bus number [00:51:53] right here that contains the bus number and then returns the bus number to you. [00:51:55] and then returns the bus number to you. Uh, you can ask it to reason about how [00:51:57] Uh, you can ask it to reason about how many cars on the left versus how many [00:51:59] many cars on the left versus how many cars are on the right. You can ask it to [00:52:01] cars are on the right. You can ask it to reason over depth images or overhead [00:52:03] reason over depth images or overhead images or even really crowded scenes and [00:52:06] images or even really crowded scenes and sports uh areas. What's also really [00:52:09] sports uh areas. What's also really exciting and again we'll talk about this [00:52:10] exciting and again we'll talk about this in a few minutes is this idea of [00:52:12] in a few minutes is this idea of chaining that keeps coming up all across [00:52:15] chaining that keeps coming up all across multimodal models today. The idea of [00:52:17] multimodal models today. The idea of chaining Momo to other models. Uh what [00:52:20] chaining Momo to other models. 
[00:52:20] What you can do is chain the output of Molmo to become the input of another model like SAM 2. And so you can tell Molmo to point to the cricket bat. And now you take that point, you feed it to a model like SAM 2, which does segmentation, and now you can do segmentation of that cricket bat across time. And so you can start enabling all kinds of new applications. Here's one that we played around with in the office, which again you're hopefully going to learn about in the next lecture, when you hear about robotics. We asked Molmo to point to where the water bottle is, and then we moved the robot using simple motion planners to that water bottle. Next we ask it to go move that water bottle to where the dirty dishes are. It points to the sink, and then moves the robot there.
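The point-then-segment chaining described here can be sketched as a tiny pipeline. Both model calls below are stand-ins (a lookup table instead of real Molmo and SAM 2 inference) so the sketch is self-contained; the point is only the wiring, where one model's output becomes the next model's prompt:

```python
def point_with_vlm(image, query):
    """Stand-in for a pointing VLM like Molmo: returns an (x, y) image
    coordinate for the queried object. Here it's just a dict lookup."""
    return image["points"][query]

def segment_at_point(image, point):
    """Stand-in for a promptable segmenter like SAM 2, which accepts a
    point prompt and returns a mask (here, whatever lives at that point)."""
    return image["objects"][point]

def chain(image, query):
    """Chaining: the VLM's point output becomes the segmenter's prompt."""
    return segment_at_point(image, point_with_vlm(image, query))
```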
[00:53:07] And then we tell it to go point to where the free space is in the sink and put that bottle in that location. So again, you can combine all these capabilities together now and chain them to even automate a lot of robotics applications. This has been a lot of the focus in my group now: adapting a lot of these vision-language models and enabling a lot of generalization in the actual physical domain. So the question is around whether these models would be able to point if you were always changing the resolution of the image to a fixed resolution, right? Well, it turns out that you can actually adapt these models to any resolution nowadays. There are mechanisms like FlexiViT that have introduced a way of allowing variable-size image input, and you can adapt the models to point in that new space instead.
[00:53:58] So your model's position embeddings basically change depending on how big your image size is, and the models typically tend to generalize well. So that was the conversation around adding vision and multimodal models together. In the last 20 minutes that we have left, I want to talk about generalizing these foundation models to not just deal with image classification and text, but to be able to generalize to any sort of output space you might care about. And one of those models that's become really popular in this space is this segment anything model. The Segment Anything Model, or SAM for short: what it tries to do is build a segmentation model that's a foundation model for all kinds of segmentation tasks.
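The position-embedding adaptation mentioned just above (resizing a learned grid of embeddings to match a new image resolution, in the spirit of FlexiViT) can be sketched as a plain bilinear resize. This is a minimal stand-in, not FlexiViT's actual resizing procedure:

```python
import numpy as np

def resize_pos_embed(pos, new_h, new_w):
    """Bilinearly resize a (H, W, D) positional-embedding grid to
    (new_h, new_w, D), so a ViT trained at one resolution can be
    run at another."""
    h, w, d = pos.shape
    ys = np.linspace(0, h - 1, new_h)
    xs = np.linspace(0, w - 1, new_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]    # fractional weights along height
    wx = (xs - x0)[None, :, None]    # fractional weights along width
    top = pos[y0][:, x0] * (1 - wx) + pos[y0][:, x1] * wx
    bot = pos[y1][:, x0] * (1 - wx) + pos[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```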
[00:54:41] So really what they're trying to do is allow anybody to point to things that they care about in the image, and then hopefully have that thing be something that the model can output a mask for. So for example, you want a model that generalizes beyond just a fixed number of categories to any sort of category you might care about, and you would ideally want these outputs to be masks for any category that is of interest to the user. Right? So those are the two goals: we want to generalize to a huge number of categories, and we ideally want to be able to very specifically output something that the user really cares about. So there are challenges on both fronts.
[00:55:20] There are challenges both in figuring out how do you collect a large amount of data that spans a wide variety of categories, and in how do you design an architecture that really pinpoints what the user cares about. Now, let's start with the second question first. It's really ambiguous when we want a mask for something. So imagine a scenario where you have two cats in an image, and a user comes in and says, hey, I want a segmentation for the cat. But it's really not clear which cat they want a segmentation for. Ideally, again, if you had Molmo's pointing capability, you could actually point to which cat you care about, and then, depending on the point, you could create the masks that matter. Now, of course, these are not very good masks.
[00:56:02] And ideally, you want these masks to be of very high quality, so that they can support a wide variety of downstream applications like image editing or anything else you might think of. So to build this architecture that allows any user to specify exactly what they care about, we needed to go beyond just simply typing in text what you care about. Right? So the SAM architecture has two components, or three specifically. It's got the image encoder, which again could be a CLIP encoder, and it's got a prompt encoder, which is something special. This prompt encoder really tries to encode text or points or bounding boxes or any way that a user might want to specify what they care about. And then given these two things, it passes them through this really lightweight decoder that outputs a mask.
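The three components just listed can be wired together like this. The encoders and decoder are passed in as plain functions so the sketch runs; real SAM uses a ViT image encoder, a prompt encoder for points/boxes/text, and a lightweight transformer mask decoder:

```python
import numpy as np

def sam_forward(image, prompt, image_encoder, prompt_encoder, mask_decoder):
    """SAM-style forward pass: encode the image once, encode the user's
    prompt, and let a lightweight decoder combine them into a mask."""
    img_emb = image_encoder(image)
    prompt_emb = prompt_encoder(prompt)
    return mask_decoder(img_emb, prompt_emb)

# Toy usage: the stand-in decoder just thresholds the image embedding.
image = np.ones((8, 8))
mask = sam_forward(image, (3, 4),
                   image_encoder=lambda im: im,
                   prompt_encoder=lambda p: np.array(p),
                   mask_decoder=lambda e, p: (e > 0).astype(float))
```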
[00:56:51] And the decoder looks very similar to the segmentation decoders that you've already seen in this course. So overall, this is what the model looks like. Given an image, we encode that image using an image encoder, and then you have a bunch of different prompts going through and interacting with this image encoding through a decoder, and you output a mask. Right, so this is the overall architecture design. Now there's one big thing that is a problem with segmentation. Let's say a user does point at this particular location and says, hey, I want a segmentation mask for this location. The problem is that it's still ambiguous: even with a point, it's still not sufficient, because that point might be referring to the entire pair of scissors.
[00:57:37] It might only be referring to the parts that you can hold, or it can be referring to one of the parts that you can hold. So this ambiguity is really difficult to resolve, and you don't want to penalize the model for picking the wrong one. So what the SAM architecture does is, instead of outputting one segmentation mask, it actually outputs three segmentation masks at different levels of granularity, and then it picks the one that is the closest match to ground truth and uses that to calculate the loss, thereby not penalizing the other ones. So the hope is that over time this model will learn to output all different kinds of masks, and then the user gets to choose which one is the most appropriate for their use case. Okay. And if you put all of this together, the only thing you need now is data.
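The pick-the-closest-mask trick described above can be sketched as follows. IoU is used as the matching score here for concreteness; only the best-matching mask's loss is returned, so the other two predictions go unpenalized:

```python
import numpy as np

def multimask_loss(pred_masks, gt):
    """SAM-style ambiguity handling, as a sketch: score each of the
    predicted masks against ground truth with IoU, and return only the
    loss (1 - IoU) of the best match, plus which mask it was.
    pred_masks: (3, H, W) probability maps; gt: (H, W) boolean."""
    losses = []
    for m in pred_masks:
        b = m > 0.5                         # threshold probabilities
        inter = np.logical_and(b, gt).sum()
        union = np.logical_or(b, gt).sum()
        iou = inter / union if union else 1.0
        losses.append(1.0 - iou)
    best = int(np.argmin(losses))
    return losses[best], best
```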
[00:58:24] You need a lot of data across many different categories to really make this model possible. Now, the problem with data is that until this model came out in 2023, which was about a year and a half, maybe two years ago, most of the segmentation data sets were extremely small. And what the authors of this paper did is grow the amount of segmentation data that was out there: the number of images by about 6x and the number of segmentation masks by about 400x. So they significantly grew and collected a lot of masks to make this model as performant as possible. So again, the message is very similar to the message we had with Flamingo and the message we had with Molmo. And the message is: you need really good, high-quality data to get these models to be as performant as possible.
[00:59:09] And for a lot of vision tasks, the data is completely missing from the internet, and you need to go out and find and collect that data to get these models to work very well. Okay. And so to make this data happen, they created this in-the-loop process where they initially had some amount of data annotated. From that annotation, they created a training data set. They trained a model, and they used that model to annotate more data, and then they iteratively refined the model-generated segments using users and continued this process. So they had this human-in-the-loop, model-in-the-loop process of proposing segments and then fixing the segments using human annotators. This is what an example image looks like from their data set. You have quite a lot of categories; each individual vegetable here is annotated with its own mask. So they're quite expensive to collect.
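The annotate, train, re-annotate loop described above might be sketched like this. Both `annotate_fn` (the human-corrected model proposal step) and `train_fn` are hypothetical stand-ins; the sketch only captures the shape of the data engine, not SAM's actual pipeline:

```python
def data_engine(images, annotate_fn, train_fn, rounds=3):
    """Model-in-the-loop data collection, as a sketch: each round, the
    current model proposes masks that humans correct (annotate_fn), the
    corrected masks grow the training set, and the model is retrained so
    its proposals improve next round. model is None in the first round."""
    dataset, model = [], None
    for _ in range(rounds):
        for img in images:
            dataset.append((img, annotate_fn(model, img)))
        model = train_fn(dataset)
    return model, dataset
```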
[01:00:00] And they did this across a lot of images, millions of images. Here's another example where again all the individual umbrellas are annotated. Here's another one with an underwater scene. And of course paintings as well; they have segmentations of paintings. And so all of this together is really what was foundational to making this foundation model for segmentation. Okay. So that's for segment anything. And I want to use the last couple of minutes that I have left today to really focus on chaining, which is the last part of multimodal language models. The idea behind chaining is something you've already seen; I've given you hints throughout this lecture. And the idea is to combine different models together to enable things that a single model can't do alone. Here's a fun little exercise we can do as a class.
[01:00:53] So, I'm giving you four images and I'm also giving you four categories, right? And these are potentially categories that some of you have never seen before. And they're also categories that CLIP hasn't seen. And so CLIP actually fails at these categories, because it doesn't have any idea which one is associated with what. Does anyone here know which one is what? Marimo. Yeah, the second one is the marimo. That's right. Yeah. There's one that's a little easy. The viaduct, right? Yeah, I think a lot of you know which one that is. But yeah, which one's the dog and which one's the bird? What if I gave you these instead? Now I'm giving you descriptions of these things, and it suddenly becomes very easy for you to associate each one with the right category, right?
[01:01:36] And that's the basic idea behind chaining: even if CLIP has never seen these images, chances are these concepts have been talked about on the internet to some degree, and it's likely that GPT might be able to describe them. And if GPT can create those descriptions, those descriptions become really good ways of classifying exactly which category is which. And that's the idea behind chaining: you take the strengths of one model and you combine them with the capabilities of another, and suddenly you get all kinds of new capabilities that you didn't have before. And so you can take a ton of categories that have no training data in CLIP, but if you can describe all of them, then because CLIP has seen a lot of descriptions for things, it can now start classifying all of them very well.
[01:02:19] And you can start getting CLIP to generate classifications for individual flowers or individual cars or individual spaces or even different kinds of pets. And you start seeing these improvements on a bunch of different data sets that are about more fine-grained, specialized categories. And the only way it's able to do that is because GPT has ingested some ability to describe those things. And this idea of being able to generalize to new capabilities is something that was extremely popular last year, and it still remains very popular this year, through this idea of chaining for any sort of question at all.
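The describe-then-classify chaining just discussed can be sketched as follows. The `embed_text` function below stands in for a CLIP-style text encoder, and the per-class descriptions stand in for GPT's output; the image is assigned to the class whose descriptions match it best on average:

```python
import numpy as np

def classify_by_description(image_emb, class_descriptions, embed_text):
    """Chaining sketch: GPT-written descriptions per class are embedded
    with a CLIP-style text encoder (embed_text, assumed), and the image
    goes to the class with the highest mean cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {
        name: np.mean([cos(image_emb, embed_text(d)) for d in descs])
        for name, descs in class_descriptions.items()
    }
    return max(scores, key=scores.get)
```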
[01:02:59] So for example, if I asked you, are there three people in the boat, the way you might want to do this is by again asking a multimodal language model to answer this question. Or what you could do is use all of the hundreds of specialized vision models that we've been developing over the last few decades. So there are object detection models that you learned about in class. If you use an object detector, you'd be able to get a detection for each of the three people, and then you could just say, oh, there are three people, because there are three detections, right? So that's the general idea: you can chain other models' outputs together so that you can do new kinds of capabilities. Here's another example. If I ask you how many total people there are across these two boats, is it six? And again, you can do the same thing.
[01:03:44] You can write a program that does object detection on image one and then object detection on image two, and then adds up all of those components together. Right? So this is the basic idea behind what we now call chaining, and this was popularized by a paper that won the best paper award last year, called VisProg. And in this VisProg paper, the visual programming paper, the idea was that you take any image or any sort of question and you generate a program. You generate a program that says: answer something about image one, answer something about image two, and then combine those answers together to give you the final answer. Right? So you write a function in Python, and then in that Python function you have individual calls to other models that we've already seen in training. Right?
[01:04:32] So for example, when asking this particular question, deciding if the statement "the left and right image contains a total of six people and two boats" is true or not, you can ask GPT to actually create a program that tries to answer this question, and then you can take the answer from its program. Okay. And you can also give GPT in-context examples, where you show it examples of programs that it can generate using other functions. And we can see that it generalizes to new sorts of questions and starts using all of the functionality it has available. Of course, the one thing you need to do is give it the functions themselves. So you need to tell it: hey, you have these capabilities from other models that you can use. You can localize things using an object detector. You can localize faces using a face detector.
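A program of the sort just described, for the six-people-two-boats question, might look like this. The detector is a stub (each "image" is just a dict of category to boxes) so the sketch is self-contained; a VisProg-style system would generate the `program` function and call a real detector:

```python
def detect(image, category):
    """Stand-in for an object detector: returns the detections (boxes)
    for one category. Here each 'image' is a dict of category -> boxes."""
    return image.get(category, [])

def program(left, right):
    """The kind of Python program a visual-programming system might
    generate for: 'the left and right image contain a total of six
    people and two boats' — count detections in each image and compare."""
    people = len(detect(left, "person")) + len(detect(right, "person"))
    boats = len(detect(left, "boat")) + len(detect(right, "boat"))
    return people == 6 and boats == 2
```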
[01:05:21] And you can have all of these different capabilities across many different other models that people have created, and you can chain them together to do different kinds of tasks. Yeah. So there's two different ways of doing it. One is a static way, where you want to give it examples that are as diverse as possible and then hope that it generalizes. The other is to dynamically choose: given this question, what are the best in-context examples I should use? And so you can treat that as another retrieval process, where you retrieve the best examples and then ask it to generate a program, and that tends to perform a lot better, but only if you have a good retrieval system. Yes, it would require a lot of compute.
[01:06:02] There is compute in terms of calling GPT, which you have to do through an API, and then you have to load each of these individual models into memory and run them sequentially. So it can actually be a lot more costly. What people are trying to do is figure out whether we can distill these capabilities into a single model, and that's a big part of what research looks like today in 2025. But of course, people are also still trying to figure out how to chain these things together effectively. Very well. Yeah, you can think of it as an agent. You have an agent that is basically deciding: given this question, which other models do I need help from, and how do I stitch them together to enable new kinds of capabilities? So that's what it looks like. Yeah.
[01:06:42] Here's another example, where you might want to do image editing. You might care about replacing the sand, the desert, with lush green grass. Of course, image editing models are still in their infancy. So what you might want to do instead is call a segmentation model, identify the desert, replace only the desert pixels with grass, and then composite them together to make a new image. Okay, that's all of the things I wanted to talk about in terms of different capabilities. So you've seen some ways to think about foundation models. At the end of the day, it really is the ability to train a model for a single task and then generalize from that single task to many different downstream applications.
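The segment-and-composite editing described above reduces to a per-pixel blend: keep the original everywhere except where the mask fires, and paste replacement pixels there. A toy sketch with 1-D "images" standing in for real H×W×3 arrays:

```python
# composite = mask * replacement + (1 - mask) * original, per pixel
original = [10, 10, 80, 80, 10]   # pretend the 80s are desert-colored pixels
grass    = [30, 30, 30, 30, 30]   # pretend output of a generative model
mask     = [0, 0, 1, 1, 0]        # 1 where a segmentation model found desert

edited = [g if m else o for o, g, m in zip(original, grass, mask)]
print(edited)   # → [10, 10, 30, 30, 10]
```

Only the masked region changes, which is exactly why chaining a segmenter with a generator is more controllable than asking one editing model to redraw the whole image.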
[01:07:28] And we talked about, in classification, how you can create these models by taking a lot of image-text pairs from the internet and training on them to do different kinds of tasks. That allows you to generalize to new kinds of datasets that might not even exist in the real world, or that you might not have any labels for. You can also combine them with language models and train them to handle in-context examples, like captioning or counting or OCR. These are, again, capabilities that enable many different applications. And of course, the outputs don't always have to be language or categories.
[01:08:00] There can also be segmentation masks, where you produce different kinds of masks depending on different user inputs. And you can generalize this even further by combining many of these foundation models, or even smaller models, together through programs, and do all kinds of new things. So, hallucinations still happen all across the board. What we're showing is that pointing does seem to reduce hallucinations quite a bit, because the model needs to find some evidence for its generations. But that being said, there's no guarantee that it's going to point to the right thing at all. There are many different ways of fixing this. One, of course, is collecting more data related to the kinds of reasoning you want it to do.
[01:08:41] But a better one is to have verification methods that verify, based on the points, whether the output is something you should trust or not. What a lot of the bigger models and bigger companies typically do, when you use any of their models, is that they don't have a single model generating the output. The output is usually passed through other verifiers before it even gets to the user. That mitigates some of these problems. But it is an active line of inquiry right now: how do you reduce hallucinations and also improve these models' actual accuracy? Yeah. So, repeating your question: is it possible for these models to build new tools when a capability requires a tool it doesn't have? Yes, it can.
[01:09:26] We have a few preliminary experiments in those directions as well, where you can tell a model: here's a capability that I want, and you can build a system that automatically tries to collect training data and builds a tool for specific use cases. But that line of work is still, again, in its infancy. It's one that we're actively working on, and a lot of folks are excited about that direction.

================================================================================
LECTURE 017
================================================================================
Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 17: Robot Learning
Source: https://www.youtube.com/watch?v=XSfmOH_xVSU
---
Transcript

[00:00:05] We're here with our final guest lecture for the course. And today we have Dr. Yunju Lee. He is an assistant professor of computer science at Columbia University, where he leads the Robotic Perception, Interaction and Learning lab. He is also a former instructor of CS231N, like all of our guest lecturers.
[00:00:23] And he taught the course in 2023, while he completed his postdoc here at Stanford with Professors Fei-Fei Li and Jiajun Wu. His research lies at the intersection of robotics, computer vision, and machine learning; specifically, his work focuses on robot learning and aims to significantly expand robots' perception and physical interaction capabilities. In today's lecture, he'll be discussing exactly that topic, robot learning. And I'll now hand it off to Yunju for today's lecture. Yeah, thank you, Z, for the very kind introduction.
[00:00:56] I'm super excited to be here. The last time I was here giving lectures was two years ago, in 2023. Lately I was going through many of the lectures, and today I'm going to talk about some of the things I have been working on. It's also a very coherent piece of the overall picture of deep learning for computer vision, specifically robot learning. So I'll be discussing some of the interesting considerations, especially in enabling robots to better perceive and interact with the physical world, and how some of those considerations might differ from typical computer vision tasks and methods. So first of all, you have already learned a lot about supervised learning. The setup for supervised learning is that you have data X and Y.
[00:01:48] X is the input and Y is the label, and you are trying to learn a mapping from the input X to the output Y. There are examples you have already learned: classification, regression, object detection, etc. And you have also learned about self-supervised learning, where instead of having labels, you just have the data without any labels. What you try to do is come up with learning algorithms that can extract or identify the underlying hidden structure of the data, by designing some auxiliary loss. A typical example is the autoencoder. There are many other examples of doing unsupervised or self-supervised learning on top of this massive amount of unlabeled data.
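The auxiliary-loss idea can be made concrete with the smallest possible autoencoder: there are no labels Y, so the training target is the input itself, and the loss is reconstruction error. This is an illustrative sketch with scalar data and linear weights, not a realistic image autoencoder.

```python
import random

random.seed(0)
data = [random.uniform(-1.0, 1.0) for _ in range(100)]  # unlabeled X, no Y anywhere

w_enc, w_dec = 0.1, 0.1   # 1-parameter "encoder" and "decoder"
lr = 0.05

for epoch in range(200):
    for x in data:
        h = w_enc * x          # latent code
        x_hat = w_dec * h      # reconstruction
        err = x_hat - x        # gradient of 0.5 * (x_hat - x)^2 w.r.t. x_hat
        g_dec = err * h        # chain rule through the decoder
        g_enc = err * w_dec * x  # chain rule through the encoder
        w_dec -= lr * g_dec
        w_enc -= lr * g_enc

# Minimizing reconstruction error drives encoder * decoder toward the identity.
print(round(w_enc * w_dec, 2))   # → 1.0
```

The point of the sketch: the "supervision" was manufactured from the data itself, which is exactly what an auxiliary loss does at scale.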
[00:02:37] And the special, unique thing about robot learning is that robots have to make physical interactions with the world. So it's not just that you have inputs and outputs and a mapping from the input X to Y, or to some kind of latent representation. It's really that you are influencing the evolution of the environment. No matter what action you decide to take in the real world, the world will change as a result of that action, and the world will give you some kind of new observation, or a reward telling you how the environment has been changing and how good you are at executing a certain task.
[00:03:13] So the goal is to come up with a sequence of actions, using feedback from the environment, that maximizes some reward or minimizes some cost. And robot learning, especially in recent years, has attracted significant attention both within academia and within industry. We have seen many startup companies, including for example Physical Intelligence, the Tesla bots, or Figure, producing seemingly very nice and fancy videos of robots doing a wide range of very complicated tasks: folding shirts, manipulating coffee beans, or having humanoids do interesting tasks in the real physical world. So this field, as I mentioned, has attracted a lot of attention and also a lot of investment.
[00:04:08] Here are just some examples of recent startups in the field of robot learning that have been able to attract huge amounts of investment, trying to build general-purpose robots that can make physical interactions with the environment. And obviously, not only those startups; many big, established companies also have their own robotics investigations and initiatives, trying to develop their own general-purpose robots capable of general-purpose, high-performance physical interaction with the environment. So for today's lecture, I'm going to give you an overview of some of the key techniques and enabling factors behind the current success and boom of robot learning. We will start with the problem formulation.
[00:04:54] So how can we more concretely define the problems we have been describing, and how can we formally think about the robot's interactions with the environment? I will then discuss the perception side: the different considerations between how robots perceive the environment and what people typically consider in the computer vision community, and what's special about robot perception. Then I'll talk about reinforcement learning, model learning, model-based planning, and imitation learning, along with some of the recent trends in robotic foundation models, and use the remaining time to discuss some of the challenges that still lie ahead of us. So let's start with problem formulation. This is, in general, what the problem looks like, at least in a graphical illustration. In the middle we have this agent.
[00:05:48] The agent is given some task objective. This task objective could be, for example, language instructions from a human, or some kind of objective function measuring how good this agent is at doing some specific task. The agent takes in states from the physical world, or some kind of environment, and decides what action to take, this a_t here, which is executed in the physical world. The physical world is updated and gives the agent the state s_{t+1} as well as a reward telling the agent how well it is doing its task. So this is what the framework looks like in general. You have to be very clear that this type of formulation, consisting of goal, states, actions, and rewards, which specifically defines robot learning scenarios, is very different from computer vision.
[00:06:45] I would like to say that computer vision is mostly about trying to learn some kind of representation of the environment from high-dimensional input data. But robotics is basically about solving an optimization problem: you have constraints, which are the physics of the environment; you have your objective function, defined over your goal; and you are essentially trying to solve this optimization problem by coming up with a sequence of actions that maximizes or minimizes your objective function. That's a key difference between robot learning and what people typically consider in computer vision.
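The agent-environment loop just formulated can be sketched in code. The environment here is a crudely simplified pole-balancing toy (symplectic-Euler dynamics with made-up constants) and the policy is a hand-written proportional controller, so everything inside it is an illustrative assumption, not a real simulator or a learned policy. What matters is the shape of the loop: the agent picks a_t from s_t, the world returns s_{t+1} and r_t.

```python
import math

class ToyCartPole:
    """Toy pole balancer: +1 reward per step while the pole stays near upright."""
    def __init__(self):
        self.theta, self.theta_dot = 0.05, 0.0   # pole angle (rad), angular velocity

    def step(self, force):
        # Simplified dynamics: gravity tips the pole, the applied force rights it.
        self.theta_dot += 0.02 * (9.8 * math.sin(self.theta) - force)
        self.theta += 0.02 * self.theta_dot
        reward = 1.0 if abs(self.theta) < 0.21 else 0.0   # upright => reward 1
        done = reward == 0.0                               # pole fell => episode over
        return self.theta, reward, done

def policy(theta):
    # Hand-written controller: push in the direction the pole is leaning.
    return 20.0 * theta

env = ToyCartPole()
theta, total_reward, done = env.theta, 0.0, False
for t in range(200):                        # the sequential decision-making loop
    action = policy(theta)                  # agent chooses a_t from s_t
    theta, reward, done = env.step(action)  # world returns s_{t+1} and r_t
    total_reward += reward
    if done:
        break

print(total_reward)   # → 200.0
```

With `force = 0` the pole would tip over and the episode would end early with far less reward, which is the sense in which the agent's actions shape the future states it sees.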
[00:07:24] Some specific instantiations of this problem: for example, in CartPole the goal can be to balance the pole on top of a movable cart. The state of this environment essentially describes the physical status of the system, which can include the pole angle, angular speed, cart position, horizontal velocity, etc. The action could be the horizontal force applied to the cart, and you could have a reward of one indicating, at each time step, that the pole is being kept in an upright position. Some other examples include robot locomotion, where the goal is to make the robot move forward. The state could include the angle, position, and velocity of all joints of the robot; the action could be the torque applied to each joint; and the reward can be one at each time step
that the robot makes a step forward while staying in an upright position. Another interesting example is Atari games. The goal can be to complete the game with as high a score as you can get. The state will be the raw pixel input of the game screen; the action could be the game controls, up, down, left, and right; and the reward could be the score increase or decrease at each time step. And some of the more famous examples you have probably noticed earlier, especially with the development of AlphaGo: the problem of Go can also be defined in a similar way, where the goal is to win the game.
[00:09:04] The state will be all the pieces that are currently on the Go board, and the action could be where to put the next piece down on this board. The reward comes on the last turn: if you win, you get a reward of one, and if you lose, you get a reward of zero. And this does not only apply to gaming domains. Even with the recent development of large language models, you can think about such problems, especially sequential generation problems, in a similar manner. The goal could be to predict the next word; the state could be the current words in the sentence; and the action will be the specific next word you want to put there. If it is correct, you get a reward; if it is incorrect, you get a reward of zero. And similarly, you have probably already played with many of the chatbots quite a lot. And
And we can also define a problem like [00:09:58] we can also define a problem like similarly where the goal is to be a good [00:10:01] similarly where the goal is to be a good companions to the human user. The states [00:10:04] companions to the human user. The states could be the current confi conversation [00:10:07] could be the current confi conversation and action that should be generated by [00:10:09] and action that should be generated by the chatbots will be the next sentence [00:10:12] the chatbots will be the next sentence you are given to the human user and [00:10:15] you are given to the human user and according to the human evaluations we [00:10:17] according to the human evaluations we could define the reward. If the person [00:10:19] could define the reward. If the person is happy like if it if they are [00:10:20] is happy like if it if they are satisfied you get a rewards of one and [00:10:23] satisfied you get a rewards of one and if uh you are not happy or neutral you [00:10:25] if uh you are not happy or neutral you get some other rewards and more [00:10:28] get some other rewards and more specifically for example in the robotics [00:10:30] specifically for example in the robotics domain and the task could be to fold the [00:10:33] domain and the task could be to fold the clothes and one clothes be folded nicely [00:10:36] clothes and one clothes be folded nicely in the states is the current [00:10:38] in the states is the current observations the robot is getting from [00:10:40] observations the robot is getting from this environment which could including [00:10:43] this environment which could including the multiv- view RGB or RGBD [00:10:46] the multiv- view RGB or RGBD observations of the environment And the [00:10:48] observations of the environment And the robots needs to decide its actions like [00:10:50] robots needs to decide its actions like how to move it in factors. Should it [00:10:52] how to move it in factors. 
[00:10:55] Should it close or open its grippers in order to manipulate the clothes? And according to human evaluations, if the clothes are properly folded, you give the robot a reward of one, and if the clothes are not folded, you give a reward of zero. So here is actually how you want to think more concretely about the robot learning problem. It really is a way of allowing the agent to interact with the world that considers the effect of an action, and also a sequential decision-making problem that is different from what people typically consider in computer vision, where we just need to predict the outputs. The goal, states, actions, rewards, and objective functions are the things you need to keep in mind whenever you are thinking about problems along this direction. So this is about problem formulation.
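The goal/state/action/reward formulation described above can be sketched as a small data structure. This is a minimal illustration only; the `Transition` type and the toy episode are invented for this sketch, not taken from the lecture:

```python
from typing import Any, NamedTuple

class Transition(NamedTuple):
    """One step of a sequential decision problem: (state, action, reward, next state)."""
    state: Any       # e.g. board position, current conversation, RGB-D observation
    action: Any      # e.g. next move, next sentence, gripper command
    reward: float    # e.g. 1.0 for a win or a satisfied user, 0.0 otherwise
    next_state: Any

# Toy win/lose game: the reward arrives only on the final turn.
episode = [
    Transition("board_0", "move_a", 0.0, "board_1"),
    Transition("board_1", "move_b", 0.0, "board_2"),
    Transition("board_2", "move_c", 1.0, "terminal"),  # win on the last turn
]
total_reward = sum(t.reward for t in episode)
```

The same four fields apply whether the "state" is a Go board, a conversation so far, or a robot's camera observation; only the contents of each field change.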
[00:11:44] So the question is how specific the reward needs to be. In many tasks the reward can have many different types of specifications. For example, in self-driving the reward could be to go as fast as possible, or the reward could be that we want the passengers to feel comfortable as you are driving along the road. Even for clothes folding, depending on the user's preference, clothes can be folded in many different ways. Some want the total area to be as small as possible; some want them to be as smooth as possible. There could be different types of rewards. Here I'm just talking in generic terms, like whether a person looking at the clothes thinks they are folded or not, but more specifically, in terms of reward design there is actually a lot of nuance in satisfying the specific needs of a specific application. Okay.
[00:12:32] So I'll continue. So this is how we are thinking about these robot learning problems that allow the agents to interact with the physical world. Now I'm moving on to robot perception, especially discussing how the perception problem within this robot learning domain is different from what people typically consider in computer vision. So this image again: you're actually going to see this image again and again through today's lecture. This is essentially the question of how we are handling whatever information you are getting from the physical world. The physical world can give you high-dimensional RGB or RGB-D observations. It could also include some other sensory data, like tactile sensing.
[00:13:18] And this robot perception problem is essentially trying to distill or harness some structured knowledge from that high-dimensional data that is useful for the robot to do the downstream decision-making. Essentially, the question we are trying to tackle is making sense of this unstructured real world, and the real world can be very messy. The observations the robots are getting from the environment may contain only incomplete knowledge of the objects and the environment. There could be occlusions. There could also be errors in the sensory data, and imperfect actions may also lead to failure. For example, the robot can try to grasp some object, but that grasping behavior may not always be successful. Sometimes it will accidentally drop the object, which will also cause unexpected changes to the environment.
[00:14:11] They will also need to have a perception system that is able to handle those scenarios, and also the environment can change. It is dynamic, and consists of not just rigid objects but deformable objects like clothes and other media. There could be other agents, like dogs or kids or other humans, that are also in the same environment messing with the world, and your perception system needs to be able to cope with all those kinds of changes. So that is why, in the robotics domain, people typically do not work with just camera data; they try to add as many sensors as possible to the robots, as long as they can provide some useful information.
[00:14:54] This includes, for example, tactile sensing, audio information, and depth information, and typically we will have to design a system that is able to put all the sensors together so that they can complement each other: audio information might tell you things about physical contacts, tactile information might tell you whether a grasp is stable or not, and camera information tells you something at a higher level, in the grand scheme of things, about the overall state of this environment. So how these sensors can be composed together and work together is very, very important to designing a capable robotic system that works in the real physical world.
[00:15:38] And besides the number of sensory modalities, a very important difference between robot vision and computer vision is really trying to understand the effect of an action and also the affordances of the environment. On the left is a very typical example you have already seen in computer vision, which is instance segmentation: what you are given is a 2D image, and you segment different instances from this 2D image by drawing contours over the 2D pixels. But what's different in the robotics domain? For example, on the right, the robot can be given one object, and this object seems to be maybe just one object, or maybe a lot of pieces that are, for example, stacked into each other. The robot has to know what types of actions will allow it to have a better understanding, better perceptions, of this environment.
[00:16:28] Is this one piece of object, or multiple pieces composed together? The robot should come up with actions, like perturbing and actively interacting with the environment, for the robot to get a better perception of the state of this environment. So that is why robot vision is embodied, active, and also environmentally situated. By embodied, what we mean is that robots have a physical body that is directly experiencing the physical world. Their actions are part of a dynamic with the world that has immediate feedback on their own sensations. And active means the robots are active perceivers: a robot knows why it wishes to sense, chooses what to perceive, and determines how, when, and where to achieve that perception. You can move your head around to know what's behind this table; you can just move around to see what's behind the table.
[00:17:21] So this is the active part, which is different from what people typically consider in computer vision, which mostly works with passively collected datasets. The third point is about being situated. The robots are situated in the world. They do not deal with abstract descriptions, but with the here and now of the world, directly influencing the behavior of the system. Robots really have to understand this, especially in closing the perception and action loop: a robot sees the world, understands its goals, and is able to act in the environment upon its perceptions. Sometimes the robot doesn't have to know the full state of the environment. For example, if I'm buttoning my shirt, I only have to know the local region near that button for me to button the shirt.
[00:18:09] So some of the perception has to be tightly coupled and co-designed with the task and the downstream decision-making systems, so the robot can focus on the relevant regions, or task-relevant regions, of the environment to properly close this perception and action loop. So this is about some very specific considerations and how robot perception might be different from what people typically consider in computer vision. I will start to discuss some of the algorithms that not only allow the robot to see but allow the robot to act in the world, and we will start with reinforcement learning. Remember, earlier we saw this image: the robot has to act upon this environment and get rewards from this environment.
[00:18:55] So one very typical way of trying to solve this optimization problem is to allow the robots to interact with the world as extensively and as massively as possible. You just collect all the experience data and do this type of trial and error, allowing the robots to understand that this action leads to higher reward and that action leads to lower reward, and we can pivot the agent's behaviors towards the actions that give the agent higher rewards. This is the general idea of reinforcement learning: it really is a way to allow the agents to constantly interact with the environment and do this trial and error to maximize the reward or minimize the cost. And here I also want to be a bit more specific in discussing the difference between reinforcement learning and supervised learning. So this is a typical framework of what reinforcement learning looks like.
[00:19:48] You have the environment. The environment gives the agent some state. The agent generates an action, and the environment gives the agent feedback, which is the reward, and the environment will change, giving the agent the new state s_{t+1}. It is essentially a sequence on the temporal domain where the agent has to make sequential decisions. And here is what a typical picture looks like for supervised learning: you have the dataset, the dataset provides the input X to the model, the model generates the prediction Y, and you are able to calculate the loss according to the model's predictions versus the ground truth from the dataset.
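The interaction loop just described (environment emits state s_t, agent emits action a_t, environment returns reward r_t and next state s_{t+1}) can be sketched roughly as follows. The tiny environment and the trivial policy here are invented purely for illustration:

```python
class ToyEnv:
    """Invented stand-in environment: action 1 always pays a reward of 1."""
    def reset(self):
        self.t = 0
        return self.t  # initial state s_0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0  # r_t depends on the action taken
        done = self.t >= 5                    # episode ends after 5 steps
        return self.t, reward, done           # (s_{t+1}, r_t, done)

env = ToyEnv()
state, done, total = env.reset(), False, 0.0
while not done:
    action = 1                                # a_t from a fixed (trivial) policy
    state, reward, done = env.step(action)    # environment answers with r_t, s_{t+1}
    total += reward                           # the agent accumulates reward over time
```

The `reset`/`step` shape mirrors the state-action-reward cycle in the lecture's diagram; real RL libraries use a similar interface.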
[00:20:33] So this is a typical setup of supervised learning, and some of the key differences between reinforcement learning and supervised learning are that the environment might be stochastic: for the same actions, the environment might change in a different manner. Let's say you are pushing a box forward; depending on the distribution of the supporting force, the same exact actions can potentially lead to the box rotating to different angles. Meaning there can be uncertainties and stochasticities in the environment that lead to stochastic behaviors of the environment, which will also give the agent stochastic rewards, where the same action may not always lead to the same reward. So this is very different from supervised learning: we are dealing with an uncertain dynamical system.
[00:21:26] The second is about the question of credit assignment. For supervised learning, you give the inputs, you predict the outputs, and you directly calculate the loss: you directly know what mistakes and what errors you are making with a specific prediction. But in the reinforcement learning, or sequential decision-making, domain, the rewards can be delayed. If you play the game of Go, only at the very end of the episode do you realize whether you are winning or losing, and that reward of one or zero might be because of some very, very early steps, maybe even the first step, or some steps during the middle of the game.
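One common way to handle this delayed-reward situation is to propagate the final reward back to earlier steps as a discounted return, G_t = r_t + γ·G_{t+1}. A minimal sketch, with the function name and the example reward sequence invented for illustration:

```python
def discounted_returns(rewards, gamma=0.99):
    """Propagate a (possibly delayed) reward back to earlier steps:
    G_t = r_t + gamma * G_{t+1}, computed by scanning the episode backwards."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# A Go-like episode: zero reward everywhere, 1.0 only on the final move.
returns = discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5)
# -> [0.125, 0.25, 0.5, 1.0]: earlier moves get an exponentially discounted
#    share of the final win, which is one simple form of credit assignment.
```

The discount factor γ controls how far back the final reward reaches; γ close to 1 spreads credit over long horizons.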
[00:22:09] So how to properly assign the credit you are getting along this sequential decision making to all of the actions is another very tricky and important question people hope to answer within reinforcement learning. The third thing is the non-differentiability of these dynamical systems. For supervised learning, you have the inputs, you feed the inputs through the model, you get outputs, and you calculate the loss, so everything along this process is differentiable: you can directly get gradients of the loss function with respect to the parameters within the model. But that's typically not the case for reinforcement learning, where the environment can often be non-differentiable. So how to properly gather gradients of the rewards with respect to the actions can be tricky.
[00:23:01] Sometimes people have to rely on massive sampling to do those types of zeroth-order estimations of the gradients in order to do proper learning. That is also another difference. The last difference is about the non-stationarity of these scenarios, where the evolution and the states of the environment are really a result of your actions. For supervised learning, no matter what you predict, it doesn't influence the other data points you are getting from the dataset. But your actions will influence the next states you are getting in this sequential decision-making problem. This is also what makes these kinds of reinforcement learning problems a little bit more nuanced than supervised learning. So here are some more specific examples.
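The idea of estimating a gradient by sampling alone can be made concrete with a small sketch. This uses an antithetic two-point estimator on a made-up scalar "reward"; in a real RL setting the perturbations would be applied to policy parameters, and the reward would come from environment rollouts:

```python
import random

def reward(theta):
    # Made-up black-box reward; the learner never differentiates through it.
    return -(theta - 3.0) ** 2

def zeroth_order_grad(theta, sigma=0.1, n_samples=100):
    """Estimate dR/dtheta purely from reward evaluations (no backprop):
    average of (R(theta + sigma*eps) - R(theta - sigma*eps)) * eps / (2*sigma),
    with eps drawn from a standard normal."""
    total = 0.0
    for _ in range(n_samples):
        eps = random.gauss(0.0, 1.0)
        total += (reward(theta + sigma * eps)
                  - reward(theta - sigma * eps)) * eps / (2 * sigma)
    return total / n_samples

random.seed(0)
theta = 0.0
for _ in range(300):                     # gradient ascent on the noisy estimate
    theta += 0.02 * zeroth_order_grad(theta)
# theta climbs toward the maximizer of the reward, near 3.0
```

The key point matches the lecture: only reward *values* are queried, never gradients of the environment, at the cost of many samples per update.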
[00:23:48] For example, playing these Atari games: as I mentioned earlier, the goal could be to complete the game with the highest score, the states will be the raw pixel inputs from the gaming screen, the actions could be up, down, left, and right from the keyboard, and the rewards are the score increases and decreases at each time step. Some typical algorithms within this domain lie in the field of, for example, Q-learning, or, for example, policy iteration. Here is one example of trying to learn this Q function.
[00:24:21] The Q function essentially measures the discounted expected future accumulated reward when you apply a specific action A at a specific state S. You will be able to get this Q function through interactions with the gaming environment, and after you have learned it, you can evaluate, for example, what Q values you get by applying different actions; in this case there are left and right, up and down. So there could potentially be four actions, and given these four actions, you can look at their Q values and just execute the action that gives you the highest Q value. So that is what allows you to do this type of decision making in this domain.
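A minimal tabular Q-learning sketch in the spirit of what the lecture describes (greedy action selection over learned Q values); the 1-D corridor environment and all constants here are invented for illustration, and Atari-scale versions replace the table with a deep network:

```python
import random
from collections import defaultdict

# Invented toy task: a corridor with states 0..4; reaching state 4 pays 1.
ACTIONS = [-1, +1]  # "left", "right"

def step(state, action):
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

Q = defaultdict(float)            # Q[(s, a)]: discounted expected future reward
alpha, gamma, eps = 0.5, 0.9, 0.3

random.seed(0)
for _ in range(300):              # episodes of trial and error
    s, done = 0, False
    for _ in range(10_000):       # safety cap on episode length
        # epsilon-greedy: mostly exploit current Q values, sometimes explore
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # Q-learning update toward the one-step bootstrapped target
        target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
        if done:
            break

# Greedy policy from the learned Q values: argmax over actions in each state.
greedy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(4)]
```

After training, the greedy policy reads the table exactly as the lecture describes: in each state, execute the action with the highest Q value.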
[00:25:07] So, because today we're going to cover a lot of material, we won't go into the details of reinforcement learning, but some of the current state-of-the-art reinforcement learning algorithms include SAC (soft actor-critic) and also PPO (proximal policy optimization). So if you are interested, you are very welcome to look at those algorithms in detail; there are a lot of open-source implementations and tutorials online. But here I want to highlight some of the results you could potentially get by going through this reinforcement learning, specifically Q-learning, process. This was developed by Google DeepMind, which trained an agent to play the game Breakout in this kind of Atari world.
[00:25:54] After just 10 minutes of training, the agent can already touch the ball, but it still misses the ball quite often. After some more training, for example two hours, the agent can control the paddle in a much more reliable and consistent fashion: it can catch the ball nearly every time and keeps collecting more and more reward by bouncing it back.
[00:26:32] And after about four hours of training, something interesting happens: the agent comes up with a novel strategy, which is possibly not known to many of you, of bouncing the ball back so as to carve a tunnel on the left side of the wall, and then pushing the ball along the upper side of the wall to clear those bricks very efficiently. This is the type of strategy that can be discovered by reinforcement learning. That's what's nice about reinforcement learning: you allow the agent to do very extensive and comprehensive exploration of and interaction with the world, and it is entirely possible for a reinforcement learning agent to discover strategies that are better than those of even the best human players. A very typical example is the game of Go.
[00:27:28] When AlphaGo came out in January 2016, it was about the time I was trying to decide what research direction I was going to take. Before then, I had just been working on deep learning for computer vision, but when AlphaGo came out I thought, I have to work on this kind of decision-making problem. That's why I started to touch upon reinforcement learning and imitation learning, all the way to what I do now, robot learning that lets robots physically interact with their environments. I wasn't satisfied with just working on passively collected datasets; we really wanted an agent that can actively interact with the environment. So the question was how, specifically, this Q function works.
[00:28:12] You can see that Q takes as input the state s and the action a, and θ here is essentially the parameters of this Q function, where Q is instantiated as a neural network. In this specific case, like I mentioned earlier, the state is the raw pixel input taken directly from the game screen, so the input could be four consecutive frames fed directly into the Q function. And if you're dealing with images, a very straightforward way of instantiating this Q function is to use a convolutional neural network.
[00:28:43] So you have convolutional layers, shown as the orange blocks, and then fully connected layers that directly produce the Q values. In this case there are four discrete actions; in Breakout it's probably just left and right, but let's say there are four discrete actions, up, down, left, and right. You then get a separate Q-value estimate for each specific action a, and that's how you can use these Q values to decide which action to take, namely the one that maximizes the Q value. Does that answer your question? Yes. This was when AlphaGo came out, and obviously since then there have been a lot of developments and evolution in making these game-playing agents better and better.
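The shape flow just described (four stacked frames in, convolutional layers, then fully connected layers out to one Q value per action) can be sketched in plain NumPy with untrained random weights. The frame size, filter count, and layer sizes here are made-up stand-ins; a real DQN-style network has more layers and learned weights:

```python
import numpy as np

def conv2d(x, w):
    """Naive 'valid' convolution: x is (C, H, W), w is (F, C, k, k),
    output is (F, H-k+1, W-k+1). Slow, but shows the shape flow."""
    F, C, k, _ = w.shape
    _, H, W = x.shape
    out = np.zeros((F, H - k + 1, W - k + 1))
    for f in range(F):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[f, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[f])
    return out

def q_network(frames, params):
    """Map 4 stacked grayscale frames (4, H, W) to one Q value per action."""
    h = np.maximum(conv2d(frames, params["conv_w"]), 0.0)   # conv + ReLU
    return h.reshape(-1) @ params["fc_w"] + params["fc_b"]  # FC head

rng = np.random.default_rng(0)
params = {
    "conv_w": rng.normal(scale=0.1, size=(8, 4, 3, 3)),    # 8 filters over the 4 frames
    "fc_w":   rng.normal(scale=0.1, size=(8 * 8 * 8, 4)),  # flattened conv output -> 4 actions
    "fc_b":   np.zeros(4),
}
frames = rng.normal(size=(4, 10, 10))  # stand-in for 4 consecutive game frames
q = q_network(frames, params)
action = int(np.argmax(q))             # greedy action from the Q values
```

Stacking four frames is what lets a feedforward Q network see motion (ball velocity) that a single frame cannot convey.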
[00:29:31] So then later there was AlphaGo Zero, which is essentially a simplified version of AlphaGo: it no longer uses any imitation learning for initialization, and it was able to beat the number-one player of that time, Ke Jie. This is actually one lesson people learned in the AI community, what you could call the bitter lesson, from Rich Sutton: sometimes you want to find the simplest recipe that is best compatible with scaling. You want to leverage the power of scale, and sometimes making the method simpler will actually give you better performance, by making it more compatible with whatever infrastructure you can use for scaling.
[00:30:18] Then later they developed AlphaZero, which generalizes the same set of algorithms beyond Go to other games like chess and shogi, and then they designed MuZero, which does not just do model-free reinforcement learning but learns a latent-space dynamics model to plan over, which gives even better performance. For this specific domain, especially game playing, these developments really empowered a lot of design work on how to build better, more sample-efficient, and more scalable reinforcement learning agents. And in November 2019, Lee Sedol, who had been beaten by AlphaGo, announced his retirement; he realized it was simply not possible at that point for any human player to beat the best Go AI
agents out there. [00:31:14] And obviously since then there have been other, more complex games like StarCraft and Dota, which show that as long as you put in enough compute, and as long as you have a very well-designed algorithm and the infrastructure to do the reinforcement learning, you can get very, very good performance even in games that are noticeably, by orders of magnitude, more complicated than the game of Go. So I would say that if you have a reasonably designed game, there is a very legitimate chance that, if you put in sufficient resources, you can build very, very powerful game-playing agents. And not just in games: people have also been developing reinforcement learning algorithms and agents that work directly in the real physical world.
[00:32:05] On the left is work from ETH, published in Science Robotics in 2020, that essentially changed my mind about how useful reinforcement learning can be for real physical robots. Before, it was mostly games, and you could argue that in games you can just spawn as many game instances as you like, and you train on the same game you test on. For robots there is always a sim-to-real gap: if you train in simulation, how much does that gap matter when the agent has to generalize to the real environment? This paper really convinced me that sometimes the sim-to-real gap just may not matter that much. We are not simulating the bushes, we are not simulating the snow, but an agent trained with reinforcement learning in simulation can give you very, very robust
performance in the real physical world, on snow and on very, very slippery surfaces. [00:33:00] On the right is a very recent video released by Unitree that shows another level of dexterity in locomotion: the same kind of sim-to-real transfer allows these robots to perform very dynamic behaviors and navigate very rough and challenging terrain. I would say the domain of robot locomotion is close to being a solved problem, and the solution to this problem is exactly reinforcement learning. So that's locomotion. The other domain is manipulation, where the robot has to manipulate objects in the real physical world.
[00:33:44] In 2019, when OpenAI was still working on robotics, they designed a system for dexterous manipulation of a Rubik's cube: they did reinforcement learning in simulation and sim-to-real transfer to let the robot solve the Rubik's cube. One caveat is that the success rate was very, very low. Although the video looks beautifully done, if you really look at the paper they only tested a very limited number of trials, and given that number the reliability is arguably not very satisfying. Still, since then people have been able to extend this dexterous manipulation work, allowing robots to do enhanced dexterous manipulation and reorientation of different types of objects into different target configurations,
all thanks to the development of reinforcement learning. [00:34:42] But you can see that the examples so far, in locomotion and in-hand manipulation, don't really solve the problem of, for example, a robot that can just fold the clothes or do the laundry for you in your home. Manipulation is still stuck in these very isolated domains and environments. So here are some of the key challenges and bottlenecks of existing model-free reinforcement learning. It mostly learns from trial and error with the environment, and it requires extensive interaction with the world.
[00:35:21] For example, AlphaGo Zero learned the equivalent of 3,000 years of human knowledge in 40 days, which is amazing, but that still amounts to many, many years' worth of computation for the agent to learn. In domains where there is a huge sim-to-real gap and you would have to do the reinforcement learning in the real physical world, that is a huge bottleneck for training reinforcement learning agents effectively. And of course, if there is a sim-to-real gap and you can only learn in the real environment, there are a lot of safety concerns. For example, here is the learning progression of an agent controlling a humanoid robot to move forward. Although at the very end the robot is able to move forward, during the learning process there are a lot of
very weird behaviors, [00:36:09] and you can totally imagine that if you deployed this agent on a real physical robot it would fail catastrophically. It also has very limited interpretability, and it is sometimes very hard to correct things when they go wrong. One interesting thing, if you really think about how humans learn to interact with the environment versus pure reinforcement learning, is that we humans have a very intuitive understanding of the environment: we can imagine how the environment is going to change if we apply a specific action. It is exactly this predictive capability that allows us humans to plan our behavior to achieve specific targets. And this predictive
capability is itself learned from humans' physical interaction and everyday experience with the real physical world. [00:36:55] So, going beyond reinforcement learning, the next topic I want to discuss is how we can endow robots with a similar capability to imagine the effects of their actions and to do model-based planning. For these locomotion examples we have a simulation; the simulation people typically use is essentially a bunch of rigid-body simulations where the robot just touches a polygon-type representation of the floor. It is not simulating the bushes; it is not simulating the snow. But what people do is randomize the simulated environment a lot.
[00:37:46] They randomize the friction, the geometry, and many other physical parameters inside the environment, under the assumption that whatever you encounter in the real physical world is just one data point within the distribution you randomized over in simulation. If your policy is robust at controlling the robot across that distribution, and the real world really is just one data point within it, then the policy can generalize. And so far, at least from the empirical evidence, this assumption actually holds, and the policies work very reliably and robustly in the real physical world. Now, the question was about what the actual command is. In many of the existing demos there is in fact a person providing high-level commands to the robot: for example, which direction should the robot walk?
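The randomize-everything recipe described here amounts to sampling a fresh set of physical parameters for every training episode. A minimal sketch follows; the parameter names and ranges are invented for illustration and are not taken from any specific simulator:

```python
import random

def sample_env_params(rng):
    """Domain randomization: draw one simulated world per training episode.
    Parameter names and ranges are illustrative, not from any simulator."""
    return {
        "friction":       rng.uniform(0.4, 1.2),   # ground friction coefficient
        "mass_scale":     rng.uniform(0.8, 1.2),   # per-link mass multiplier
        "terrain_height": rng.uniform(0.0, 0.08),  # bump height (m)
        "motor_delay":    rng.randint(0, 3),       # control latency in sim steps
    }

rng = random.Random(0)
# One randomized environment per episode; the hope is that the real world
# looks like just one more draw from this distribution.
episodes = [sample_env_params(rng) for _ in range(1000)]
frictions = [e["friction"] for e in episodes]
```

A policy trained to succeed across all of these draws has no way to overfit any single friction or mass value, which is exactly why the real world's particular values stop mattering.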
[00:38:31] Should the robot rotate in place, or just keep walking forward? Conditioned on that high-level action provided by the human, the robot has to decide the low-level actions, which are typically, for example, the joint torques applied to each and every joint on the robot. So that is how this typically looks: a human gives high-level commands, and conditioned on those the robot uses its policy to decide the low-level actions, instantiated as joint torques. Like I mentioned, the biggest lesson I learned from this line of work on locomotion is that the simulation doesn't have to be perfect: as long as you randomize enough, you can generalize very robustly to the real environment.
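For concreteness on what a "low-level action" looks like: while some policies output joint torques directly as the lecture says, many locomotion systems instead have the policy output target joint positions, which a simple PD loop converts into torques. This is a generic sketch of that conversion with illustrative gains, not the controller of any system mentioned here:

```python
def low_level_torques(q, qd, q_target, kp=40.0, kd=2.0):
    """PD position control: torque per joint from the gap between the
    commanded pose and the current pose. Gains here are illustrative."""
    return [kp * (t - p) - kd * v for p, v, t in zip(q, qd, q_target)]

q        = [0.1, -0.3, 0.0]   # current joint angles (rad)
qd       = [0.0,  0.5, 0.0]   # current joint velocities (rad/s)
q_target = [0.0,  0.0, 0.2]   # pose commanded by the high-level policy
tau = low_level_torques(q, qd, q_target)
```

Either way, the division of labor is the same: the human supplies a coarse command, the policy supplies a per-joint signal at a high rate.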
[00:39:19] But that lesson hasn't really carried over very well to the manipulation domain. In manipulation, how accurate the simulation needs to be, and how much the sim-to-real gap matters, is still a research question people hope to answer. I can give you one specific example. Suppose in simulation you are pushing a box forward. If in simulation the box rotates by 10° but in reality it rotates by 12°, that may not matter much. But if in the real world your grasp was successful, while in simulation the object just flies away because of some numerical issue, or the object slips out from between your fingers, that is problematic. So there are regimes where the sim-to-real gap matters, and there are other
regimes where the sim-to-real gap may not matter that much in the manipulation domain, [00:39:59] and people are still trying to understand how the sim-to-real gap arises and what the most important recipes and characteristics are for a simulation to support the most reliable sim-to-real transfer. So, if I understand your question correctly, you are asking: there is still a person providing high-level commands to the robot, so can the robot come up with better plans than a human? I can actually give you a more nuanced perspective. Although many of these videos look very nice, there is a human operator steering the robot and choosing which route to take. For example, what people typically do is, say there is some kind of rough terrain or a pile of rocks.
[00:40:45] The human can command the robot to go forward and try to climb those rocks. If that fails, the human can provide some other high-level command to get around the pile of rocks. So there can also be some learning on the human side, in understanding the capabilities of those robots. This is also why some of these videos can look very nice: the human selects the routes that the human knows will show the limits and the capabilities of these low-level controllers. How to do that autonomously is actually a very interesting question people are also doing research on. [00:41:22] Mhm. So then I'm going to continue.
[00:41:27] So I have discussed some of the successful examples and the power of reinforcement learning, and I also discussed its limitations: we still haven't seen very successful, wide-scale deployments of reinforcement learning in manipulation yet. And we humans don't just learn from trial and error; we actually build internal models. So we are asking the question: can we learn models from the robot's interactions with the environment, and use that model for the robot to do better physical interaction? Specifically, what we are touching upon, again back to this figure, is how we can learn approximations of the real physical world, and how this approximated physical world, running in the virtual domain, can help guide the robot's actions and decide what action to take in the real physical world.
[00:42:16] So let's say you already have the model. Say you have already learned a model, like the one we humans have in our mental environment: given the current state s_t and the action a_t, we can predict how the state of the environment will change into the next state s_{t+1}. This is essentially a forward model: given the current state and action, predict the next state. Then the problem of planning is essentially the inverse of this forward model: given the current state and the target state, come up with the actions that allow the robot to reach the target state from the current state, shown as the blue dots. We have a target here in red. We can have some initial guess of what the actions might look like.
[00:43:05] And our approximated learned model will be able to predict the sequence of state evolutions, shown as this green trajectory. Then we can measure the distance between the green dots and the red dots, and backpropagate, optimizing using the gradients of that distance with respect to all the actions along the trajectory, in order to know which actions can get us closer to the target shown in red. And obviously the model may not be accurate enough, so we typically only execute the first action, obtain the new state from the environment, and re-optimize the action sequence using gradient descent or any other optimization technique for this trajectory optimization.
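The receding-horizon loop described here (optimize the whole action sequence against the forward model, execute only the first action, then re-plan from the new state) can be sketched as follows. This is a toy illustration, not the lecture's actual system: a simple integrator dynamics s_{t+1} = s_t + a_t stands in for the learned neural model, and for that choice the gradient of the final-state cost with respect to every action happens to be available in closed form.

```python
import numpy as np

def forward_model(s, a):
    # Toy stand-in for a learned neural dynamics model:
    # a simple integrator, s_{t+1} = s_t + a_t.
    return s + a

def rollout(s0, actions):
    # Predict the final state by applying the action sequence.
    s = s0
    for a in actions:
        s = forward_model(s, a)
    return s

def plan(s0, target, horizon=5, iters=50, lr=0.1):
    # Gradient-based trajectory optimization: minimize
    # ||s_H - target||^2 with respect to the action sequence.
    actions = np.zeros((horizon, s0.shape[0]))
    for _ in range(iters):
        s_H = rollout(s0, actions)
        # For the integrator dynamics, d s_H / d a_t = I, so the
        # cost gradient w.r.t. every action is 2 (s_H - target).
        grad = 2.0 * (s_H - target)
        actions -= lr * grad  # broadcasts over all timesteps
    return actions

def mpc(s0, target, steps=10):
    # Receding horizon: optimize, execute only the first action,
    # observe the new state, and re-plan.
    s = s0
    for _ in range(steps):
        actions = plan(s, target)
        s = forward_model(s, actions[0])  # the "real" environment step
    return s

s = mpc(np.array([0.0, 0.0]), np.array([1.0, -2.0]))
```

Each outer iteration closes a fraction of the remaining distance to the target, which is why re-planning after every executed action tolerates model error.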
[00:43:56] And one of the key benefits, especially recently with the development of GPUs and neural dynamics models, is that you can use a GPU for parallel, simultaneous sampling, which allows you to do large-scale sampling and optimization of those action sequences quite efficiently. [00:44:12] So given this general framework, you have the model, which is this forward process, and you can always use the forward model to do this inverse optimization to come up with the actions that get you closer to your target configuration. And one of the key questions has always been what the right and most effective state representation is, and how we can learn this model based on that state representation.
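The parallel sampling of action sequences mentioned above can be sketched as a batched random-shooting planner. Here plain NumPy vectorization stands in for GPU tensor ops (the same batched code maps directly onto a GPU framework), and the same toy integrator dynamics replaces the learned model.

```python
import numpy as np

def batched_rollout(s0, action_seqs):
    # action_seqs: (N, H, D) -- N candidate action sequences scored
    # "in parallel" as one batched tensor op. For the toy integrator
    # dynamics, the final state is just s0 plus the summed actions.
    return s0 + action_seqs.sum(axis=1)  # final states, shape (N, D)

def sample_plan(s0, target, n=1024, horizon=5, noise=0.5, seed=0):
    # Random shooting: sample N action sequences, evaluate all of
    # them with the forward model at once, keep the best.
    rng = np.random.default_rng(seed)
    cand = rng.normal(0.0, noise, size=(n, horizon, s0.shape[0]))
    final = batched_rollout(s0, cand)
    costs = np.linalg.norm(final - target, axis=1)
    return cand[np.argmin(costs)]

best = sample_plan(np.array([0.0, 0.0]), np.array([1.0, -2.0]))
```

Because the candidates are scored in one batched call rather than one rollout at a time, sample count scales with available parallel compute, which is the efficiency point made in the lecture.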
[00:44:40] And over the years there have been many investigations into choosing different types of state representations. Some earlier work includes using just 2D images as the representation of the state and trying to learn pixel dynamics, meaning how the image will change if you apply a specific action. This is a line of work called deep visual foresight, which set up some of the initial work in the whole domain of world models. And by learning these pixel-based dynamics models, people can come up with strategies that, for example, minimize the distance between the current observation and the target; they can rotate objects and push objects around in order to achieve the target shown in green, with the current state shown in red.
[00:45:27] So that is pixel dynamics. What people can also do is use keypoints as a representation of the environment and learn keypoint dynamics models. Here what we can do is track the movement of the keypoints on top of this box through 3D space, and learn a neural dynamics model of those keypoints as a result of pushing actions. Then the robot can use this forward predictive model to plan its behavior to track specific trajectories, in order to push this box into a target configuration. [00:46:00] So besides keypoints, what if you encounter objects with even higher degrees of freedom?
[00:46:11] If you go one level finer, you can also represent those objects using a set of particles, essentially a set of points. This is actually work that was done while I was here as a postdoc, where we represented a pile of granular pieces using a bunch of particles and tried to predict how those particles would move around if you applied a specific action, and this forward model allows the robot to do the inverse decision-making.
[00:46:41] It handles a wide range of granular objects of different granular sizes, and we come up with strategies that gather those pieces into the target region, shown in the bottom-right corner of each segment. The same model, with good feedback from the environment, allows the robot to correct for the model's errors and come up with strategies that very reliably aggregate all the object pieces into the target region. And this model not only generalizes to different granular pieces of different sizes; you can also change to different target configurations. Here you will very quickly realize what the target configurations are: the robot has to come up with a strategy to do non-trivial redistribution of the granular pieces.
[00:47:30] And after the redistribution, it has to align the fine-grained details with the target shape in order to accomplish this pile rearrangement task. The task here is actually to rearrange the granular pieces into different letter shapes, all the way from letter A to letter Z. With this kind of forward model, we are very successful in coming up with a sequence of strategies, of course with feedback from the environment, to allow the robot to rearrange the object pieces into the target regions. And this is actually a highly non-trivial task. Going beyond that, we also have a subsequent work, which I was also involved in and which was done while I was here at Stanford, where we designed a dumpling-making robot equipped with 15 different 3D-printed tools.
[00:48:20] We have four RGB-D cameras looking at the environment to reconstruct the geometry of the dough, and the robot has to decide what tool to use and what action to take in order to turn this dough into a dumpling. The key enabling factor is again this forward predictive model, represented using particles. Here the red dots represent the shape of the tool and the blue dots represent the shape of the object. The first row is our model's open-loop prediction and the second row is what actually happens in the real environment. So this learned model, which learns directly from real-world interactions, can accurately predict the change in the shape of the dough when using different tools and applying different actions, and this allows us to have an integrated system that can make a dumpling out of a dough.
[00:49:11] What's interesting about this video is that there's a person constantly perturbing the robot from doing its job. The robot takes the real-time visual feedback from the environment to understand the shape of the dough in real time, and then, using the current observation and the learned dynamics model that predicts how the environment will change, how the dough's shape will change if you use a tool to apply a specific action, it makes this inverse decision based on the forward model. This decision happens at two levels. At a high level it decides what tool to use, which is a task-level decision; and given the tool, the robot also has to make lower-level, motion-level decisions about what specific action to take in order to progress into the next task stage.
[00:50:02] Humans are just so annoying, adding pieces, folding the dough. The robot is very robust to these external disturbances and continues its progress on the task. Here's what's interesting: after the robot cuts a circle, the human shows no mercy and destroys everything. The robot knows it actually has to start from the very beginning, redo the task from the beginning, in order to progress toward the task objective. So this really shows the patience and also the robustness of our system under this type of external disturbance. And all of these capabilities are enabled by this neural dynamics model that predicts how the shape of the dough will change if you apply a specific action. In the end, the robot places the skin on top of the dumpling clip, moves the filling onto the dumpling skin, and uses a hook to close the dumpling clip.
[00:50:50] So you use this general-purpose robot, equipped with 15 general-purpose tools, to make a dumpling out of a dough. So this is about how we can learn the model and how that model can be useful for downstream model-based planning. For this specific case, if we want to describe it more rigorously, we are not using reinforcement learning: we just learn the model and use that model to do planning, although the plan can be distilled into a policy that can be executed in the real environment in a more efficient manner. But some people also call it model-based reinforcement learning. Depending on which background you are coming from, you can either call it model learning and model-based planning, or you can call it model-based reinforcement learning.
[00:51:35] But the key idea is that you want to learn the model from the robot's physical interactions with the real physical world, and use that learned model, which is very effective in helping the robot decide its behavior, to progress toward the task objective. [00:51:47] So in this specific case, the high-level planning and the low-level decision-making are done by two different models. At the high level, given the current state, the current observation of the environment, and the target the robot hopes to achieve, there is essentially a classifier that classifies which tool to use; and conditioned on this classified tool label, there is a low-level policy that decides what specific action to take in order to progress into the next task stage. Very good question. So back then, this work was done in 2023. At that time, vision-language models weren't very powerful.
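The two-level decision-making just described can be sketched minimally as a high-level tool classifier feeding a tool-conditioned low-level policy. Everything concrete below is a hypothetical stand-in for the learned components: the tool names, centroid features, and gains are invented for illustration only.

```python
import numpy as np

# Hypothetical feature centroids per tool, e.g. averaged from a
# handful of human demonstrations (stand-in for a trained classifier).
TOOL_CENTROIDS = {
    "roller": np.array([1.0, 0.0]),
    "cutter": np.array([0.0, 1.0]),
}

def choose_tool(obs):
    # High-level, task-level decision: nearest-centroid classification
    # of the current observation.
    return min(TOOL_CENTROIDS,
               key=lambda t: np.linalg.norm(obs - TOOL_CENTROIDS[t]))

def low_level_policy(tool, obs, target):
    # Low-level, motion-level decision conditioned on the tool label:
    # a proportional step toward the target, standing in for a
    # learned per-tool policy.
    gain = {"roller": 0.5, "cutter": 0.2}[tool]
    return gain * (target - obs)

obs = np.array([0.9, 0.1])
tool = choose_tool(obs)
action = low_level_policy(tool, obs, np.array([0.0, 0.0]))
```

The design point is the dispatch structure: the classifier's discrete output selects which continuous controller runs, so recovering from a disturbance just means the classifier re-labeling the current observation as an earlier task stage.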
[00:52:25] So at that time, what we did was to have a human operator collect data by demonstrating the task 10 times. We used that data to train this classifier to classify what tool to use. That allows us to jump back and forth over this chain: like I mentioned earlier, after the robot cuts a circle and the human destroys everything, the robot should jump back to whatever earlier stage fits its current observation, in order to do the proper recovery from the external disturbance. So in this specific case, what we have been doing is a combination of sampling-based trajectory optimization and policy learning. What we have been doing is: given the current state of the dough, we have our forward predictive model.
[00:53:08] We sample a bunch of actions and a bunch of tools to predict the evolution of the shape of the dough, and then we compare the model's prediction with the target we hope to achieve, which is similar to what I showed earlier. For example, our model predicts the shape of the dough will evolve into these green dots, but the target is these red dots. We compare their distance, and that allows us to select the most effective actions that get us as close to the target as possible. We can do a lot of samples like this, but sampling at test time is very time-consuming. So we do this type of sampling in an offline fashion, which gives us a dataset, and we can use that dataset to train a policy whose inference takes a very short period of time at test time.
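The offline distillation recipe described here (run the expensive sampling planner over many states to build a dataset, then fit a fast policy on it by supervised learning) might look like the sketch below. A linear least-squares policy and the toy integrator dynamics stand in for the neural-network policy and learned dough model.

```python
import numpy as np

def expensive_plan(s, target, rng, n=512, horizon=5):
    # Offline, slow: random-shooting planner over toy integrator
    # dynamics; returns the first action of the best sampled sequence.
    cand = rng.normal(0.0, 0.5, size=(n, horizon, s.shape[0]))
    costs = np.linalg.norm(s + cand.sum(axis=1) - target, axis=1)
    return cand[np.argmin(costs)][0]

def distill(n_states=200, seed=0):
    # Build a (state-residual, planner-action) dataset offline, then
    # fit a fast linear policy a = (target - s) @ W by least squares.
    rng = np.random.default_rng(seed)
    X, Y = [], []
    for _ in range(n_states):
        s = rng.normal(size=2)
        target = rng.normal(size=2)
        X.append(target - s)
        Y.append(expensive_plan(s, target, rng))
    W, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)
    return W

W = distill()

def fast_policy(s, target):
    # Test-time inference: one matrix multiply instead of 512 rollouts.
    return (target - s) @ W
```

The expensive sampling happens once, offline; at deployment only the distilled policy runs, which is the speed trade-off the answer describes.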
[00:53:54] There is still a neural network as the policy, yeah, although that policy is learned by distilling from our model's predictions over a huge number of samples. For this specific work, there's no physics-based simulation at all. We actually have a baseline that uses a state-of-the-art deformable-object simulator based on MPM, the material point method. What we realized is that even if we do very extensive system identification, estimating the parameters of those physics-based deformable-object simulators, the identified model is noticeably less accurate than the model directly learned from the real-world interactions. Like I showed earlier, for example, the first row is our model's open-loop prediction and the second row is the ground truth.
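For contrast, the system identification mentioned here amounts to fitting an analytic simulator's physical parameters against observed transitions. A toy sketch with a single hypothetical "damping" parameter (the real MPM simulator has many more, but the fitting loop has the same shape):

```python
import numpy as np

def simulator(s, a, damping):
    # Hypothetical analytic model: a damped integrator.
    return damping * s + a

def fit_damping(transitions, grid=np.linspace(0.5, 1.0, 51)):
    # System identification by grid search: pick the parameter value
    # minimizing one-step prediction error on real (s, a, s_next) data.
    errs = []
    for d in grid:
        err = sum(np.sum((simulator(s, a, d) - sn) ** 2)
                  for s, a, sn in transitions)
        errs.append(err)
    return grid[int(np.argmin(errs))]

# Generate "real-world" transitions from damping = 0.9 plus small
# observation noise, then recover the parameter.
rng = np.random.default_rng(0)
data = []
for _ in range(100):
    s, a = rng.normal(size=2), rng.normal(size=2)
    data.append((s, a, 0.9 * s + a + rng.normal(0, 0.001, size=2)))
best = fit_damping(data)
```

The lecture's point is that even the best-fit parameters of a hand-built model can still underfit real dough behavior that a neural dynamics model trained directly on interaction data captures.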
[00:54:36] Our model's prediction aligns very well with the ground truth, which is just much more accurate than the physics-based simulators out there. Okay, so if there are no more questions, I will continue. What we have discussed is this kind of model learning and how the learned model can be effective for downstream model-based planning. The next category of algorithms is imitation learning. To recap a little: we discussed reinforcement learning, which learns the policy directly through trial and error with the environment and has a lot of troubles, for example sample efficiency and safety concerns. Model learning, by contrast, falls back into the category of supervised learning, where we have recorded the evolution of the environment.
[00:55:25] We use that data to do supervised learning to train the model, and then use the model for model-based planning. And instead of using supervised learning only to train the model, people also ask: can we do supervised learning for the policy as well? This is the general idea of imitation learning: can we collect a big dataset that shows how a task should be done and use it to train the policy? I'm showing this figure again. We are trying to learn a policy that takes the state as input and predicts the action, and all of the learning signal comes from large-scale data collected from humans demonstrating to the robot how the task should be done. Learning from demonstration is of course not new.
[00:56:12] It has been investigated for decades, and it is also essentially how we humans learn to perform a lot of physical interactions and social activities in the real world from a very young age. One of the earliest classic imitation learning algorithms is called behavior cloning, which essentially tries to learn the mapping from observation o to action a, with the policy represented by a function pi parameterized by theta. One of the key issues with behavior cloning is called cascading error, because, as I mentioned, the key difference when a robot or agent interacts with an environment is that it is a sequential decision-making problem.
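Behavior cloning as defined here is just supervised regression from observations to actions. A minimal sketch, under the simplifying assumption of a linear policy pi_theta(o) = theta^T o trained by gradient descent on the mean-squared imitation loss (the shapes and hyperparameters are illustrative only):

```python
import numpy as np

def train_bc_policy(observations, actions, lr=0.1, epochs=500):
    """Behavior cloning: fit pi_theta(o) ~= a on (observation, action)
    pairs from demonstrations by minimizing the MSE imitation loss."""
    n = observations.shape[0]
    theta = np.zeros((observations.shape[1], actions.shape[1]))
    for _ in range(epochs):
        pred = observations @ theta                          # pi_theta(o)
        theta -= lr * (observations.T @ (pred - actions)) / n  # MSE gradient step
    return theta
```

Any richer function class (an MLP, a transformer) slots into the same recipe; only the loss and the dataset of demonstrations define the method.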
[00:57:01] It differs from typical supervised learning in the computer vision domain in that your errors can accumulate and be amplified over time. Say you make a very small error at the very beginning. That small error can lead you to a state that deviates slightly from the distribution of data used to train your model. That leads the policy to make an even larger error, and this error gets amplified over the temporal horizon, producing a trajectory that deviates quite a lot from the demonstration trajectories. That is the typical failure mode of behavior cloning. So when people try to make imitation learning work, they often follow a pipeline where, at the top, we have demonstrations collected by experts.
[00:57:48] Then we use that as training data for supervised learning to train the policy, roll out the policy in the real environment, and observe the failure cases. We then either collect additional data or provide corrective behaviors, so the dataset contains not only the initial demonstrations but also corrections that steer the policy's errors back to the canonical trajectory, or at least back to a trajectory that can still successfully accomplish the task. This is the typical life cycle when we develop any imitation learning agent or algorithm in the real physical world. And along these lines: when people do this kind of imitation learning, there is no very explicit definition of what the task actually is; the task is implicitly hidden within the demonstrations.
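The rollout-observe-correct cycle just described is the idea behind DAgger-style data aggregation. A schematic sketch, where `train`, `rollout`, and `expert_correct` are hypothetical stand-ins for the trainer, the real-world rollout, and the human corrective labeling:

```python
def imitation_loop(train, rollout, expert_correct, demos, rounds=3):
    """Iteratively: train on the dataset, roll out the policy, have an
    expert relabel the states the policy actually visits, and aggregate
    those corrections into the dataset (DAgger-style aggregation)."""
    data = list(demos)                 # initial expert demonstrations
    policy = train(data)
    for _ in range(rounds):
        visited = rollout(policy)      # states reached by the learner itself
        data += [(s, expert_correct(s)) for s in visited]
        policy = train(data)           # retrain on demos + corrections
    return policy
```

The key point is that corrections are collected on the learner's own state distribution, which is exactly what plain behavior cloning never sees.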
[00:58:42] So there is a class of algorithms called inverse reinforcement learning. On the left is how people typically think about reinforcement learning, whereas on the right, inverse reinforcement learning is used to summarize a reward from your demonstrations, and that reward can then be used in ordinary reinforcement learning to learn a policy. Some of the earliest success examples were actually developed here at Stanford by Pieter Abbeel and Andrew Ng, which allowed them to control helicopters to perform some very, very aggressive maneuvers. This is actually quite old work, and being able to achieve this kind of agile and effective behavior on a real physical helicopter was very impressive at the time.
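A tiny sketch of the inverse-RL idea: recover a linear reward r(s) = w . phi(s) under which the expert looks best, by pushing the reward weights toward the expert's average features and away from the current policy's. This is a bare-bones update in the spirit of apprenticeship learning, not Abbeel and Ng's exact algorithm, and every name here is illustrative:

```python
import numpy as np

def fit_linear_reward(expert_feats, policy_feats_fn, lr=0.1, iters=50):
    """Adjust reward weights w so the expert's feature expectations score
    higher than the current policy's under r(s) = w . phi(s)."""
    w = np.zeros_like(expert_feats)
    for _ in range(iters):
        w += lr * (expert_feats - policy_feats_fn(w))  # ascend (mu_E - mu_pi) . w
        n = np.linalg.norm(w)
        if n > 1.0:
            w /= n                                     # keep rewards bounded
    return w
```

In a full system, `policy_feats_fn` would run RL against the current reward and return that policy's feature expectations; the learned reward then drives ordinary reinforcement learning, as the lecture describes.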
[00:59:31] So this is the power of learning from demonstrations: using the demonstrations to summarize a reward and, in connection with reinforcement learning, this is what we are able to achieve. Obviously, over the years people have made imitation learning algorithms more and more effective, especially by connecting them with, for example, energy-based models. The explicit policy shown on the left directly maps from observation o to actions.
[00:59:58] If you instead come up with an implicit policy, taking ideas from energy-based models, the model takes the observation and a candidate action and predicts a score, and doing inference with this energy-based model produces the predicted action. That allows robots to handle demonstrations that are highly multimodal, and to handle scenarios where the optimization landscape may not be very smooth, distilling policies from demonstrations for these kinds of contact-rich manipulation tasks. I should also say that some of the very recent success of robot learning as a whole is the result of a work called diffusion policy, which again takes advantage of advances in the generative modeling community.
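The implicit-policy inference step can be sketched very simply: score (observation, action) pairs with an energy and return the lowest-energy action. This is a derivative-free caricature of implicit behavior cloning (real systems optimize the energy more carefully), with hypothetical names and a 2-D action space for illustration:

```python
import numpy as np

def implicit_policy_inference(energy, obs, num_samples=256, seed=0):
    """Implicit policy: instead of a = pi(o), sample candidate actions
    and return argmin_a E(o, a), the best-scoring action for this obs."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(num_samples, 2))  # 2-D actions
    scores = np.array([energy(obs, a) for a in candidates])
    return candidates[scores.argmin()]
```

Note why this helps with multimodality: if the energy has two equally deep minima, an explicit regressor would average them into an invalid action, while the argmin commits to one mode.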
[01:00:52] For the implicit behavior cloning work, people drew inspiration from the development of energy-based models, a type of generative model developed in the deep learning community. There is another class of more powerful models in the deep learning community called diffusion models, and people have also tried to use diffusion models as the policy function class, allowing the agent to inherit the benefits and properties of those diffusion models. This work was originally done at Columbia, which is where I am right now, and the lead PI of this work has since come to Stanford.
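The core training objective behind a diffusion policy can be sketched compactly: corrupt a demonstrated action with noise at a random diffusion step and train a network to predict that noise, conditioned on the observation. This is a simplified DDPM-style objective with an illustrative cosine schedule, not the diffusion policy paper's exact recipe; `noise_pred` is a hypothetical network:

```python
import numpy as np

def diffusion_policy_training_step(noise_pred, obs, action, T=100, seed=0):
    """One denoising-objective step: noise the demonstrated action at a
    random timestep t, then score how well noise_pred(obs, noisy, t)
    recovers the injected noise."""
    rng = np.random.default_rng(seed)
    t = int(rng.integers(1, T))
    alpha_bar = np.cos(0.5 * np.pi * t / T) ** 2   # illustrative noise schedule
    eps = rng.normal(size=action.shape)
    noisy = np.sqrt(alpha_bar) * action + np.sqrt(1 - alpha_bar) * eps
    return np.mean((noise_pred(obs, noisy, t) - eps) ** 2)
```

At inference time the trained network is run in reverse, iteratively denoising random noise into an action conditioned on the observation, which is what gives the policy its multimodal, smooth action distributions.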
[01:01:34] You can see that many of the works I selected have roots here at Stanford; she is currently in the EE department at Stanford. This policy really shows a very diverse set of capabilities, allowing robots to do not just planar pushing but many fine-grained manipulation tasks: not only pick-and-place but, for example, spreading butter on bread, scrambling eggs, peeling potatoes, and sliding books. It really shows that with this type of recipe, where you collect a bunch of demonstrations and use the best policy-learning mechanisms, you can get a policy that works in the real physical world in a very, very efficient manner.
[01:02:21] Meaning: you collect the data in the morning, you train the policy at noon, and in the afternoon you can have a working policy in the real physical world. Obviously there are a lot of caveats in how reliable your policy is, how generalizable it is, and how diverse the initial configurations can be while the policy still works robustly. But imitation learning is still the most efficient way to get a policy that can do something interesting in the real physical world. And for the policy to be effective and robust to real-world variations, this type of iterative data collection needs to be in place for the policy to cover unexpected or deviating behaviors. So that's imitation learning. Any questions? Okay.
[01:03:15] So if there are no more questions, I will use the remaining time to discuss some of the recent developments that drive all the excitement about robot learning, which is robotic foundation models. Of course, this is a very involved domain; for each one of these items you could build an entire course around it. So for today's lecture I am just skimming through them very quickly, and I will only tell you the gist, the high-level knowledge you need when you see these terms. A robotic foundation model is a type of model that is very similar to reinforcement learning or imitation learning in its function class: there is no explicit representation of, for example, the states, and no model. A robotic foundation model does not learn a model of the environment.
[01:04:07] It is still a policy that maps from the observation and goal to the actions, and it can still be very nicely represented using these figures: you have an agent, which is a policy taking the current state and also the goal as inputs, trying to generate actions that can be executed in the real physical world. But you might say this is very similar to imitation learning and reinforcement learning, so what is special about robotic foundation models?
[01:04:37] This is all rooted in the developments within the foundation model domain, especially language foundation models and vision-language foundation models. Meaning: it is a policy, but it needs to generalize much better than a policy that just works for one specific task. Here is actually my definition, drawing an analogy from the current development of vision-language models: their outputs may not always be perfect, but the promise of a foundation model is that it always generates something reasonable. So what we hope to achieve with a robotic foundation model is that the synthesized action may not always be the optimal action conditioned on the observation and the task, but the generated trajectory will always be beautiful and reasonable to execute in the real physical world.
[01:05:28] Beautiful meaning it should not be any jiggling motion; it should be smooth and continuous. Reasonable meaning it should listen to the language instructions you give to the robot. Obviously there are also many different names describing exactly the same thing. Some people call them vision-language-action models, or VLAs; some people call them large behavior models. But in essence they are all describing the same thing: a policy that takes the observation and a language instruction, or whatever the task specification is, and tries to generate actions that generalize widely across a wide range of scenarios. Now, this area is actually quite noisy. Noisy meaning it is very, very hard to quantify the progress of different robotic foundation models, because you are calling it a foundation model; what does that mean?
[01:06:17] That means you expect the model to generalize very broadly over a wide range of scenarios, and if that is your expectation, you actually need significant evidence to show it really does generalize broadly. That is why evaluation and quantitative measurement of progress is very challenging. But still, by looking at the empirical videos, you can see a lot of very interesting and concrete progress over the past few years.
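Pulling the definition above together, the interface of a vision-language-action policy can be sketched as follows. This is an interface sketch only: the internals are a placeholder, and a real VLA would put a large pretrained vision-language backbone behind this signature; all names, dimensions, and the action-chunk convention are assumptions for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    image: np.ndarray     # camera frame, e.g. (H, W, 3)
    proprio: np.ndarray   # robot joint state

class VLAPolicy:
    """One model maps (observation, language instruction) to a short
    chunk of future actions; that signature is the whole point."""
    def __init__(self, action_dim=7, horizon=8):
        self.action_dim = action_dim
        self.horizon = horizon

    def act(self, obs: Observation, instruction: str) -> np.ndarray:
        # Placeholder internals: return a zero action chunk of shape
        # (horizon, action_dim) so the interface is runnable.
        return np.zeros((self.horizon, self.action_dim))
```

Every model named below, whatever its internals, exposes essentially this contract.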
[01:06:44] A lot of the earlier investigation started with RT-1, which was released in December 2022, and since then, roughly every half year, there has been a new model: RT-2, RT-X, OpenVLA, and some recent ones like π0, each making concrete progress along this line of developing more and more generalizable robotic foundation models. And this year there has been a huge burst of foundation models: Helix, Hi Robot, Gemini Robotics, π0.5, etc. So there is a lot of investigation and also investment in this domain, not only capital investment but talent investment, in developing better and more generalizable robotic foundation models. Due to time, I clearly cannot go into the details of all these models.
[01:07:38] So if you are interested: I actually gave a tutorial two months ago at AAAI specifically describing and discussing some of the models along this axis; please go and watch it. For today, I will mostly give you a high-level overview of what is actually essential for this kind of foundation model and what it looks like, with π0 as an example. π0 was first released in October 2024. I think this is the work that convinced me that this type of robotic foundation model can do some very reliable, dexterous manipulation in real-world environments. It can handle cloth folding, box folding, and many other types of manipulation tasks in a very reliable manner. And here is how the framework looks at a high level. On the left are the datasets.
[01:08:35] So for any model to be called a foundation model, it needs fuel, and that fuel is data. They aggregate a lot of data, both from academia and data collected by themselves, across many different embodiments, where the robots are doing interesting and useful tasks in real-world environments, and they use this data to do pre-training. One important caveat of this pre-training is that it starts with a pretrained vision-language model, one already trained on vast amounts of vision-language data, so it can naturally adopt the semantics and knowledge from those models. Together, by doing what they call co-fine-tuning, using both an objective for action prediction and an objective adapted from vision-question-answering-style tasks, you will be
able to preserve the semantic knowledge within the model while at the same time predicting robot actions. That is the pre-training stage. [01:09:32] A very important design element for many of the existing robotic foundation models is called post-training, which is also inspired by developments in the large-language-model community: you have a base model, and the base model can give you reasonable baseline performance, but if you really want the performance to be very good on a specific task, you actually have to collect task-specific data and fine-tune the model, doing post-training on the data for that specific task, for the performance to be satisfactory. [01:10:02] So they evaluate their whole system over three different categories.
[01:10:08] For the first, you can directly use their base model, and the base model can already be good enough for some very simple in-distribution tasks, that is, tasks that may already have been encountered during the pre-training stage. For in-distribution tasks that are slightly more complicated, you can do post-training to allow the base model to further improve on those tasks. And for unseen tasks, you typically have to do post-training by collecting task-specific data and fine-tuning your pre-trained model on those tasks for it to be performant. [01:10:43] This π0 model is actually open-sourced, and you can just download the checkpoints. The students in my lab have already started playing with their models and trying to do post-training, and we are starting to see some very promising results.
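The three usage tiers just described (base model for simple in-distribution tasks, post-training for harder in-distribution tasks, post-training on new data for unseen tasks) can be summarized in a small decision sketch. This is purely illustrative; the function name and tier labels are my own, not part of any real π0 API.

```python
# Illustrative sketch of the three usage tiers for a pre-trained robot policy.
# The function and the returned labels are hypothetical, for exposition only.

def adaptation_strategy(in_distribution: bool, complex_task: bool) -> str:
    """Pick how to use a robotic foundation model for a target task."""
    if in_distribution and not complex_task:
        # Simple tasks likely already seen during pre-training: base model suffices.
        return "use base model directly"
    if in_distribution:
        # Harder but still in-distribution: post-train to sharpen performance.
        return "post-train on in-distribution data"
    # Unseen tasks: collect task-specific demonstrations, then fine-tune.
    return "collect task-specific data, then post-train"
```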
[01:10:57] So if you are interested, you are highly encouraged to try it. [Audience question] That is a very good question. You are essentially asking about the efficiency of existing robotic foundation models. There are a lot of reasons why the policy is actually slower than humans. One of the major reasons comes from how the demonstration data was collected. Typically, in many of these scenarios, the demonstration data was collected by a human teleoperating the robot, on that exact same robot, to do the data collection, for example folding this box. And human teleoperation is actually slower than a human just using their hands to do the task, even if you have given them hours of training. This is because you are using a different embodiment and environment than the ones a person is most familiar with.
And also, at the same time, because the robot arms are a certain distance away from you, there will be occlusion. Sometimes you have to look very closely and carefully, moving your head and changing the viewing angle, in order to really understand whether it is time to progress to the next task stage or not. There are a lot of caveats and inefficiencies in the current data-collection regime. That is why a policy trained directly on those data turns out to be slower than human speed. So there is a lot of investigation into how we can make this kind of data collection even more efficient, at human speed; that is actually a very active research direction. [Audience question] So this is a very good question. For this box-folding task, I would argue this is already a very long-horizon task.
So I was very impressed by how well this one single policy is able to handle this long-horizon task. But you could argue that if you really want this policy to be useful at a larger scale, in wider scenarios in your home, you not only want the robot to fold a box; you want it to fold shirts and make the beds and clean all the messes on the floor. In those types of scenarios, currently, I personally don't believe one gigantic policy is able to adapt. Some higher-level abstractions, some kind of scene graph or symbolic representation, needs to be in place as a condition for these vision-language-action models, for those policies to be most effective and useful, to steer them toward different types of tasks and scale to larger environments and more complicated tasks.
[01:13:21] They started with a pre-trained vision-language model, so there is already a lot of semantic knowledge that was learned through large-scale pre-training on vision-language data. That is why some of the generalization comes for free, meaning the base model can actually have surprisingly good levels of generalization at the semantic level. It's just that you have to fine-tune this model with robot data to make sure it can also generalize not only at the semantic level but also at the action level.
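The co-fine-tuning idea mentioned earlier, jointly optimizing an action-prediction objective and a vision-language objective so the backbone keeps its semantic knowledge, can be sketched as a weighted loss. This is a toy illustration under my own assumptions: the function names and the simple MSE action loss are stand-ins, not the actual π0 training code (π0 itself uses a flow-matching action objective).

```python
import math

# Toy sketch of "co-fine-tuning": a single scalar loss mixing an
# action-prediction term (robot data) with a language/VQA term
# (vision-language data), so that fine-tuning for actions does not
# erase the backbone's semantic knowledge. Illustrative only.

def action_loss(pred, target):
    """Mean squared error over a predicted action chunk (stand-in objective)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def vqa_loss(answer_probs, correct_idx):
    """Negative log-likelihood of the correct answer token."""
    return -math.log(answer_probs[correct_idx])

def co_finetune_loss(pred, target, answer_probs, correct_idx, lam=0.5):
    """Weighted sum; lam trades off action accuracy vs. retained semantics."""
    return action_loss(pred, target) + lam * vqa_loss(answer_probs, correct_idx)
```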
[01:13:48] So maybe we can take further questions afterwards, because we're already about out of time. In the last two or three minutes, I'll discuss some of the remaining challenges, especially along the development of robot learning models. One of the major challenges the whole community recognizes is evaluation. Evaluation is currently primarily done in the real world. For example, this is a picture from Google robotics: they have a grid of these teleoperated ALOHA systems with which they do data collection and also evaluation. And real-world evaluation is both costly and noisy. Their exact words to me were that for evaluation, they have a large enough budget that they can still make progress. Those were their exact words.
Meaning, if you were to do the evaluation, or I were to do the evaluation, the results can be very different from each other, depending on how we specify the initial configuration and how the lighting conditions change. Even the friction parameters from the manufacturer can make a huge difference in how robust your downstream policy is. So this is very costly, and they have to wait two days for the results to come back. And currently there is very weak correlation between the training loss and real-world success rates. This is another very important caveat, and a difference between supervised learning and this kind of sequential decision-making, this kind of policy learning: for supervised learning, your training loss directly measures how good your model is.
But for this kind of policy learning, your training loss measures how good the one-step prediction is, which sometimes may not be, and actually often is not, indicative of the performance of the policy over a long task horizon. Even if your loss is low, for long-horizon task execution your policy can actually be worse. The mismatch between the training objective and the task-specific metrics, between training and test horizons, is part of why it is very hard to come up with even approximate or proxy metrics to measure the performance of the policy, and people have to rely on real-world evaluation. [01:15:46] So then the question is, what about doing the evaluation in simulated environments? There has been a lot of investigation here, for example BEHAVIOR, which is done in Fei-Fei's lab here at Stanford, and also Habitat 3.0 from Meta.
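The point above about one-step loss versus long-horizon success can be illustrated with a toy compounding-error model. All numbers here are invented for illustration: a constant per-step error that looks negligible to a supervised loss can still drift the rollout out of the success region over a long horizon.

```python
# Toy illustration of why low one-step loss need not imply long-horizon
# success: a small constant per-step error produces a tiny supervised loss,
# but the error compounds over the rollout. Numbers are purely illustrative.

def one_step_loss(step_error: float) -> float:
    """What supervised training sees: the squared error of a single step."""
    return step_error ** 2

def rollout_succeeds(step_error: float, horizon: int, tol: float = 1.0) -> bool:
    """What we actually care about: does accumulated drift stay within tol?"""
    return abs(step_error) * horizon < tol

# A 0.05 per-step error yields a tiny loss of 0.0025, yet over a 40-step
# horizon the accumulated drift (~2.0) exceeds the tolerance: the task fails.
```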
People are trying to come up with these extensive simulated environments to do evaluation and measurement of robot policies, and obviously they have their own issues, especially with regard to the sim-to-real gap: how can you do simulation of rigid bodies, deformable objects, and cloth accurately enough that it correlates well with real-world performance? Assets are also another major issue, where large-scale generation of those assets is a huge pain (I can elaborate, but maybe after the lecture). How to digitize the real world, and how to do procedural generation of realistic and diverse scenes, are all issues with using simulation to do evaluation for robot learning policies. And really, we want to find a correlation between sim and real; it's calling for an ImageNet moment in embodied AI, because the reason ImageNet was
successful is that, at least for a few years, any progress on ImageNet meant progress in deep learning and computer vision. We want the same thing: we want a platform such that any progress on that benchmark or platform means progress in robot learning. That's something we really want. [01:17:09] And I'll maybe skip through this: we talked about how to build these foundational policies; there can also be investigation into how to build foundational world models. Especially now, people are collecting large-scale action-conditioned robot interaction data to train these foundation policies, and there is a lot of dynamics knowledge embedded in those data; if we only use those data to do policy learning, that would be such a waste.
So we are also thinking about how we can use this large-scale action-conditioned robot interaction data, already collected to train those foundational policies, to train foundational world models, and how the two can interplay with each other. There are some existing works thinking along this direction of building foundational world models, and there are some very interesting design questions you might think about: do you want it to be 3D, do you want structural priors, how much learning versus how much physics, and how you can correlate it with the real physical world. [01:18:04] And actually, I think we are about out of time, so I will end here. This is the future we hope to achieve: to really build foundational robotic models that can work very widely and generalize very well in the unstructured environments around us.
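On the evaluation challenge raised above, wanting sim performance to correlate with real performance, one minimal, concrete check is the correlation between per-policy success rates measured in simulation and in real-world trials. A sketch follows; the success-rate numbers are made up, and Pearson correlation is just one possible choice of agreement metric.

```python
import math

# Sketch: quantify sim-to-real agreement as the Pearson correlation between
# each policy's success rate in simulation and in real-world trials.
# The success-rate numbers below are invented for illustration.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Success rates for four hypothetical policies, measured in sim vs. real:
sim_rates = [0.9, 0.7, 0.5, 0.2]
real_rates = [0.8, 0.6, 0.5, 0.1]
# A correlation near 1.0 suggests sim rankings transfer to the real world.
```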
[01:18:21] And the next lecture will be on human-centered AI. And that will be the end of today's lecture. Thank you so much.

================================================================================ LECTURE 018 ================================================================================

Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 18: Human-Centered AI
Source: https://www.youtube.com/watch?v=g8UaBfj6Sh8

--- Transcript

[00:00:05] Welcome to the last lecture of the quarter for CS231N. It was great to see you guys at the beginning and now at the end. This lecture is a little bit of a departure: we're not going to teach any new material in terms of algorithms. It's more a talk that I'd like to give to students to offer a perspective, both on a longer-term research evolution and on another dimension that is important to today's AI, which we would call the human perspective.
[00:00:46] For completeness of the material, there is a little bit of overlap that you might see with other parts of this course, but hopefully it makes sense in a fuller way. The title of this lecture is "What We See and What We Value: AI with a Human Perspective." I know that some of you have already heard about this: the beginning, the origin of vision, both in terms of evolution and in terms of our technology. We did talk about the first light that came to the animal world, back 540 million years ago. That was when animals, or trilobites to be specific, developed photosensitive cells to glean what the outer world is about.
[00:01:47] According to zoologists like Andrew Parker, what happened is that the onset of vision set off an evolutionary arms race where animals either evolved or died. That arms race gave rise to the explosive speciation of animals, which zoologists now call the Cambrian explosion, or the big bang of evolution. And of course, you wouldn't be surprised that vision is still, to this day, a primary sensory and intelligence system in many, many animals. Not all animals use vision, admittedly, but many do, and it is also one of the primary sensory systems for humans. We use vision to do everything from survival to work to entertainment to socialization to learning and development and many other things. So that's the recap, or summary, of evolution.
[00:03:05] We also briefly talked about computer vision being a summer vision project back in the 1960s, an attempt to use a couple of undergrads to construct a significant portion of the visual system. That was very in line with the history of AI, where we tend to have clarity about the north star but underestimate how long it would take. We are probably still experiencing that today. But a lot has happened, right? You don't need me to tell you that from empowering self-driving cars to understanding images to the generative AI revolution, we're seeing vision play a huge role, and in many parts lead the wave.
[00:03:58] So maybe it's time to take a different look at this, both historically and going toward the future: where have we come from, and where are we going? This is an important topic to discuss, because a lot of what has happened will inform what will happen. I'm organizing this talk in three chunks. First, building AI to see what humans see; that's where we came from, that we were so inspired by human capability that we wanted to make machines that do the same. Then we'll talk about building AI to see what humans don't see, and we'll finish with building AI to see what humans would like to see. Let's just start with the first one: building AI to see what humans see. Again, just a little bit of review. Humans are so good at seeing. We know this.
[00:04:58] This is a half-century-old experiment showing us that even when watching a video you've never seen, played at 10 hertz, which means every frame is on the screen for only about 100 milliseconds, it is still no problem for human eyes to detect a target, in this case a person, in a complex scene where you have no a priori knowledge about who this person is. It really underscores the superb ability of human visual understanding, especially object-focused understanding. We also briefly mentioned that around the turn of the century, neurophysiologists were measuring the speed of vision, in terms of humans seeing complex objects, in the form of brain electrical signals measured from EEG caps.
[00:06:10] And we see that differentiating, or categorizing, animals versus non-animals is a very complex task. Yet humans are capable of doing that at 150 milliseconds after the onset of the stimulus. And this is remarkable speed given the wetware we have under our skulls. Neurophysiologists have also taught us that objects are a very important functionality in human visual intelligence. So important that there are neural correlates, brain areas dedicated to object understanding, such as face areas, place areas, or body-part areas. This shows that evolution has really spent time honing our visual intelligence skills when it comes to object recognition.
[00:07:10] So all this built up the history for the field of computer vision: a few decades ago, object recognition became a fundamental building block for visual intelligence, and we wanted to empower machines with that. And in order to do that, we defined the problem, or at least the original problem, as: given an image, how do we enable a computer to call out what the objects in the image are? That's such an effortless task for humans. But if you think about it, now that you've learned enough computer vision, mathematically there are infinite possibilities to recognize any object, because of different lighting, texture, background, occlusion, viewing angle, scaling, and whatever else you can name. So this is actually fundamentally a difficult task.
[00:08:15] The history pre-deep learning is also very interesting. There were some pretty heroic attempts at solving the problem of generic, generalizable object recognition, and the first wave of attempts was actually very much inspired by psychology itself. We self-introspect, sometimes even to the detriment of over-self-introspection. We think that humans compose parts, right? We look at objects, we can see geometric parts, and then we can compose them into different objects. And that idea of using pre-designated parts or shapes and composing them in specific ways was the first wave of object recognition. These are different works and models coming from the 70s, 80s, and even going all the way into the 90s, using different parts and configurations to recognize objects. Of course, it didn't really work.
[00:09:23] It's mathematically beautiful and simple, but it didn't work. So the second wave of object recognition pre-deep learning was a really important era in the field of AI: the beginning of statistical machine learning. It was the marriage between computer programming and statistical modeling. And with that marriage, we started to realize the world is so complex. For these intelligence problems, whether visual intelligence, language intelligence, or other kinds of intelligence, in order to generalize, we need to learn the parameters. It's very hard to use hand-tuned models to get good learning.
[00:10:18] We now know we need data, even though at that time we didn't know how much data, but we also knew that we need to design, or architect, statistical models so that they have the capability of learning through different learning rules. And because of that, we saw a blossoming of models in that era: random fields, Bayes nets, support vector machines, and all that. And in fact, a lot of progress was made. By the time we were in the first decade of the 21st century, we even had international object recognition benchmarks, with a small number of object classes, to encourage everybody to compare their algorithms. So we were inching forward together. The last unlock for object recognition, as we have learned, again goes back to cognitive science.
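To make the contrast with hand-tuned models concrete, here is a minimal sketch of learning parameters from examples via a gradient-descent learning rule. The toy data and the plain logistic classifier are my own illustrative choices, not any specific model from that era:

```python
import math
import random

def train_logistic(data, lr=0.5, epochs=300):
    """Fit weights w and bias b by gradient descent on the logistic loss.
    Nothing is hand-tuned: the decision boundary is learned from the data."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))      # predicted probability
            for i in range(dim):                # the "learning rule":
                w[i] -= lr * (p - y) * x[i]     # gradient of the log loss
            b -= lr * (p - y)
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy 2D data: label is 1 exactly when x0 + x1 > 1.
random.seed(0)
data = [((x0, x1), 1 if x0 + x1 > 1 else 0)
        for x0, x1 in ((random.random(), random.random()) for _ in range(200))]
w, b = train_logistic(data)
accuracy = sum(predict(w, b, x) == y for x, y in data) / len(data)
```

The same learn-from-data recipe, with different losses and model families, underlies the SVMs and Bayes nets of that era.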
[00:11:19] So this particular psychologist, Irv Biederman, had long conjectured that humans can recognize a huge number of objects, and this is intuitive from our common knowledge. But he actually put a number on it. I personally call it the Biederman number: he conjectured that by age six or seven, children are able to recognize about 30,000 to 100,000 different visual categories. Where did he come up with this number? From a combination of looking at the number of nouns in the dictionary and visual studies of how kids recognize different objects. But it's a number that's pretty daunting, and pretty sobering, for the field of computer vision, because up until then, the middle of the first decade of the 21st century, we were working with a tiny number of object categories and a tiny number of images compared to what humans experience.
[00:12:25] And this was, as you know, the onset of, the motivation for, the ImageNet project, which took this Biederman number really seriously. We constructed a dataset that is on par with what the psychologist Biederman conjectured: around 22,000 object classes over 15 million images. And of course, that's where this class begins: because of the large data provided by ImageNet, we started to see powerful algorithms like neural networks (at the beginning it was convolutional neural networks; of course now we use transformers and all that) really show their power through big data. And this is a generic slide; for those people who didn't learn about this, I'm going to skip it, because you all know this.
[00:13:30] So the quick history is: as soon as we had ImageNet, and as soon as we used convolutional neural networks, a few years after the beginning of ImageNet, we saw the door blasted open in terms of solving the problem of object recognition. Now we have algorithms we can take to any picture in the world and recognize the objects in it, big or small, in any kind of orientation. Is it 100% solved? No. There are always long-tail problems to solve. But as far as industrial application goes, this has come a long way and has really become a matured problem.
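For readers who want the named ingredient spelled out, the core operation of a convolutional neural network can be sketched in a few lines. This is a toy, valid-mode 2D convolution in plain Python; the edge-detecting kernel is an illustrative choice, not from any slide:

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution (really cross-correlation, as in most
    deep learning libraries): slide the kernel over the image and take a
    dot product at each position. Stacks of these form a ConvNet layer."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(w - kw + 1)]
            for i in range(h - kh + 1)]

# A 2x2 vertical-edge kernel responding to a dark-to-bright boundary.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge_kernel = [[-1, 1],
               [-1, 1]]
feature_map = conv2d(img, edge_kernel)  # peaks where the edge sits
# feature_map == [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

A real ConvNet stacks many such filters with learned values, nonlinearities, and pooling; frameworks simply do this at scale on GPUs.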
[00:14:17] And of course, as all of you know, this all came at the convergence point, the year 2012, where the ImageNet challenge provided the data for the convolutional neural network, and they used two GPUs at that time. The three ingredients came together and brought the moment, the birth, of deep learning. And in this class we also talked a little bit about the various architectures that the ImageNet challenge engendered throughout the past decade or so: convolutional neural networks, ResNet, and so on. So that's really the beginning of the deep learning revolution. And of course, in terms of the quest for visual intelligence, we're not going to stop at just being able to label objects in a scene. For example, take these two scenes, right?
[00:15:19] If you just label objects, you'll think it's just a llama and a person. But if I show you the second scene with the llama and a person, the story is completely different. Even though you have the same objects, you have a very different relationship. So cognitive scientists, once again, were ahead of computer scientists, and inspired us to think about visual intelligence beyond just naming or categorizing objects. In this particular paper, Jeremy Wolfe, who is a pretty prominent psychologist, wrote a beautiful paper that called out that relationships between objects must be encoded as part of our understanding of complex natural scenes. And inspired by that work, the field of computer vision started to look at how we understand relationships. And this is early work.
[00:16:22] You guys got a lecture from Ranjay last week. This was his PhD thesis: learning object relationships using the scene graph as a representation. In this case, a scene graph is defined by entity nodes, which are objects; their relationships are defined by the connectivity between the nodes, and sometimes the nodes carry attributes that describe the particular objects. Even a scene as simple as this one, with mostly just two people, one feeding a cake to the other, can form a very dense scene graph because of the richness of the visual scene. And this was Ranjay's thesis. After the ImageNet object recognition era, we built a dataset called Visual Genome, where we tried to put together objects, object relationships, and also story descriptions.
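The scene graph described here can be sketched as a small data structure. The object and predicate names below are illustrative, not Visual Genome's actual annotations: objects become nodes, relationships become labeled (subject, predicate, object) edges, and attributes hang off the nodes.

```python
class SceneGraph:
    """Objects as nodes, (subject, predicate, object) triples as edges,
    with optional per-object attributes."""

    def __init__(self):
        self.objects = set()
        self.attributes = {}    # object name -> set of attribute strings
        self.relations = set()  # (subject, predicate, object) triples

    def add_object(self, name, *attrs):
        self.objects.add(name)
        self.attributes.setdefault(name, set()).update(attrs)

    def add_relation(self, subj, pred, obj):
        self.relations.add((subj, pred, obj))

    def relations_of(self, name):
        """All triples in which an object participates."""
        return {r for r in self.relations if name in (r[0], r[2])}

# Roughly encoding the cake-feeding scene from the lecture.
g = SceneGraph()
g.add_object("man", "smiling")
g.add_object("woman")
g.add_object("cake", "chocolate")
g.add_relation("woman", "feeding", "man")
g.add_relation("woman", "holding", "cake")
```

Even this tiny scene yields several nodes, edges, and attributes, which is why real images produce such dense graphs.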
[00:17:28] One piece of work Ranjay did that I thought was really fun was zero-shot learning of unusual object relationships. For example, it's not unusual to see a person riding a horse, and it's not unusual to see a person wearing a hat, but it's unusual, in general, to see a horse wearing a hat. And in the era of big-data training, it's hard to get this kind of data repeatedly, because you just don't have many examples. But using this compositional scene graph representation, we're able to learn the more common relationships and then derive the uncommon relationships from that representation.
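A toy sketch of the compositional idea (a simple count-based plausibility score of my own devising, not the actual model from the paper): learn per-slot statistics from common triples, then score an unseen combination like (horse, wearing, hat) by recombining the pieces.

```python
from collections import Counter

# Illustrative "common" training triples.
train = [
    ("person", "riding", "horse"),
    ("person", "riding", "bike"),
    ("person", "wearing", "hat"),
    ("person", "wearing", "shirt"),
    ("dog", "wearing", "hat"),
]

subj_pred = Counter((s, p) for s, p, _ in train)  # e.g. ("person", "riding")
pred_obj = Counter((p, o) for _, p, o in train)   # e.g. ("wearing", "hat")

def plausibility(s, p, o):
    """Back-off score: the full triple may be unseen, but its
    (subject, predicate) and (predicate, object) parts can still be known."""
    return subj_pred[(s, p)] + pred_obj[(p, o)]

# "horse wearing hat" never appears in training, yet its parts compose:
score = plausibility("horse", "wearing", "hat")     # "wearing hat" is known
nonsense = plausibility("hat", "riding", "person")  # no part is known
```

The real model composes learned visual and language components rather than counts, but the benefit is the same: unseen combinations inherit evidence from their familiar parts.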
[00:18:10] And again, this is another example of zero-shot learning: person sitting on a chair, or a fire hydrant on the lawn, are common relationships, but person sitting on a fire hydrant is one for which it's hard to get data, and we were able to make that work. And this is just a figure from the paper showing that Ranjay's work at the time achieved state-of-the-art recognition rates compared to many other methods. But relationships are not enough, right? The ability to actually tell a story that is a lot richer, using natural language, is the next big goal. So around the year 2014 we started working on that problem. And think about it: that's just two years after the ImageNet AlexNet moment. But the field was starting to evolve so fast.
[00:19:18] We were so inspired by what we could do using a combination of a convolutional neural network and a language model called an LSTM. This was the thesis of Andrej Karpathy; we were one of the first teams to show how to do image captioning, or storytelling, as well as dense captioning, which was also part of the work that Justin Johnson did, and I know he's one of the co-instructors of this course. Between roughly 2015 and 2018, a lot of work happened to solve this problem. Of course, today, using multimodal LLMs, we have taken the solution to this problem to yet another notch. But this was the beginning of that line of work, and frankly I myself, as a computer vision scientist who entered the field at the beginning of the century, was very surprised by how fast our field was able to solve this problem, as soon as we had the data and the neural network algorithms.
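The CNN-plus-LSTM captioning recipe mentioned here, in the spirit of that era's models, trains a decoder to maximize the likelihood of the caption's words given image features. In standard notation (my formulation, not the lecture's):

```latex
p(w_1,\dots,w_T \mid I) \;=\; \prod_{t=1}^{T} p\!\left(w_t \,\middle|\, w_1,\dots,w_{t-1},\, \mathrm{CNN}(I)\right),
\qquad
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_\theta\!\left(w_t \mid w_{<t},\, \mathrm{CNN}(I)\right)
```

The LSTM carries the $w_{<t}$ conditioning in its hidden state, seeded by the CNN's feature vector; dense captioning applies the same objective per image region.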
[00:20:30] But a much harder problem is actually in dynamic scenes. In dynamic scenes, we tend to have much more complex relationships and much more complex movements; the camera can move, and the actors within the scene can do a lot of different things. So in this work, a collaboration with Ehsan and a bunch of students in our lab, we call it multi-object, multi-actor activity understanding. This is much newer work; we only published it a couple of years ago.
[00:21:13] Capturing the relationships between these actors and their activities in a dynamic scene is still, I would say, an unsolved problem, and this will have profound implications. You know you're in Silicon Valley, so you're hearing so much excitement about robots, for example. If we ever dream of having everyday robots that work amongst us, robots have to solve this problem: understand how complex the scene is, what people are doing, who is doing what, and what comes next. This is an unsolved problem. Also, in addition to what I have shown you, there are related computer vision problems you've learned a little bit about in this class but that we didn't have time to elaborate on, for example 3D computer vision, human pose understanding, and of course generative AI and generative models.
[00:22:18] So this is just to show you that the field of computer vision, since the rebirth of modern AI, has been moving extraordinarily fast. But the take-home message of this section, for me, is two things. One is that data, compute, and neural network algorithms truly converged about 13 years ago, and that was the moment the modern AI, or deep learning, revolution happened. But the history of that, and so much of the problems we have been working on, was truly inspired by cognitive science, psychology, and neuroscience. And that, to me, is going to continue: we will continue to be inspired by what the brain can do and how the brain does things, and we'll also continue to use AI to help our brain research.
[00:23:19] So there is a very intimate relationship between today's AI and cognitive science, neuroscience, brain science, and all that. So that's the first section, and of course a lot of people, students and collaborators, have contributed to what I have just presented. Now let's talk about going beyond: building AI to see what humans don't see. This is about pushing AI beyond the capability of humans; you can call it superhuman. For example, most people don't recognize a ton of dinosaurs. You can probably name a few; some kids really can name a lot. Let alone thousands or tens of thousands of bird species, or tens of thousands of car categories. So this is the line of work that I call fine-grained object categorization. Humans are just not that good at it. And this is still a problem that I don't think we've fully solved yet, to be honest.
[00:24:31] In this generative AI era especially, where we're talking a lot about multimodal LLMs, this problem has been somewhat neglected, or it just is not a mainstream problem, but it really will still come to play an important role. So in this early work of fine-grained bird species recognition, we put together, actually we used, a data set of 4,000 birds. And as you can see, as we go up the tree of the species toward more generalizable, more general names, the error decreases, which is a convoluted way of saying that by the time you're at the fine-grained level, we still make a lot of errors; the algorithms are still not totally ready. Another work that I find fascinating is that a few years ago a group of students in my lab trained
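The error-versus-taxonomy-level observation above can be made concrete with a toy sketch: score the same predictions at increasingly coarse label levels and watch the error fall. The four-species "taxonomy" and the predictions below are invented for illustration; the study described used a tree over a 4,000-bird data set and a learned classifier.

```python
# Toy illustration: classification error shrinks as labels are coarsened
# up a taxonomy. Species names and the tiny taxonomy are made up.

# Map each fine-grained (leaf) label to its coarser parents, leaf upward.
TAXONOMY = {
    "house_sparrow":  ["sparrow", "songbird", "bird"],
    "tree_sparrow":   ["sparrow", "songbird", "bird"],
    "american_robin": ["thrush",  "songbird", "bird"],
    "mallard":        ["duck",    "waterfowl", "bird"],
}

def labels_at_level(leaf, level):
    """Label for a leaf at a given coarseness level (0 = the leaf itself)."""
    return leaf if level == 0 else TAXONOMY[leaf][level - 1]

def error_at_level(pairs, level):
    """Fraction of (predicted, true) pairs that disagree at a taxonomy level."""
    wrong = sum(
        labels_at_level(pred, level) != labels_at_level(true, level)
        for pred, true in pairs
    )
    return wrong / len(pairs)

# Predictions that confuse similar species but rarely cross coarse groups.
preds = [
    ("house_sparrow", "tree_sparrow"),    # wrong species, right genus
    ("tree_sparrow", "tree_sparrow"),     # correct
    ("american_robin", "house_sparrow"),  # wrong species and genus
    ("mallard", "mallard"),               # correct
]

for level, name in enumerate(["species", "genus", "family", "class"]):
    print(name, error_at_level(preds, level))
```

Here the error is 0.5 at the species level but drops to zero by the family level, mirroring the pattern described in the lecture.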
a fine-grained car classifier in terms of make, model, and year. [00:26:00] It turns out that after the 1970s there are thousands of car models defined by different make, model, and year. And then we took Google Street View images from 100 or 200 major cities across the country, used the fine-grained car detectors to detect which cars are on the streets of these cities, and used that as a lens to study social patterns. For example, what is the pattern here? I showed education patterns: car models and education patterns are highly correlated, or income patterns, highly correlated. In that paper we showed voting patterns highly correlated, or even environmental patterns highly correlated.
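The "cars as a demographic lens" idea boils down to correlating per-city detection statistics with per-city census variables. Here is a minimal sketch with a hand-rolled Pearson correlation; the city-level numbers are invented, and the feature (fraction of newer sedans) is a hypothetical stand-in for the study's actual car attributes.

```python
# Toy sketch of correlating per-city car detections with demographics.
# All numbers are invented; the real study detected fine-grained car
# makes/models/years in Street View imagery across many US cities.
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-city features: fraction of detected cars that are
# sedans newer than 2010, and a demographic rate (e.g., college degrees).
new_sedan_fraction = [0.22, 0.35, 0.41, 0.18, 0.30]
college_rate       = [0.25, 0.38, 0.45, 0.20, 0.33]

r = pearson(new_sedan_fraction, college_rate)
print(f"correlation: {r:.3f}")  # strongly positive for these toy numbers
```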
[00:27:03] So it's a really interesting way of using computer vision as a lens to study our society, and no individual human, not even a collection of humans, can do this easily at all. So AI is really pushing the boundary of what humans can see. To drive home this idea, let's do a couple of tests. Humans actually have our limitations, right? I just talked about celebrating humans' ability to see, but we also have our limitations. This is a very famous visual illusion called the Stroop test. The idea is that you all can read the words, but if I ask you to read the color of each word as fast as possible, going left to right and top to bottom, you find it's not that easy, right? Try to read it: red, yellow, green, purple, blue, black, orange. It's fighting with you.
[00:28:10] This is the fight between visual attention and all that. Here's another example. There are two alternating images of the same picture, and there's one change, a pretty big change, happening between the two alternating pictures. I don't know if you spot the change. Do you spot it? The engine. Yes, it's the engine, right? It takes a while to spot it. This is a very famous psychology experiment called change blindness. Now, all this is fun. The Stroop test is fun; this is fun. But this is not fun: human attention is limited, and in some situations in our work and life, that kind of attention limit can be dire. For example, medical errors are the third leading cause of death in America's health care system. And of course, leaving a pair of scissors in the body of the patient is kind of the iconic image of medical errors.
[00:29:18] But there are so many medical errors: pharmaceutical errors, procedure errors, clerical errors, diagnostic errors. So one has to be very careful. For example, in surgery rooms, honestly, scissors don't get left in bodies typically, but much smaller things do, like suture needles or a piece of gauze. And today most of this is still tracked by hand, right? We have these checklists to track items in the surgery rooms. If something is missing, the surgery has to be paused. On average, that pause is close to an hour. And think about the danger for the patient, the exposure to bacteria and the bleeding and all that, just because we have to search for that item.
[00:30:19] So if there is a way to use AI to help our doctors and surgeons track items, that would be so powerful. This is just a demo, not a deployed system; we're not there in terms of fidelity, but it shows that we can use AI to count, in this case, gauze and all that. And this is just an example of pushing AI to see what humans don't see. Here's another example that is really fun. I don't know if I've shown this before, but this is one of my favorite visual illusions, and I'm just giving you the answer. If you look at the two squares A and B on the checkerboard on top, it is so hard to believe they have the same grayscale, or luminance. And then you look at the bottom, and you're like, "Ah, of course they do." But why? Even with the bottom picture in front of you, seeing the top still gives you the illusion.
[00:31:29] Why? Because evolution has pre-wired us to conjecture about, or understand, our world in its common way, with the common physics of the shapes of objects, lighting sources, how shadows are made, and all that. This is so deep in our evolution, in our visual development, that it's hard for us to see it another way. So what I'm trying to get at is that there is bias in our human visual system. The bias might come from evolutionary constructs. The bias can come from our social experience. The bias can come from the data we're exposed to. But some of these biases can be harmful, right? When bias happens, it becomes unfair to a group of people, a community, and we have to be aware of this. A few years ago, face recognition algorithms were not good, and they tended to recognize certain skin colors, and even genders, better than others, and it has consequences.
[00:32:45] Think about self-driving cars; think about many other medical use cases. So we have to be vigilant about this. I do believe AI bias is a problem that people now are caring about. A few years ago this problem was so new that many people were not even paying attention. Fast forward to 2025: I'm not saying we have solved this problem, but I'm personally a lot happier to see that so many people are paying attention to it, not only in academia but also in industry. And then there's another kind of not seeing, and this is interesting. Sometimes not seeing is exactly what we want, because you want to respect privacy. So how do you create AI that helps people to see, yet you still want it not to see what people don't want it to see? This is very deep. It's a technical problem as well as a human problem.
[00:33:55] So from a technical point of view, there are many ways to think about machine learning privacy; I'm just listing a few here from a visual point of view. A few years ago our lab wrote a paper about using smart cameras in patient rooms or patient homes to help doctors see better, but even there we have to recognize issues like faces, or just full-body information, and even homes. And this is a list of potential solutions. For example, you can do blurring, or you can do masking; you can do dimensionality reduction; but you can also try different approaches, for example federated learning, so that you don't send all the data to the server, or encryption, and other things. So I'm not going to belabor this, but there's one work I want to show you. It's not even my work, but I really like this work.
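Two of the simpler visual privacy transforms listed above, masking and resolution reduction (a crude stand-in for blurring), can be sketched in a few lines. This is a toy illustration on a small grid of grayscale values, not the method from the paper mentioned; the region coordinates and values are invented.

```python
# Toy sketches of visual privacy transforms: masking a sensitive region
# and destroying fine detail by block-averaging (crude "blur").

def mask_region(img, top, left, height, width, fill=0):
    """Overwrite a rectangular region (e.g., a detected face box)."""
    out = [row[:] for row in img]  # copy so the input stays untouched
    for r in range(top, top + height):
        for c in range(left, left + width):
            out[r][c] = fill
    return out

def downsample(img, factor):
    """Average non-overlapping factor x factor blocks, losing fine detail."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(0, h, factor):
        row = []
        for c in range(0, w, factor):
            block = [img[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

img = [[10, 20, 30, 40],
       [50, 60, 70, 80],
       [90, 100, 110, 120],
       [130, 140, 150, 160]]

masked = mask_region(img, 0, 0, 2, 2)  # hide the top-left 2x2 region
low_res = downsample(img, 2)           # 4x4 -> 2x2, detail destroyed
```

The trade-off the lecture raises is visible even here: both transforms remove identifying detail, but `downsample` also removes the very information (what the person is doing) that some applications need.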
[00:34:57] It's a work about taking videos of people and trying to recognize their actions while still respecting their privacy. How do you do that? For example, in this case you want to take a video of this kid moving in the scene. There are ways to do this: if you blur it, or defocus it, or do some of these things, yes, you can protect privacy, but you also lose enough information that you might not even know what this person is doing, and for many applications the whole goal is to know what this person is doing. So in this particular work, led by Hong Koen's students, they did a combination of a hardware and a software approach, where they handcrafted a lens that actually filters visual data in a particular way,
so particular that if you look at the top row, [00:36:10] what the lens captures into the camera protects privacy a lot. You don't see the person's face, you don't see the body, and so on. But because it's a lens that's specifically designed in connection with a piece of software, it can help back out the movement information, the human activity information, without backing out face information. So that's a really interesting approach: a hybrid between hardware and software, aimed at important applications where you want to see people in order to protect them, but you don't want to see too much, because you want to respect privacy. So that's a work I really like; I really like the spirit of that work. Okay. So in this part of the lecture I shared with you a number of considerations for building AI to see what humans don't see.
[00:37:13] Sometimes we're pushing AI, as in fine-grained recognition of birds, to go beyond human ability; those are superhuman abilities. Sometimes we know humans are not good; we have bias, or we have attention issues, and then we want to use AI to help us. And sometimes we genuinely have situations where we don't want anyone to see, and then how do you use AI to continue to help without violating those privacy concerns? So you can see that AI is a very interesting, powerful tool. It can help us, but it can also amplify us, and if we have bias, if we have issues, AI can amplify those too. So when we build AI, it is so important not only to take the technology perspective but also to take the human perspective: to commit to study, forecast, and guide AI, to understand its human impact, and to respect human values.
[00:38:14] So that's the second take-home message, and again a number of collaborators and students participated in this work. Okay, now let's talk about building AI to see what humans want to see. And in fact, we're going to go beyond seeing; we're going to connect seeing and doing together. So if you think about today's societal anxiety about AI, one of the biggest anxieties is labor. A lot of headline news will say labor is under threat, robots are taking over jobs. The truth is, the picture is complex. Denying job change is wrong: every technological shift in human history has caused labor market change, and some of them are very painful; some of them can even lead to civil wars and wars.
[00:39:16] But also, that change sometimes is inevitable. And, a tiny digression: a lot of the labor-threat rhetoric we have been hearing is about physical labor, but if you look at generative AI's impact over the past two years, it is white-collar jobs that are being drastically impacted, especially software engineering and analytical work in offices. So there's definitely labor change, but in the meantime we also need to recognize that AI can be helpful. We actually have fundamental human labor shortages in many situations, especially in elderly care as well as health care. First of all, as modern medicine improves, human life expectancy increases, and that inevitably pushes society toward longer living, and that's a good thing. But in the meantime, we have labor shortages.
[00:40:28] Young people need to work, and that's what keeps our society vibrant, our economy vibrant. But who is taking care of our elderly? Who is taking care of our chronically ill? Even in America's hospitals, we have such attrition of health care workers, especially nurses, that we don't have enough hands, ears, and eyes to help our patients. So instead of thinking about the word "replace," we can think about AI augmenting, and you got a glimpse of that in my surgery room example. Indeed, there are so many spaces in our health care system where we don't have enough pairs of eyes. That's what I call the dark spaces of health care: from the surgery room to the patient room to pharmacies to homes and so on. So how do we make AI help? This is something that Ehsan has been leading a ton of work on, also with Zing.
[00:41:37] We have been looking at this problem of ambient intelligence for healthcare, where we combine smart sensors with machine learning algorithms to glean health-critical insights from these healthcare settings, so that we can alert patients, family members, or doctors in time to help patients. And again, the fuller picture is in a paper we published a couple of years ago. Let me just give you a couple of examples. One example is this hand hygiene project, which actually started way before COVID. Hand hygiene turns out to be really important for keeping hospital infections low. Hospital-acquired infection is actually one of the leading causes of patient fatality in American hospitals; it kills more than three times as many people per year as car accidents nationwide. And it is really hard to control.
[00:42:44] Most of these germs are passed from patient room to patient room, and then they kind of just brew together. So what do we do? Hospitals try to use human auditors, but we just talked about how we don't even have enough nurses, let alone auditors; you cannot hire enough of them. There's human fatigue; we just talked about the human attention problem. So this is a pretty prohibitive solution. There were some technological solutions like RFID: put on a badge, and if the badge, or the person wearing the badge, is close to the sink or the hand sanitizer dispenser, it gives you a hint that the person, most likely a doctor or nurse, is washing their hands. But that's very non-specific. You cannot guarantee it, and hospital rooms are pretty small.
[00:43:40] Corridors are small, and just standing next to something doesn't mean you're doing it. So a few years ago we did this project where we put in smart sensors that protect privacy by gleaning only depth information, like the blue video there. Then we use a computer vision algorithm to classify actions: is the person washing their hands or not? And the result is that if you compare ground truth with the algorithm's output versus human detection results, you can see the algorithm is much better and more consistent than humans. You have to show the same video to almost four humans to get almost as good as the AI, and that is just not feasible. If it's one person, you can see how sparse the detection is, and that's not good. So this is one application. Another application we worked on is ICUs.
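As an aside on the hand hygiene result above: the auditor-versus-algorithm comparison is essentially a question of event-detection coverage. Here is a minimal sketch, with made-up timestamps rather than the study's data, of scoring recall against ground-truth hand-washing events:

```python
def recall(truth, detections, tol=5.0):
    """Fraction of ground-truth events matched by some detection within `tol` seconds."""
    hits = sum(1 for t in truth if any(abs(t - d) <= tol for d in detections))
    return hits / len(truth) if truth else 0.0

truth = [10, 40, 75, 120, 180, 240]   # ground-truth hand-wash events (seconds)
algo  = [11, 39, 77, 118, 182, 238]   # an always-on algorithm: dense coverage
human = [12, 121]                     # a single fatigued auditor: sparse coverage

print(recall(truth, algo))    # 1.0
print(recall(truth, human))   # ~0.33
```

A consistent sensor catches every event, while one human auditor's sparse observations miss most of them, which is the gap the lecture describes.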
[00:44:48] The ICU is where patients fight for their lives. The ICU is also where 1% of US GDP is spent. So making the ICU as effective and as safe as possible is a top priority. The goal of the ICU is to get patients safely out of the ICU and into step-down units, or even home. One of the most important things people have learned in the ICU is to help patients move; proper movement, which we call mobilization, is actually important for recovery. But this is a very dicey situation. You have to get nurses to help. Doctors have to give orders.
[00:45:42] Patients have to move properly, it has to be at designated times, and you have to assess the movement, and all this is not easy, right? So we collaborated with Stanford as well as Utah's Intermountain hospital to put these smart sensors in ICU units and help doctors monitor patient movement; in this particular case, four different kinds of movement: getting out of bed, getting into bed, getting out of a chair, getting into a chair. These things are so important for ICU patients. I know that for us it's a no-brainer, but this really is critical, and you can see that AI can help do the kind of detection and prediction that is so helpful for doctors, especially when there is a labor shortage. The last example is aging in place. And this is just so important for many, many reasons.
[00:46:42] Seniors want to live at home independently and healthily. And remember, at the beginning of COVID, when we had so much fatality among aging seniors, a lot of it had to do with hospital overrun and an overtaxed hospital system. So keeping seniors safe and well in their homes is really critical, and using smart sensors we can help with early detection of infection (especially using thermal cameras), or with mobility (as we just discussed for the ICU), or with understanding sleep patterns or dietary patterns. All of these are realms of possibility with AI and smart sensors. And then, last but not least: what if there's still a labor shortage even after smart sensors? The thing about smart sensors is that they are information-gathering systems; they cannot go over and help turn a patient, or bring water and medicine to the elderly.
[00:47:53] So this brings us to the last technical topic, which is embodied AI; a large part of embodied AI is robotics. This is where I find it extremely exciting, because it closes the loop between perception and action. Think about the Cambrian explosion in evolution: with the onset of eyes, animals started to move. So the area of robotics is where we can close the loop between seeing and doing. But it's not easy, right? Robots, as much as we're very excited by them, are still very, very slow. They are very, very clumsy. It's very hard for them to adapt and generalize to new situations.
[00:48:49] In today's robotics research, we as a field have made a ton of progress, and Stanford is definitely one of the centers of robot learning, but still, most of this work is constrained in its setup: short-horizon tasks like pick-and-place, with anecdotal setups and a lack of standard benchmarks. So let me just share a couple of works from our lab. One work, from a few years ago, looks at how to bring robots into the wild. If we have to pre-designate the set of tasks, it's kind of unsatisfying. On the other hand, if you look at today's LLMs, they're totally in the wild; you can talk about anything. So my student Wong and a few other students wanted to close this gap.
[00:49:52] So the idea is: how do we give an open instruction to a robot, any instruction, without pre-training everything in a closed world, and have the robot do the task? Let's say your training set is "open a drawer" like that; in the wild you have doors like that. So how do you make progress on that problem? The goal is in-the-wild generalization, and here's the overall algorithm. I don't know if this is glitchy, but what we're saying is: we want to tell this robot arm to open a drawer by planning a motion path that avoids knocking down that flower, and none of these instructions were pre-trained.
[00:50:59] So what we do is borrow the latest advances in LLMs as well as visual language models. The idea is that we use an LLM to give us an instruction set, and then we use a visual language model to help us recognize and understand the environment, and then we turn that into a motion planning map that the robotic arm can execute. Because we're using LLMs as well as VLMs, we get rid of the problem of training the robot in a closed world, and bring it to a more generalizable, in-the-wild setting. In detail: the instruction "open top drawer" comes in.
[00:51:51] The LLM turns this into, literally, code, and then, because of terms in these instructions like "drawer" or "handle", we send this information to a VLM, and that model detects the drawer and the handle in the scene. Because of that, it updates its information and then updates a motion map, presented as a heat map showing where the robot arm should focus and where it should not. With that, you then give it another instruction: but watch out for the vase. Again, it goes through the same thing: the LLM generates the code, sends it through the VLM, the VLM detects the object and then updates the motion planning map. In this case it's negative, not positive, because you want to avoid it. And then, combining with the previous map, you get a heat map of where to avoid and where to go.
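The pipeline just described (instruction → LLM-generated code → VLM detections → composed heat maps) can be sketched in miniature. Everything below is a toy stand-in, not the lab's system: `vlm_detect` fakes a detector, and Gaussian bumps on a small grid play the role of the positive (reach) and negative (avoid) value maps that the LLM-generated code would compose:

```python
import numpy as np

GRID = 10  # toy 10x10 top-down workspace

def vlm_detect(obj):
    """Hypothetical VLM stand-in: returns the grid cell where an object is 'detected'."""
    return {"drawer handle": (2, 7), "vase": (5, 5)}[obj]

def value_map(cell, sign, sigma=1.5):
    """Gaussian bump: positive = target to reach, negative = region to avoid."""
    ys, xs = np.mgrid[0:GRID, 0:GRID]
    d2 = (ys - cell[0]) ** 2 + (xs - cell[1]) ** 2
    return sign * np.exp(-d2 / (2 * sigma ** 2))

# Instruction 1: "open the top drawer" -> attract toward the handle
heat = value_map(vlm_detect("drawer handle"), +1.0)
# Instruction 2: "watch out for the vase" -> repel from the vase
heat += value_map(vlm_detect("vase"), -1.0)

# The motion planner would steer toward high-value cells
goal = tuple(int(i) for i in np.unravel_index(np.argmax(heat), heat.shape))
print(goal)  # (2, 7): the handle's cell, with the vase region suppressed
```

A real system would do this in 3D and also compose maps for rotation and gripper velocity, as the lecture mentions next; the point here is only how untrained instructions become a spatial objective.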
[00:52:58] And eventually, we do this for the motion planning map, and we do it for rotation and gripper velocity. And then this is the result. Actually, let me just show you this; this is the actual result on the robot. And then we do this for many different tasks, right? We can do it for articulated object manipulation. Here are many different examples: napkins, sweeping the floor, what is this, getting toast, setting up a table, and also dealing with online disturbances, and so on. So this is one work. Another work I want to quickly show you: overall, robotics research is still lacking good benchmarks.
[00:54:08] While we're still experimenting in the labs, we know the real world is so much more complex, so much more uncertain, has large variability, is so interactive and social, and involves a lot of multitasking. And we know that both natural language processing and computer vision have benefited a lot from large-scale datasets for both training and benchmarking. So in our lab we have been working on a project towards ecological robot learning: building an ecological robot learning environment and encouraging researchers to benchmark against a large and diverse set of activities. That's the BEHAVIOR benchmark: a benchmark for everyday household activities in virtual, interactive, and ecological environments.
[00:55:10] Now here's a question, because this lecture has a lot to do with human values: who is to say which tasks robots should do? I know that every graduate student working on robotics wants just two tasks: one is laundry, the other is the dishwasher. That's great. But moving beyond grad school, what are the tasks we should get robots to do for us? So instead of us coming up with this task list, we actually did a human-centered survey to ask robots, sorry, to ask humans: what would you like robots to help you with? Let me test this. Would you like a robot to help you clean the kitchen floor? Say yes or no. Okay, good. Normal people would say yes. Shoveling snow? Okay. Folding laundry? Okay, good. Cooking breakfast? See, we're getting mixed answers, right? What about opening Christmas gifts? Right. Exactly. People are different.
[00:56:24] I actually think a robot could do this pretty well, but we don't want it. One of the tasks we even asked about was buying wedding rings; can you imagine that? So what we did is, we wanted to respect human preference. We took a bunch of government surveys from labor offices in the US and Europe and so on, cleaned them, and put together thousands of everyday activity tasks, and then we went online to find people. We wanted to be as diverse as possible, though I think we have room to improve. We found 1,400 people to go through these tasks and tell us which ones they want robots to help with, and then we ranked them. And you can see that, just like grad students, people want robots to help with cleaning; a lot of cleaning: toilet cleaning, floor cleaning.
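The ranking step is simple preference aggregation. A toy sketch, with invented responses rather than the actual 1,400-person survey data, of ordering tasks by the fraction of respondents who want a robot's help:

```python
from collections import Counter

# Each respondent lists the tasks they would want a robot to help with (made-up data).
responses = [
    {"clean toilet", "clean floor", "shovel snow"},
    {"clean floor", "fold laundry"},
    {"clean toilet", "clean floor"},
    {"clean floor", "cook breakfast"},
]

votes = Counter(task for r in responses for task in r)
n = len(responses)
ranking = sorted(votes, key=lambda t: votes[t] / n, reverse=True)
print(ranking[0])  # "clean floor": wanted by all 4 respondents
```

Tasks like "buy a wedding ring" would sit at the bottom of such a ranking, which is exactly how the project filtered down to the tasks people actually want automated.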
[00:57:32] But people don't want robots to play squash for them, or to buy a wedding ring, or even to mix baby cereal. There are a lot of tasks that matter to us as humans, emotionally or socially or whatever. So first, we now have a principled way to decide which thousand tasks we want to train robots for: the tasks humans prefer to get help with. And with that in mind, we had to actually build virtual environments. We scanned, or otherwise acquired, 3D scenes from 50 different real-world environments, from restaurants to apartments to grocery stores to offices and so on. And then we acquired (this number is actually outdated) more than 10,000 3D object assets with many properties: articulation, deformability, and so on.
[00:58:45] And then we had to build a simulation environment. A lot of people have built simulation environments; let me just fast forward. Our particular simulation environment was a collaboration with Nvidia's Omniverse group, and we were going for a physically, perceptually, and interactively high-quality simulation environment, especially accounting for physical effects like thermal behavior, transparency, deformability, and so on. We also tested our BEHAVIOR environment against other environments in terms of perceptual realism, via a human user study. Here are some examples of physical interaction, such as cloth or liquids.
[00:59:43] So there's a lot of nuance that has gone into this work; let me just fast forward. These are some benchmarks we ran compared to other work. Okay, let me just fast forward. This is ongoing work in our lab, and because of it we are using BEHAVIOR to help us learn robotics, to push us to gather more interesting data, and even to use it for cognitive studies. Let me just fast forward. One thing I want to share with you, let me just share these numbers, is that today's algorithms still cannot do BEHAVIOR tasks. Of all these rows, the top row is what we wish robots could do: give them no privileged information; they have to be dropped into the environment and do these tasks.
[01:00:51] We benchmarked three BEHAVIOR tasks using today's robotic algorithms, and the performance is just zero. Once you start to give more privileged information, or make assumptions that simplify the task, like magic motion or perfect memory, things start to get better. So if you only look at the top row, you get pretty depressed about today's robots, but as a grad student, I hope you're inspired, because that means we have a lot of room to grow. Okay, these are just different papers from our lab. I'm going to fast forward, because I think we've talked enough about this.
[01:01:48] By the way, we're also building a digital twin of BEHAVIOR, in the digital environment as well as in the real-world environment, and that's a great way of testing real-to-sim transfer. Again, this is an unsolved problem, and there's a long way to go. In this particular case, we're showing you this robot, without speeding up the video (you can see how slow it is), trying to clean up this room. And, okay, hooray. These are some of the mistakes this robot makes: for example, it cannot pick up the bottle, or earlier it just went the wrong way and placed an object in the wrong place. So there are still a lot of mistakes. Okay, let me fast forward.
[01:02:54] We're also using this environment to study visually impaired patients, and it's a great way of putting patients in a safe environment to study. One last thing I want to show you is really super cool, and this is the last technical work I want to show: we are now also collaborating with psychologists and doctors to study how we can use brain waves to control robots. What you're seeing here is a demo where a grad student, I think, is wearing an EEG cap that is sending instructions, and the robotic arm is cooking a Japanese meal purely from thoughts. There are no invasive brain implants; this is from electrical signals. What we had to do is pre-train on these thoughts: you have to pre-train the robotic arm with, say, "lift" or "place" or "drop" or whatever. And once you do that, this is an entire meal cooked based on the brain waves.
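The lecture doesn't detail the decoding method, but one common non-invasive approach, assumed here purely for illustration, is to map a window of EEG features to the nearest pre-trained command pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
COMMANDS = ["lift", "place", "drop"]  # hypothetical pre-trained thought commands

# Calibration phase: per-command mean feature vectors ("centroids")
# that would be learned from labeled EEG windows.
centroids = {c: rng.normal(loc=i, scale=0.1, size=8) for i, c in enumerate(COMMANDS)}

def decode(window):
    """Nearest-centroid decoding of one EEG feature window into a robot command."""
    return min(COMMANDS, key=lambda c: np.linalg.norm(window - centroids[c]))

# A new window resembling the "place" pattern plus small sensor noise
window = centroids["place"] + rng.normal(scale=0.05, size=8)
print(decode(window))  # "place"
```

Each decoded command would then trigger one of the robot arm's pre-trained motion primitives, which matches the lecture's point that both the thoughts and the arm motions had to be trained in advance.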
[01:04:07] This is really sci-fi, and this happened last year. So I'm pretty excited by where all this is going: combining vision and perception and robotics, and also helping people in clinical settings. The future of this is helping severely paralyzed patients. [01:04:29] Okay. So the BEHAVIOR project is really aimed at augmenting people. It's a large-scale and diverse benchmark, and it has realistic and ecological physics and perception. [01:04:48] And the last take-home message is that we not only want to build AI to just do things or see things; we really want to build it to help people. AI being an augmentation tool, an enhancing tool for humanity, is very important, instead of a tool that replaces us.
================================================================================ LECTURE INDEX.md ================================================================================ CS231n – Deep Learning for Computer Vision Playlist: https://youtube.com/playlist?list=PLoROMvodv4rOmsNzYBMe0gJY2XS8AQg16 Total Videos: 18 Transcripts Downloaded: 18 Failed/No Captions: 0 --- Lectures 1. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 1: Introduction - Video: [https://www.youtube.com/watch?v=2fq9wYslV0A](https://www.youtube.com/watch?v=2fq9wYslV0A) - Transcript: [001_2fq9wYslV0A.md](001_2fq9wYslV0A.md) 2. Stanford CS231N | Spring 2025 | Lecture 2: Image Classification with Linear Classifiers - Video: [https://www.youtube.com/watch?v=pdqofxJeBN8](https://www.youtube.com/watch?v=pdqofxJeBN8) - Transcript: [002_pdqofxJeBN8.md](002_pdqofxJeBN8.md) 3. Stanford CS231N | Spring 2025 | Lecture 3: Regularization and Optimization - Video: [https://www.youtube.com/watch?v=dyNGd06MWn4](https://www.youtube.com/watch?v=dyNGd06MWn4) - Transcript: [003_dyNGd06MWn4.md](003_dyNGd06MWn4.md) 4. Stanford CS231N | Spring 2025 | Lecture 4: Neural Networks and Backpropagation - Video: [https://www.youtube.com/watch?v=25zD5qJHYsk](https://www.youtube.com/watch?v=25zD5qJHYsk) - Transcript: [004_25zD5qJHYsk.md](004_25zD5qJHYsk.md) 5. Stanford CS231N | Spring 2025 | Lecture 5: Image Classification with CNNs - Video: [https://www.youtube.com/watch?v=f3g1zGdxptI](https://www.youtube.com/watch?v=f3g1zGdxptI) - Transcript: [005_f3g1zGdxptI.md](005_f3g1zGdxptI.md) 6. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 6: CNN Architectures - Video: [https://www.youtube.com/watch?v=aVJy4O5TOk8](https://www.youtube.com/watch?v=aVJy4O5TOk8) - Transcript: [006_aVJy4O5TOk8.md](006_aVJy4O5TOk8.md) 7. 
Stanford CS231N | Spring 2025 | Lecture 7: Recurrent Neural Networks - Video: [https://www.youtube.com/watch?v=kG2lAPBF7zA](https://www.youtube.com/watch?v=kG2lAPBF7zA) - Transcript: [007_kG2lAPBF7zA.md](007_kG2lAPBF7zA.md) 8. Stanford CS231N | Spring 2025 | Lecture 8: Attention and Transformers - Video: [https://www.youtube.com/watch?v=RQowiOF_FvQ](https://www.youtube.com/watch?v=RQowiOF_FvQ) - Transcript: [008_RQowiOF_FvQ.md](008_RQowiOF_FvQ.md) 9. Stanford CS231N | Spring 2025 | Lecture 9: Object Detection, Image Segmentation, Visualizing - Video: [https://www.youtube.com/watch?v=PTypu6GqEd4](https://www.youtube.com/watch?v=PTypu6GqEd4) - Transcript: [009_PTypu6GqEd4.md](009_PTypu6GqEd4.md) 10. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 10: Video Understanding - Video: [https://www.youtube.com/watch?v=wElqklprhPE](https://www.youtube.com/watch?v=wElqklprhPE) - Transcript: [010_wElqklprhPE.md](010_wElqklprhPE.md) 11. Stanford CS231N | Spring 2025 | Lecture 11: Large Scale Distributed Training - Video: [https://www.youtube.com/watch?v=9MvD-XsowsE](https://www.youtube.com/watch?v=9MvD-XsowsE) - Transcript: [011_9MvD-XsowsE.md](011_9MvD-XsowsE.md) 12. Stanford CS231N | Spring 2025 | Lecture 12: Self-Supervised Learning - Video: [https://www.youtube.com/watch?v=4howBU7THbM](https://www.youtube.com/watch?v=4howBU7THbM) - Transcript: [012_4howBU7THbM.md](012_4howBU7THbM.md) 13. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 13: Generative Models 1 - Video: [https://www.youtube.com/watch?v=zbHXQRUNlH0](https://www.youtube.com/watch?v=zbHXQRUNlH0) - Transcript: [013_zbHXQRUNlH0.md](013_zbHXQRUNlH0.md) 14. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 14: Generative Models 2 - Video: [https://www.youtube.com/watch?v=Edr4uZFh4EE](https://www.youtube.com/watch?v=Edr4uZFh4EE) - Transcript: [014_Edr4uZFh4EE.md](014_Edr4uZFh4EE.md) 15.
Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 15: 3D Vision - Video: [https://www.youtube.com/watch?v=7lxrKDKtykM](https://www.youtube.com/watch?v=7lxrKDKtykM) - Transcript: [015_7lxrKDKtykM.md](015_7lxrKDKtykM.md) 16. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 16: Vision and Language - Video: [https://www.youtube.com/watch?v=mQOK0Mfyrkk](https://www.youtube.com/watch?v=mQOK0Mfyrkk) - Transcript: [016_mQOK0Mfyrkk.md](016_mQOK0Mfyrkk.md) 17. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 17: Robot Learning - Video: [https://www.youtube.com/watch?v=XSfmOH_xVSU](https://www.youtube.com/watch?v=XSfmOH_xVSU) - Transcript: [017_XSfmOH_xVSU.md](017_XSfmOH_xVSU.md) 18. Stanford CS231N Deep Learning for Computer Vision | Spring 2025 | Lecture 18: Human-Centered AI - Video: [https://www.youtube.com/watch?v=g8UaBfj6Sh8](https://www.youtube.com/watch?v=g8UaBfj6Sh8) - Transcript: [018_g8UaBfj6Sh8.md](018_g8UaBfj6Sh8.md)