# Sergio Verdu: &quot;Teaching IT&quot; right now my focus you pretty nice parking all right so let’s open it and let’s focus on what I believe has had the most impact which is be asymptotic equipartition property Shannon did nothing in that way but he did come up with the with typical sequences those are the sequences if we if we want to deconstruct this definition we actually two properties which I will a section on they call 81 and 82 so this is like saying okay we’re only going to touch the middle class anyone tells us that the working class there are many many but you know the income is too low so it doesn’t amount to much the total income so you might as well forget about them when we tax them any Peter says well there are too few rich people so let’s let’s not ask the reason they’re like imported so now let’s see what happens if we property by matrix erasing the absolute values well you find if we really straight the the NEP here I like to do this when I teach with three three mass distribution if you do it with two masses with two masses then the students don’t worry okay the full idea because they think that this is the same as having roughly there the right empirical distribution so here you see you see representatives from the two like you sequences the typical sequences you can see that there are some things in do anything typical so these are this is actually I don’t know whether I say but this is the traffic light distribution so red and green are likely much more likely than yellow and then here’s some very unlikely sources all right so let’s see what happens when when we joined and we considered as the LED one so I don’t know my place can we still prove the source for you here ever the answer is you bet you bet resourceful inferior and here’s how we’re going to do it the optimal follows is of course very simple you just take the to take a most likely well relations problem is that that is not so easy to analyze so you have to do it through bounds and decide the key bounds in this problem let’s you ability 1 this is useful proved only using any p1 the one where people didn’t use any p2 is the lower bound but actually you don’t need you don’t need to do that for the covers you can also speak what alright so if you notice here is tau which is annoying be nice if we didn’t have the Tallis then you would have a tightness as long as K is he is moderately large then you can pick a table that is small with respect to K but large enough that this is going to be very small so notice that these bounds hope for every N and they don’t put any assumptions on this source XM okay so this is completely general nothing asymptotic going on all right but you can speak of course you’re interested immorally in the regime of large pen so if you are interested in the regime of flat other ten here there is this almost two theorem and essentially this is this is just kind of like an informal expression for those pants when n gets large but this is very intuitive because it tells us that the minimal error probability is a community for the complementary cumulative distribution function of these random variables which people like to call the ideal code lengths but that’s an a name I dislike in our tell you why later all right so if you are in through a picture you find that that distribution for the sources that you would normally encounter like memoryless sources especially recorded sources is going to converts to function but in general when it stabilizes in steady state you may have shaped like this with a lower limit in an upper limit and this is what test I caught the information spectrum so when you integrate the complementary CDF of this distribution of course you get the entropy alright so those two pin those two limits which we call the super entropy rate and the infant or iterative operational characterizations the soup entropy rate is the minimum achievable compression break for channels we have a sexy name capacity but for sources we come up with also it has the operational characterization of being the big rate required for generating X nothing information during here problem in random number generation the to a problem is where you wanna is extract fair conflicts from that random process and that turns out to be this limit and this is the infantile pillock okay so now going back to adp one that’s that’s the only thing I have used for everything I said so should we forget about a p2 then can we conclude the channel didn’t use the simplest possible approach to proving the fundamental limits of data compression well right he didn’t but but suppose you ask the questions under what condition on the source can a source be encoded at the entropy but not more efficiently in other words under what condition that does the source coding theorem hold or even more under what condition does the strong feeling fear at all in the sense then that’s going to be true regardless of what error probability you allowed well of course we know there Shannon Miller the son of McMillan theorem tells us that a sufficient condition is that the source being especially recorded but say for example non-stationary memoryless source also is going to to satisfy the coding theory so what would be a necessary and sufficient condition well here’s where where Sodom actually is validated the ATP is a necessary and sufficient condition for the validity of the coding theorem or the validity of the strong coding theorem the only condition here is that the entropy of the source is growing linearly with them and also what you can see here is that the enemy 2 is actually equivalent to the enemy so the probability of the two a dream likely sequences vanishing that implies that the probability of the two unlikely sequences also vanishes and this actually are disappearing without the parentheses that you see there around strong these Falls not just for the sources that we are used to with the serial sources but also hosts with very abstract sources where for example you just have one random variable that you want controlled but the distribution of that random variable is parameterized by by a parameter that goes to infinity okay for example a Poisson distribution or the final value of a random walk anything you can think that is parameterize by something that goes infinity you can apply this fear okay so now going back to these bounds that like the major role in the analysis I’m going to make a very trivial comment but it’s important since I didn’t use any structure about X or X M in particular I didn’t use the fact that you know it’s a random variable taking values you know in a Cartesian product we might as well think of it as you know some are they trying countably valued object anything you want so we get rid of the end we can rid of the end because it doesn’t it’s not playing anymore okay so from now on when we have bounced off of this sort and when we are in the non asymptotic regime we might as well think of all this object as abstract objects now very simple generalization once you can do this way is to say well what if I am calling a so I am calling a source which is not which is not the the one for which I have assigned this optimum code well all you need to do is substitute X by Y in in their argument now try to state this result with the conventional notation used in probability theory this I call I call the curse of probability theory some someone I guess in the 19th century decide the probability density functions and probability mass functions will be known by their arguments like their dummy hardness so that’s not a good idea addiction is very important dummy arguments be dummy okay so always always we know your distributions with Salinas’s if if there can be some confusion okay so now what I do here is I say well look this lower bound or less sorry this upper bound it’s yo bility upper bound here that you see is slightly weaker than before but it holds it holds when the compressor does not know the source you may say well okay what’s the big deal just let let the compressor learn the source but why can’t the compressor learn this or we’re not in the asymptotic regime here think of this as just seeing a one-shot of this source okay so how do you get this upper bound which is pleasingly 12 to the to the lower bound very simply you say the D compressor will first of all get rid of any realization of X which is not compliant with with the compressed version second it will it will evaluate the probability of the remaining sequences and then it will just look and all the ones that are more likely than what you see here than this threshold if there are zero or more than one then it will give up say okay if there is only one then it will be clear that what about the compressor what kind of code do we use here well we don’t know X we might as well averaged over all possible codes and then use the useful channel whistle okay so that that’s how you do it out loud and then what happened 12 or what happens is that this opens a very nice door because now if you just can’t for example here sight information with the deep compressor then of course that’s this little problem all you need to do is exactly the same thing I said before but instead of doing a computation of the unconditional probability of X you do a computation of of the conditional probability of x given Y but the proof other than that is completely the same so you see that here one threshold is enough you don’t need to get into this joint typicality idea that if you if you follow the in the usual Orthodox proof which one typicality then you get six thresholds to compare me as opposed to just one oh so so far I’ll be talking about almost lossless fixed length data compression but you cannot buy one of those codes right the real world is always lossless variable length codes so there seems to be a separation here between what we teach and what you can use in the real world of course it is trivial to turn one into another there are many ways to do that but the point I wanna make is that the analysis of fixed length we’ve been doing so far is the key to the fundamental limits of variable length data compression and that may sound like a trivial statement and you realize that when we talk about from the mental limits of variable-length people tend to always gravitate towards fundamental limits of the expected value of length any particular about the expected value of length or prefix codes okay so for example we would say the cosmic code is optimal then we say well that then we have to pay a penalty per symbol so maybe it’s better to utilize many codes well maybe we could use half man codes on the pond on blocks of course it means Lexi psy but that’s probably the best thing you can do us as far as my other length recording well that’s not the best thing you can do as far as particle anything and that’s my that’s my point so optimal zero error data compression compression this is going to be talked about in a very strong sense optimal in the sense that for every K you choose you want to maximize the probability that the variable length is very ugly in other words what you’re doing is getting the upper and middle of all the CVS that you could get before course and there is one code then it’s useless what is the code that it seems that well it’s a code that says look or they’re all the sequences in decreasing probability the most likely is assigned the empty string the second and third are signs near in one four three six and seven are assigned on the labels with two bits and so on so forth now that’s what is usually called a one to one code and that has remained a curiosity with not much practical article because it does not enforce the prefix condition but as far as proving the fundamental limits of optimal viable men’s their compression is fundamental so then the nice thing about this is that now the optimum variable length code to the analysis with a of the optimal fixed length code and the proof is here that’s all you need in order to prove the fundamental of fix them coding is the same as the one for environment quality now the expectation of this optimal viable length of this one-to-one code you can show that it’s strictly this is below the entropy that’s a very simple proof alone arrow next we have a very nice lower bound only for the widest point vertex Franco speaking as a recent to the forest to find the exact value of that expected expected length but more important then then finally what is the the little variation of the expectation from the entropy much more important is to look at the stochastic variation of the variable length and that we that you can do using a normal approximation that’s a very powerful tool and the first one to do this was transferring in 1964 in a paper that has been long forgotten actually there was another paper by Jewish carriage in the Russian literature that did something similar stress and only bidding for from memory sources but the idea here is that caps two parameters the entropy and the variance of the this is going to be give you a very good approximation to this stochastic variation so it’s actually curious than in information theory we have this law of large numbers results the ergodic theory results the ones that give you entropy as a fundamental limit and then when we want to do something finer analysis then we jump to the affiliation results there are exponents what happens you know what is the astronomically small probability that you will need twice the end of the progress but this is kind of like God for God and line of research the normal approximation around the entropy which I think we have to re-examine then you will see more on that later with the decoding part okay so some of the lessons that we have learned lessons for for the teacher are they I like to separate what is the fixed plot the fixed block length analysis from the asymptotic analysis all right the asymptotic analysis is going to involve more large numbers or the channel McMillan theorem does not say that the asymptotic analysis is going to be devoid of information theory in McMillan you do need to use some information theory because the ergodic theories don’t like to you to think of join windows of growing length okay second lesson is the image for fixed block and analysis you might as well think of arbitrary objects in other words introducing memory does not does not increase the complexity of the analysis if anything it makes it simpler because you don’t have any more structure to work with so you just have to work with the with the fundamentals of the problem okay you this brings me to a probe to a point I really feel strongly about you can teach information theory sticking to memoryless problems sources and channels that you really lose a lot by doing so even Shannon actually it’s interesting because when when he lives with the lossless data compression case it doesn’t do it for memory as he doesn’t he immediately does it for market to start so actually some some sound system theories say say you know I I don’t understand the resulting linear systems until I see embedded as a resulting nonlinear systems so you know it’s something like that happens in information is here if you take away all distraction that you can use in a problem and you can do it in general then you only work with the bare minimum the other lesson here is the fixed length coding analysis gives you the fundamental limits of optimal values and according the CDF of optimal value length coil and finally the simplest groups are not reactive ability not via the LEP that the NDP ultimately if you are in a very abstract setting of sources that is a necessary and sufficient condition for the validity of the data compression here okay I guess one of the traditions of this channel lectures I don’t know actually I know things have traditionally happened on the time but you have to tell it all right so you know I have been really struggling with this and this has been my main occupation in the last 12 months you know I I’ve done things like this you know and this wasn’t too helpful and I say you know thought Channel walks into a bar I say what would you like give me a bit of information straighter on the rows the punch lines are open brawler so you know things that I could come up where you know neither x-rated or very offensive to some people so if you are interested to see me after the talk so I thought I would just tell you a true story actually so I’ve been collecting the t-shirts from the is 90s and I hope very much to set the definition for this so one day I was sitting on a bus wearing the t-shirt of the 98 I excit the one that what color is a phony organized and they made a very nice t-shirt with the 50th anniversary boho and was designed by a very famous artist and I was sitting on this path on which this very attractive blonde woman in front of me and she kept staring you know looking at the reflection in the window but so you know when I said no no it’s information you’re a so yeah one thing that you don’t usually find in books it’s again this theme or you know slow down not so fast I smell the roses half the fun is getting there don’t go so fast the asymptotics there are some very nice asymptotic results and you can drive here’s one for example the rate of any encoder decoder that achieves a bit error probability of P P satisfies this bar under what conditions no conditions tell with memory anything you so here’s the mayor to really clear itself and it’s funny because we’ve come full circle in terms of accessibility proofs of the time coding theorem when fax technologies in 53 this was hailed as the first rigorous proof of the coding theorem it didn’t use random coding I think a greedy algorithm to find the code the code words shall endure in 57 came up with the same formula with a variation here in analyzing random coding with the maximum likelihood decoder the Gallagher came in the 60s with a different bound also analyzing the random coding technique with maximum likelihood decoder and not modernly the kind of proof that seems to have to be more prevalent in the joint fatality proof actually goes back to the origin is really a formalization of Shannon’s original idea all right so what I want to show you here is that this actually this bubble is fundamental and we can prove it we can prove it even even in an easier way that we can prove it using is joined to become created so the in order to illustrate this what I have plotted here is information density evaluating the output of the channel any protocol works so as you can see some of the information there should be some degree so if you are a maximum likelihood decoder then you will come here and say okay code word number six is the most likely whatever if your Feinstein and say well I know what I get into analyzing the maximum likelihood decoder because that seems to be quite hard so we have a threshold and then I’m going to choose the first code word that is a lot of attention and then we can recalling what what we did in in the source body problem fix a well if there are zero or more than one above the threshold we will declare an error by doing things exactly what you do is you prove that that bound which before certainly you can prove it in a in a way which is much easier than entering the typical approach where you have here comparison with six different thresholds and of course what happens here we do it in this way if you need different oops for the discrete case and office okay so now let’s proceed in parallel to data compression so let’s get into the asymptotics and what we would like to claim is there again one of these two theorems then the minimal error probability behaves like the best possible CDF of information densities now this minimization here it’s actually non-trivial if if n is large then that’s the optimum distribution here is going to be the one that achieves for possible but if it gets smaller then you really require structure guess like the structure by defining in algebraically so that’s that’s an extra chance you get large block lengths not necessarily very large that’s when the random assignment becomes very efficient okay but of course here to be able to claim that we need a conker’s just like the one that we have for the data compression problem and can I came up with these condors and that’s what enable us to come up with a general formula for example capacity and you can see now the part of the very strong parallel between these two pairs of Congress’s inability result very strongly are going to be assembled all right so you go through the same thing that we had before now we can scoop into information rate soup soup information rate the the lower limit is the capacity this is a general formula for channel capacity the upper limit there’s up to be what we call the resolve ability which is the maximum amount of randomness you need in order to generate an input to the channel such that the output will have almost the same statistics as inducing you were using the true input a master as you would expect of the noisiness of the channel nothing to do with so here if I were to summarize the operational characterizations we know of the maximum which information we have channels capacity we have the minimax compression redundancy reactor in Gallagher we have the identification capacity which may be from the data last the years channel lecture by us with a book there is some ability they I just mentioned and then this one reliability of ergodicity testing this one is something that I guess it’s new so let me explain what I mean by this now let’s suppose did I have to bias points in in my pockets or with different biases and you tell me a probability P the property P is that I choose the one in my right pocket all right so I can operate in two different modes in an ergodic mode which means that we probably P I go to my right pocket then I take that point I put it back then the next time with probability P Oracle operating normally more which means that at the beginning I reach for one my buckets and then I flipped the same point over and over so you’re going to choose B so I’m going to tell you the number to tell you the outcomes of the conflicts and then you need to decide whether I was in and you’re going to speed to make your life as easy as pressure so we’re going to decide okay what like to set threshold on the probability of designing a ergodic even the northern body is true and then we want to minimize under that condition the probability of deciding northern body even then three bodies all right so when you solve that problem when you choose the open B then the answer is that the the reliability that probability that you would not allow the even and when it is true viruses exponentially with a number of conflicts and the exponent is the child Apostle sang capacity of what well a channel whose crossover probabilities are the biases of both storms of course you can you can generalize these just a point okay now we have three things to play with yeah right a great block length and probability Pharaoh so you know the first thing we do is we fix the probability of Pharaoh and we let the block length going to infinity what is the rate and that of course is the excellent capacity or most of the time these constants to the capacity okay so every time we want to do a final analysis what doesn’t information to you do well we say okay instead of doing this let’s turn it around and let’s fix the rain and then see what happens with the probability of error a born X goes to infinity all right so this of course is the reliability function the error exponent which practical sink significance at least at this point in technology is not all bad has not acted the same way as capacity as an inside driver yes so you know the holy grail here would be to say well if I if I’m shooting for a blocking probability of epsilon then let’s do an analysis of the rate that I need when we have six block lengths well that is that would be nice if we knew that because then you know Wendy a lesson taught begin matsuda would be solving the problem in the finder in the finest possible way alright so what is the other the answer to that well this reminds me of a limerick really good place to be alright so this is the last thing that you get this was a poem on Rubik’s cubes alright so true the problem I post is not a problem that these are easy to solve but but one of the nice byproducts of our taking slope approach to proving calling theorems is that we can come up with very very nice stance now these bounds are not the ones that I could show you these very strong arms this current work by unit poliansky PhD students in university I am back safe in school so here I exploded the binary symmetric channel of capacity 1/2 local probably do 10 to the minus 3 and then after lower bound in the block is the one you get through the normal approximation okay so for that you only need two parameters the capacity and the violence of the okay so in 1945 all right so let me let me just say the most inequalities in information theory are actually just boil down the non-negativity of these the convexity of these or the data processing during both either choose I call it divergent some people call it relative entropy which is a very nice name except that Shannon in 1948 says that the relative entropy the entropy divided by the logarithm of the other accessory and also there’s something cool about proving the convergence of the divergence all right so operational characterizations now one of the things I don’t I don’t like to say when I teach is that a PQ is distance measured between P and Q because then nearly the student is thinking of something symmetric that is not symmetric the DQ is the difficulty of impersonating P by Q so you have a sequence in is generating generated under Q you look at its empirical distribution and then you have the probability that he will be in a neighborhood very tiny neighborhood all right so that’s a large deviation result which we usually know as steins lemma although slightly nothing to do with it and to compound the diversity of this most people when they read and six times lemma the reference the wrong paper by turn off so this is the right way to buy children’s s times never now so let me quickly come to what happened sent the paper out for review and then the reviewer said stylus is resolved in an unpublished man so they’re not actually went ahead and reference that and garbage naming which had without actually looking at the memory and then get rivets reduce time but sciencism we can so okay so bottom line don’t trust the reviewers reliability of rejecting human peace so if you look here this is something that is special because these are posterior probabilities of the hypothesis the PFP you are true but this is this is a patient resulted even non basin’s can live with because of Ibn the result does not depend upon the other abilities and actually for or the audience here this is very nice because this these are posterior likelihoods look like this this is for two more colors and rotations are made of okay so that’s that’s like deviation species Moorpark numbers I thought I I don’t know a reference for this but you know it’s a very simple result I’m sure that because we noticed before okay so this one is essentially we saw this before right the lossless data compression also speak when your code when you’re using the optimal code for Q a Centauri excess rate I think this is original all right and then finally we have here a capacity type of operation of characterization tpq is the maximum number of beats per euro you can send through a channel through this channel this binary input channel where if you send one you at the output like a chromatic P if you send zero probability Q 1 plus Q 1 0 0 of course you see dual capacity by the way they say divergence is the matter of all information measures because the special place of it is mutual information internally is a special case of projects but mutual information is also the mother of all information matrix because if you know mutual information you can actually write divergence as a special case the hint is somewhere in this is lat all right so let’s so a lot of people it’s a series of course are not so much into calling theorems but I’m going to maximizing nice information given a conditional distribution you wanna solve this problem given an unconditional distribution we wanna solve his boat now the key thing here to realize is they don’t be fooled by this Max and Miss mean with respect to X and inspector py given X actually what what you really want to do is this you want to do is maximize over the other students and maximize over the conditional effects even one of course do this in a legal way so that gives distributions are consistent this of course this observation is I think is well understood for the rate distortion problem maybe not as much for the for the channel provenance so essentially the simplest way to do this and I know you’re going to say you call this simple you know it’s so so nice to write in a mutual information as different well actually once you get the optimum you sticking here the optimum output distribution then this becomes independent of P X so you could even find the capacity which are actually knowing the optimum even distances the same thing for minimizing mutual information or the the most elegant proof is based on this inequality at the end this expectation depend on the conditional distribution of the reproduction earrings okay so I’ll ask for joining doctrines and I’ll just move into the last part of the talk I’ve always been a close formula free you know I you know most of my papers I really look for beautiful formulas so if if someone had told me that I would spend most of my signing channel lecture talking about bounce I can I would have felt really terrible about that but so I don’t want to leave you without some some clothes from formulas now this one is the lossy compression of Hogwarts on process so here you wanna wanna just assume that in here is very familiar one where lambda is the rate of the question process and P is the distortion now what is the distortion measure well where people get looked into this problem before but the key here is is finding the right distortion measure there are two distortion measures under which you can prove this formula one of them is the maximum for resignation of inter arrival times another one is viewing this chain as a queue and then looking at the service rate of that cube as the measure of approximation okay now you know I think this one of the neutral information let me do this this game okay this is the subtle point guess what are they to the center PA in queue well you’re gonna say haha I notice you have here variance constrained random variables maybe you forgot the one half or you were thinking of complex valued random pirates did I tell you that PN is actually independent of X bar I didn’t say that so you cannot say that this is the answer if I didn’t say that in it’s independent of X bar or at least I gave your son some conditions along those lines alright so here’s the right answer the right answer if you indeed don’t put any conditions on n being independent of X bar is in the in the domain of positive values on the bible’s where you constrain the mean of X n n in the saddle point is an exponential distribution and X is a mixed distribution okay so this is the this is the last slide this is the simplest channel less than three edges you don’t get anything interesting so what is the capacity of this channel as a function of delta okay so this is a problem alright so you know I guess leave you with one thought and the thought is there no matter how simple it looks chances are there is even simpler so the search for elegance in simplicity in formation theory 14 years and the spirit of a mathematical theory of communication continues to be our guy we already on fear but he encouraged emulation without trans-alaska my occupation so please okay yeah so yeah the question is at some point I said idea called inside online well because you kind of beat the NDL right and I can be given an average in the sense of this optimum variable length codes their expectation is not about the entropy section you know so another quick person yes well you see your joke to professor’s making the homework and one of them says you know I’m writing a book let’s stop it either okay we’re not done yet we’re not done yet you know we are already way behind schedule so let’s let’s do something you