CS50 2019 – Lecture 5 – Data Structures

[MUSIC PLAYING] DAVID MALAN: All right, this is CS50 And this is our look today at data structures You’ll recall last week that we gave ourselves a few new ingredients and some new syntax in C We introduced pointers, the ability to address chunks of memory by actual addresses– 0, 1, 2, 3 and on up And we introduced the star notation and a few other functions, malloc among them, free among them, so that you can actually manage your computer’s memory Just to hammer this home, let’s take a look at just this small example that from the outset is buggy This code is buggy And we’ll see in just a moment why But let’s just walk through it step by step This first line highlighted in yellow in English is doing what as you understand it now? If you’re a little rusty since last week, what’s that first line of code doing in English? Anything? Yeah AUDIENCE: It’s creating a point to an int named x DAVID MALAN: Perfect It’s creating a pointer to an integer and calling that pointer or that variable x And let me propose that the next line is doing the same thing giving us another pointer to an integer but this time calling it y The third line of code, someone else, what’s happening in English with this line here? Yeah AUDIENCE: It’s creates a memory of size of int and assigns it to x, but it’s more difficult to execute if it’s not a pointer, I don’t know DAVID MALAN: This is not the bug But the first part is correct malloc is the function we introduced last week that allocates memory for you It takes one argument, the number of bytes that you want to allocate Even if you don’t recall how many bytes you need for an integer, you can call this other operator, sizeof, that we saw briefly last week, that will return, in this case, 4 most likely depending on the computer that you’re on So this is saying, hey, computer give me 4 bytes of memory And it returns that chunk of memory to conventionally by way of the first address, so ox something, wherever those 4 bytes happen to star And then it stores that address in x, which is in fact OK, because as you noted initially, x is in fact a pointer That is an address So all this is doing is it’s declaring a variable called x and storing in it ultimately the address of a legitimate chunk of memory You wouldn’t typically allocate an int like this You would allocate an int with just int and semicolon just like in week 1 But now that we have the ability to allocate addresses and allocate memory, you could achieve the same idea here This line here now, the fourth line of code, says what in English? Star x equals 42 semicolon What’s going on there? Yeah AUDIENCE: It goes to the address in the x and then sets it to 42 DAVID MALAN: Good goes to the address in x and sets it to 42 So the star operator is the dereference operator, which is a fancy way of saying go to that address And what do you want to do? Well, per week 1 when we discussed the assignment operator, it just says put the number 42 there So wherever malloc found 4 bytes of available memory for me, this fourth line of code says, go there and put the number 42 by putting the appropriate zeros and ones This last line here– and here’s the bug, if we finally reveal it here– this line is buggy Does anyone see why? Yeah, over here AUDIENCE: You haven’t allocated memory for that variable yet DAVID MALAN: Exactly I haven’t allocated memory for that variable yet And because up here I’ve just said int star y semicolon You can only assume safely that has some garbage value, some unknown value, maybe remnants from some other part of the program, that might not necessarily be true here at the beginning of the program But for safety’s sake assume that if you don’t give a variable a value, who knows what it contains It’s got some bogus address, such that if you say star y go to that bogus address something bad is going to happen And maybe you experienced this already in P Set 4 or prior, some kind of memory problem with your code,

or a segmentation fault or seg fault, bad things happen when you go to addresses that don’t exist or that you don’t even know where they are So this line of code is bad But we can do a little better What if instead I do something like this? I actually assign to y x So that just says put in y the same address that’s in x And then with this last line of code, what if I now say star y equals 13? What is that– you’re nodding your head What am I doing correctly now? AUDIENCE: Now there’s memory allocated for y DAVID MALAN: Good Now, there’s memory allocated for y So you’re saying go to that address and put 13 there However, what did we just need to the 42, just to be clear? We clobbered it We overwrote it with the 13 Because if x and y are the same address, both this says go to that address and put 42 there, but then two lines later, we say, no, no, no, go there and put 13 there instead But long story short, bad things happen when you don’t necessarily anticipate what is in memory and you don’t allocate it yourself, so thanks to one of our friends at Stanford, allow us to take a moment here to hit Play on a short film, a claymation if you will, that paints the same picture in a more memorable way perhaps If we could dim the lights [VIDEO PLAYBACK] NARRATOR: Hey, Binky, wake up It’s time for pointer fun BINKY: What’s that? Learn about pointers? Oh, goody NARRATOR: Well, to get started, I guess we’re going to need a couple pointers BINKY: OK, this code allocates two pointers, which can point to integers NARRATOR: OK, well, I see the two pointers But they don’t seem to be pointing to anything BINKY: That’s right, initially pointers don’t point to anything The things they point to are call pointees And setting them up is a separate step NARRATOR: Oh, right, right I knew that The pointees are separate So how do you allocate a pointee? BINKY: OK, well, this code allocates a new integer pointee And this part sets x to point to it NARRATOR: Hey, that looks better So make it do something BINKY: OK, I’ll dereference the pointer x to store the number 42 into its pointee For this trick, I’ll need my magic wand of y dereferencing NARRATOR: Your magic wand of dereferencing That– that’s great BINKY: This is what the code looks like I’ll just set up the number And– NARRATOR: Hey, look, there it goes So doing a dereference on x, follows the arrow to access its pointee, in this case, the store 42 in there Hey, try using it to store the number 13 through the other pointer, y BINKY: OK I’ll just go over here to y and get the number 13 set up and then take the wand of dereferencing and just– [BUZZER SOUND] oh NARRATOR: Oh, hey, that didn’t work Say, Binky, I don’t think dereferencing y is a good idea, because, you know, setting up the pointee is a separate step And I don’t think we ever did it BINKY: Oh, good point NARRATOR: Yeah, we allocated the pointer y But we never set it to point to a pointee BINKY: Mm, very observant NARRATOR: Hey, you’re looking good there, Binky Can you fix it so that y points to the same pointee as x? BINKY: Sure I’ll use my magic wand of pointer assignment NARRATOR: Is that going to be a problem like before? BINKY: No, this doesn’t touch the pointees It just changes one pointer to point to the same thing as another NARRATOR: Oh, I see Now y points to the same place as x So, wait, now y is fixed It has a pointee So you can try the wand of dereferencing again to send the 13 over BINKY: Uh– OK, here goes NARRATOR: Hey, look at that Now dereferencing works on y And because the pointers are sharing that one pointee, they both see the 13 BINKY: Yeah, sharing, whatever So are we going to switch places now? NARRATOR: Oh, look, we’re out of time BINKY: But– [END PLAYBACK] DAVID MALAN: All right, so now that we do have this power of pointers and addresses where we have low level access to the computer’s memory, we can actually solve problems a lot more powerfully and in a lot more interesting ways But first, let’s motivate some of these problems So back in Week 2, we introduced arrays, which was the first of our data structures, if you will Before then in Week 1, all we had was variables for things like ints and chars and floats and so forth In Week 2, we introduced arrays, which meant you could store two ints altogether or three or 10 or 100 So you can kind of encapsulate lots of data together So unfortunately, though, arrays aren’t quite as powerful as might be ideal So, for instance, if we have an array with size 3 and we actually want to go ahead and store three values in it– one, two, three– suppose that we actually want to now store a fourth value, but we didn’t anticipate that from the get go Recall after all that with arrays you have to declare their size upfront So you’ve got to hard code the number 3 or a variable containing the number 3 But suppose that we want to store the number 4 You might think that, well, just give me another box of memory just to the right of the number 3, so that I can keep all of my numbers together But unfortunately, per last week, that’s not really a reliable assumption, because in the context of the rest of your computer’s memory, that 1, 2, 3, might be here surrounded by other bytes

And per last week those bytes might be mostly filled with other data from some other parts of your program And yet you would think that in seeing that 1, 2, 3 is kind of painted into this corner, so to speak, that there’s just no room for the number 4, and therefore you can’t add the fourth number to your array, is there a solution visibly to this problem nonetheless? Where else could we put it? Yeah AUDIENCE: Move it to off of other memory DAVID MALAN: Say that a little louder AUDIENCE: We can move it off to other memory DAVID MALAN: Yeah, so maybe we can move it off to other memory So there’s a lot of EMMAs in my memory per last week, but there is still, it would seem, based on this picture, some unused memory So maybe we could resize our array, grow it, not by just moving all of the EMMAs because frankly that would seem to take a lot of time if we had to shift all of these characters, why don’t we just relocate the 1, 2, 3 down here, and that gives us an extra space for at least a number 4 So indeed even if you’re using arrays, you can achieve this outcome by actually moving memory around But consider what’s involved in that So if you’ve got our old array at top left, and we’ve got our new array at bottom right, that is of size 4 So we have plenty of room How do we go about resizing the array? Well, it’s kind of an illusion You can’t just resize the array when we have all of these EMMAs surrounding us Instead, we actually have to move the array or copy it So the 1 gets moved to the new memory The 2 gets moved to the new memory The 3 gets moved to the new memory And then at that point, we can just throw away or free the previously used memory and now go ahead and add our 4 Unfortunately, this isn’t necessarily the best strategy, right, because if these three lockers represent our original memory and these four lockers represents our new memory and they’re deliberately far apart, that is to say that if I want to go ahead and move like these same numbers, I really have to do something like this, which involves quite a few steps Let me go ahead and put the 1 in there now Now, let me go ahead and get the 2 here And then I can go ahead and put this in here So now I’ve got the 2 And then lastly, I can go grab the 3 And so even though I did this pretty quickly on the screen, the reality is there’s a decent amount of work to do And then I still, of course, have to go ahead and add the 4 to the mix, which is to say that I’ve taken figuratively and physically quite a few steps in order to resize an array from size 3 to size 4, which is to say if we now consider the efficiency or, if you will, inefficiency of that algorithm, what kind of running time is involved when inserting additional numbers into an array as I’ve done here? Here’s our menu of options from a couple of weeks ago when we focused on algorithms What’s the running time of insertion into an array based even on that simple demonstration would you say? What’s the running time? Yeah AUDIENCE: O n squared DAVID MALAN: Say it again AUDIENCE: O n squared DAVID MALAN: O n squared So maybe O n squared in that there was a lot of back and forth and we’ve seen that before We’ve seen bubble sort and selection sort add up It’s not quite as bad as that It’s not quite as bad as that Yeah AUDIENCE: O of n DAVID MALAN: O of n And why do you say O of n? AUDIENCE: Because for as like as many lockers there are in the first one, you have to increment the same amount of processes to insert them DAVID MALAN: Exactly Whatever number of lockers you have here– so that’s three specifically– but n more generally, it’s going to take me n steps to transfer those numbers over here Or technically, it’s going to take me 3– maybe if I go back and forth, it’s like 6 steps But it’s some multiple of n So it’s not n squared That’s when we kept iterating again and again and again This time I just have to move 3 numbers to here and then add the fourth number So it’s indeed, Big O of n when you want to go ahead and insert or search equivalently an array that’s actually implemented– sorry, insert is going to take us linear time But search recall– and this was the powerful thing– what’s the running time of search so long as you keep your number sorted? Per two weeks ago, that was logarithmic So we haven’t necessarily sacrificed that And that’s the appeal of storing our data in an array that’s sorted You can use binary search However, this is expensive and moving things around isn’t necessarily the ideal approach So let’s just consider what this might look like in code Let me go over to CS50 IDE here And let me go ahead and create a new file called list.c And let’s see if we can’t represent in code exactly this idea So let me go ahead and include for myself standard stdio.h just so that we can print out some values ultimately Let me go ahead then and declare main– int main void And then down here, let’s just arbitrarily start where we did with three integers, called list and size 3 So I’m just mimicking exactly where we started pictorially by having an array that was fixed at size 3 And then if I went ahead and initialized that list, I could just hard code– that is type into the program itself– those three values into bracket 0, 1, and 2 the numbers 1, 2, 3 respectively

So I’m just manually initializing that array to three values And then just so that this program has some purpose in life, let me go ahead and do int i equals 0, i less than 3, i++ And then, let’s just print out these elements just for good measure Each of them is an int So we’ll use %i And then I’m going to go ahead and print out list bracket i So kind of a Week 2 style program, where All I’m doing is hard coding an array of size 3, initializing it with three values, 1, 2, 3; 0 indexed, and then printing them out So if I go ahead and save this and make my list and then go ahead and compile this with ./list, I should see hopefully 1, 2, 3 But there’s a problem with this implementation fundamentally because I have hardcoded– that is typed explicitly– the size of this array, how can I go about adding a fourth element? What would I have to do? Well, I could change the code up here to 4 And then I could add another line here And then I could change this But then I have to recompile it And so it’s certainly not dynamic But we did see a function last week that lets you allocate more memory dynamically And just to be sure what was that function? So malloc Right? Now that we have malloc, you don’t have to type into your program source code from the get go a fixed number You can actually allocate some amount of memory dynamically Now, here just for demonstration’s sake, we’ll do it to achieve the same goal, but in a way that’s going to scale a little more effectively Recall from last week that if you want to get a chunk of memory from malloc, it’s going to return the address of that chunk of memory So that suggests that I can declare a pointer to an integer called list And then let me go ahead and allocate, how about, three integers initially times whatever the size is of an integer So this is a little weird looking, but consider what this is doing malloc is being asked for 3 times the size of an int So give me enough memory to fit three integers By definition that returns a pointer, per last week So we have to assign it to a pointer on the left So list is a variable now, just like x and y from our previous example, that’s storing the address of that chunk of memory But what’s cool about C is that now that you know that list is a chunk of memory, we can actually borrow that same square bracket notation from Week 2 And this code here doesn’t actually need to change If you use square bracket notation next to the name of a pointer, what’s going to happen for you automatically is the computer is going to go to the first byte in that chunk of memory This index is going to go to the next chunk of memory This is going to go to the next chunk of memory, all within the scope of what malloc returned for you And just as an aside, how many bytes are in an integer? AUDIENCE: 4 DAVID MALAN: 4 And recall I briefly mentioned the expression last week pointer arithmetic What you’re also getting sort of magically with this square bracket notation is that bracket 0, it happens to be byte 0 Bracket 1 is not the second byte It’s actually 4 bytes over And bracket 2 is not the third byte It’s actually 8 bytes over, because you allocated 4 plus 4 plus 4, 12 bytes And so this square bracket notation just jumps you to the right place in that chunk of memory, so that you can fit int, int, int Yeah AUDIENCE: Why do you allocate a pointer to an int rather than an pointer to an int array? DAVID MALAN: Why do you allocate a pointer to an int and not a pointer to an int array? In this context, arrays and pointers are in some sense the same A pointer is an address of memory An array is just a chunk of memory And so even though we used chunks of memory in Week 2 by just calling them arrays, they really are just more generally chunks of memory that support square bracket notation But now that we can allocate as much memory as we want, we can kind of use these two concepts interchangeably And there are some subtleties in C. But this now has the same effect as Week 2 And this is the only new line from this week But now if you’re using malloc, even though I’m not going to do it in a more complicated program here, you can imagine now the code running in a loop and maybe allocating more memory and more memory and more memory when you need it, because malloc allows you to do just that And we do need to do a couple of safety checks here It turns out, per last week, that malloc can sometimes run out of memory If you’re Mac or PC or the cloud runs out of memory for your account, well, you might want to check the return value And so good practice would be, well, wait a minute, if list equals equals null, let me just go ahead and return 1, something went wrong, because my computer is out of memory for some reason So best practice would say anytime you allocate memory, always check if you’ve gotten back null Now, let me just do something for the sake of demonstration Let me move my window down here Let me highlight these lines of code and just make the claim that highlighted here between lines 5 and 13 are lines of code that simply allocate a list of size 3 and store three values in it That’s the story where we left off a moment ago Suppose now I change my mind and decide after line 13

or maybe elsewhere in this program if it were larger, you know what, I actually want another integer I want to resize that array Well, how can I go about doing it? Well, let me go ahead and do this Let me go ahead and allocate, for instance, another address and say store at that address a chunk of memory corresponding to four integers using the size of operator as before So temporarily, let me go ahead and give myself a new chunk of memory that is big enough to fit four integers instead of just three Let me practice best practices and say, you know what, just in case, if temp equals equals null because I’m out of memory, forget it, I’m done with the program We’re not going to proceed anyway But that’s just good practice now But now what I want to do? If I now have two chunks of memory, this one is a size 3, this one is of size 4, what did we do the last time we wanted to move something around in the computer’s memory, what did I physically do with the lockers? I think you’re nodding What did I do? AUDIENCE: You went through each and move 1 to the– DAVID MALAN: Yeah, exactly, I went through each one and copied the value from left to right, from old to new And so let me go ahead and do exactly that I think I can do this with a for loop, for int i get 0, i is less than 3, because that’s the size of the old array that size 3, i++ And then in this iteration, I can just do something like this– use this new chunk of memory just like an array, because I claimed I can use my square bracket notation and store location i whatever is in the original list at location i So this code here now, if I were to comment it, copy integers from old array into new array And that too is just using a for loop from old to new But now that’s not quite everything I want to do I also want to store at the location 3, 0 index, which means it’s the fourth location, another value, number 4 That’s why I put the additional number 4 into those lockers So now with these lines of code, I have implemented the physical notion of copying all of the values from the old array into the new array So I’m almost done, except what did we learn last week that you should do whenever you’re done with a chunk of memory? Do I still need the original chunk of memory? AUDIENCE: No DAVID MALAN: No And how do I give it back to the computer? AUDIENCE: Free DAVID MALAN: Free So quite simply, I literally just call free, passing in the address of the chunk of memory that I want to free And even though I’m passing in one address, the computer is going to do the heavy, lifting of remembering how many bytes I asked for originally You don’t have to worry about that You just say, whatever this is pointing at, go ahead and free it all So now, you know what, now that I’ve gotten rid of that list, I’m going to update list equal temp, which is just cleaning up the naming Temp is kind of a stupid name for a list Let me just reuse the original pointer and let list equal temp And now down here if I’ve done everything correctly, it should suffice to print out that whole list So let me save this Let me give myself a bigger terminal window Do make list again A lot of mistakes Let’s see, first one is up here Implicitly declaring library function malloc dot dot dot What generally is the solution when you see implicitly declaring something? AUDIENCE: Header file DAVID MALAN: Header file, which one do I want? Do you recall? This is subtle and we might not have used it last time if I used the CS50 library, but it’s stdlib.h That is where malloc is That is where free is So stdlib.h is one new header file that contains last week’s functions So let me try this again Let me make list Nice, this time it did compile /list and– hm, I seem to be missing that fourth number But I think this is just a stupid mistake on my part What did I do wrong if I look at the printing of this array? AUDIENCE: Size of list DAVID MALAN: Yeah, down here the new list is size 4 So frankly, if you recall a few weeks ago, I encouraged you, don’t just hard code numbers, magic numbers, in your programs We should really be using constants or some other variable This is exactly why, because you, just like me, might overlook a detail like that So let me recompile it And let me do list And, voila, there is my 1, 2, 3, 4 Now, to be clear, this is kind of a stupid program, because I sort of decided halfway through writing this program, wait a minute, I want four integers, not three And, of course, at that point, you should just delete the earlier code you wrote So this is just for demonstration sake If you imagine this being a bigger program, that just over time the human decides maybe because get int is called that they need more memory, this is how you would do it using malloc But it turns out there’s a function that can actually make our lives a little easier here So let me go ahead and clean this up just a little bit It turns out that I don’t have to allocate more memory myself, copy all of these things myself, and then also free it I can consolidate a bunch of these lines as follows Instead of using malloc, I can actually say realloc, which as the name suggests, reallocate a chunk of memory What do you want to reallocate? Well, I want to reallocate the list And this time I want to do 4 times size of an int

I’m going to store this in temp temporarily I’m going to make sure that nothing went wrong, as by checking for null, which just means, hey, you might be out of memory And then I’m going to return if so But down here, if all is well, I’m going to go ahead and do this And watch this I now have simplified my code by quite a few lines realloc, by definition– this is another function in stdlib.h– handles the process of taking an existing chunk of memory that you already asked for, resizes it to be this new size, whether it’s bigger or smaller It handles the copying of the data from old to new to you And you just need to check that nothing went wrong as by checking for null here and then remembering the new value So this code now, which is just six lines of code, previously was more than that And it’s just a handy function to use All right, a question from earlier AUDIENCE: Why can’t we just create this to the temp in the beginning, because if we equate this to the temp, then we equate this to the pointer perhaps, so this to the point to the 4 bytes of memory? DAVID MALAN: Really good question So let me roll this back by rewinding And all of the finished versions are on the course’s website if you want to play with them later This was the previous version using just malloc If you just do this, update a new chunk of memory, as I think you’re asking, what’s happening is you are effectively orphaning the old chunk of memory Because if you change what’s stored in list and have it store the new chunk of memory, where’d the old chunk of memory go? It’s sort of floating there in your computer’s memory But you’ve lost all pointers to it There’s no arrow anymore pointing to it conceptually So that’s why we have to jump through these hoops of having a temporary variable just so that we don’t lose track of things we’ve allocated And we’ll see this later today with another data structure as well Yeah AUDIENCE: Somebody asked this, but I don’t understand that if you initialize temp as a pointer toward integer, then does it not create problems that you use it as an array DAVID MALAN: Good question If you initialize temp as a pointer to an integer, does it not create problems that you’re using it as an array? Short answer, no, because again an array by definition from Week 2 is just a chunk of memory And in C you can use the square bracket notation to jump to random parts of that memory using simple arithmetic, bracket 0, 1, 2, and so forth Last week when we introduced malloc and free and now realloc, you now have a more low level way of allocating as much memory as you want So it’s a more powerful, general purpose mechanism But at the end of the day, you’re still getting back a chunk of memory, contiguous memory, bytes that are back to back to back So you can certainly still use the square bracket notation because essentially an array is a chunk of memory And malloc gives you a chunk of memory, ergo you can treat it as an array They really are equivalent in that sense You just don’t get as many user friendly features as with arrays, like them being freed for you, as we never until last week do we have to free chunks of memory Arrays do that for you automatically thanks to the compiler Yeah AUDIENCE: Do you not have to recreate the list for temp after line 37 DAVID MALAN: Yes Thank you So there is a bug here And if I ran Valgrind on this code, I would see exactly that, that I’m leaking some number of bytes So indeed, at the end of this program, I want to free the– let’s make sure– yep, I want to free now the new chunk of memory, which is a size 4, to avoid exactly the problem you identified Good catch All right , so just for now, even if that’s a lot of code, let’s consider now higher level takeaways from this, just so that we can then motivates an alternative approach that allows us to stitch together somewhat fancier data structures instead So in general, a data structure is just a programming construct in C, C++, Java, Python– we’ll see them in different languages over the remainder of the term– that allow you to store information differently in your computer’s memory In C, everything we’re about to do today is thanks to these three features of C So even though this may feel like a lot of syntax, everything we do today boils down to three pieces of syntax that you have seen before Struct, this recall was a keyword in C that allows you to create your own structure For instance, a couple of weeks ago, we created a person structure, who had a name and a number And that gives us our own data type that is structured to contain two values, like name and number You use structures as well for the problem set involving bitmaps, the bitmap header and so forth Those were data structures as well What do we use the dot notation for just to be clear? And you definitely use this when manipulating red and green and blue pixels recently Yeah AUDIENCE: To access a property of a structure DAVID MALAN: Exactly To access a property of a structure So if you want to get at a person’s name or get at a person’s number, you use the variable’s name and then a dot operator to get inside of that data structure So we’ve seen that before Then last week and again today, we see the star operator, which can kind of mean different things in different contexts But it’s always related here to memory as of last week

This is the dereference operator that allows you to go to a chunk of memory by way of this thing called a pointer So even if today feels like a bit of that fire hose– and again per my email, this is where things now begin to level off– realize that it all boils down to first principles, or if you will sort of three scratch-like puzzle pieces that we’re now going to assemble into more interesting solutions to problems So allow me to introduce something called a linked list A linked list, as we’ll see is going to allow you to store a list of values Now an array allows you to store a list of values But what are some of the downsides with an array? Well, an array is a fixed chunk of memory And if you want to resize that array to add more values to it, what do you have to do? Well, you minimally have to allocate more memory You need to copy all of those values from old to new And then you can go about your business Now, realloc is a function that makes that a little simpler But realloc is doing the exact same legwork that I was doing between the lockers of copying value, freeing memory, and so forth So it needs to be done And that’s why insertion into an array is going to be big O of n, because it might take you that much time to copy the whole array into a new space So that feels kind of suboptimal, right? Arrays can be slow in that sense But what was the appeal of an array? Like what’s good about arrays? Because we don’t want to abandon them entirely Yeah AUDIENCE: You can index into them really easily DAVID MALAN: You can index into them really easily, not only syntactically with the square brackets, but you have constant time access– this is known as random access And it’s not random in the sense that you just end up who knows where You can just jump to bracket 0 or 1 or 2 instantly It took me, the human, more time because I had to physically walk But a computer is going to be able to jump to 0, 1, 2, 3 instantly And so arrays are super fast And they lend themselves to things like binary search as we’ve seen now for some time But what if we use the canvas that is our computer’s memory like a little more cleverly? We don’t have to just plop things next to each other, next to each other, next to each other, and then hope for the best hope that there’s still more memory back to back to back What if we instead are a bit more clever about it? And suppose we want to store the number 1 And that happens to be an address 0x123 It’s arbitrary But recall from last week that every byte of memory in your computer is stored somewhere So let’s propose that 1 is stored at 0x123 Suppose now that this represents an array of size 1 and you want to add a second value to this array Or let’s start calling things more generally a list A list like in the real world is just a list of values This list is of size 1 Now maybe there’s a lot of EMMAs in this memory that are getting in the way But suppose that there is some free space a little lower in your computer’s memory over there So it’s not here It’s not here It’s not available here or here or here There’s other stuff there But suppose the computer does have some available memory over here in which you can store the number two, just because And that address happens to be 456 Finally, you want to store third value And it turns out that the nearest possible location is down here, number 3 That happens to be at address 0x789 So this is not an array by definition, because the 1, the 2, the 3 are not contiguous back to back to back You cannot square bracket notation here because square bracket notation requires, per Week 2, that all of your values be next to each other, just like the lockers here This picture, where 1 is over here, 2 is over here, 3 is over here is more like, oh, maybe this is 0x123 Maybe this is 0x456 Maybe this is 0x789 They’re kind of all over the place And that’s just because that’s what’s available in your computer’s memory But what if I get a little extravagant and I start to use, not just one chunk of memory to store each value, like 1, 2, 3, what if I go ahead and give myself twice as much memory just to give myself some flexibility? So I now conceptual use this chunk of memory to represent one This junk to represent 2, this chunk to represent 3 But you know what I’m going to use the latter half of each of those chunks for? Any thoughts? AUDIENCE: Address to the next DAVID MALAN: An address to the next chunk of memory So, for instance, if my goal is to keep this list sorted, so I want conceptually to have a list that stores 1, 2, 3, why don’t I use this as sort of a map or a breadcrumb, if you will, that points to the next chunk of memory? And why don’t I use this chunk of memory to point at the next one? And then this chunk of memory, you know what, I just need a special value here What would be a good arbitrary way to say, mm, mm, there’s nothing more in the list? AUDIENCE: Null DAVID MALAN: It’s something called null And this technically is different from backslash 0, which is a char This is something called– well, this is in hexadecimal 0 Now, starting today– and we saw the super briefly last week– this is n-u-l-l with two L’s– this was stupid left hand not really talking to right hand– n-u-l is backslash 0, which is a char n-u-l-l is a pointer But they both equal 0 underneath the hood So you just store special value that says that’s it for the list Now, we last week I proposed who really cares where things are in memory? So indeed, let’s do that again Let’s just use pointers drawn as arrows in this artist’s rendition to say this list of numbers, 1 2, 3, is now linked

A linked list is just a data structure containing multiple chunks of memory that are somehow linked together And if underneath the hood, so to speak, they’re just linked together by way of pointers, and the price we pay is that rather than now in a linked list storing just the numbers 1, 2, 3, which we could have in an array, now you have to store twice as much information, 1, 2, 3, as well as three pointers, two of which are in use, the other of which is ready to go if I want to add something to this list So this is to say we can now create structures that look like this in the computer’s memory just by using this new feature of pointers What might these structures look like individually? Well, any one of these numbers has two fields it seems One is an integer We’ll call it number And then there’s another field here That let’s call it next by convention, but we could call it anything we want It’s just another chunk of memory that’s pointing to the next element in the list Well a couple of weeks ago, we introduced persons And a person had a name and a number That’s not relevant today, because we’re not dealing with names and numbers We’re just dealing with integers So let me propose that we back that up and still use the same syntax as a couple of weeks ago But instead of defining a person, let’s call this rectangle a node So this is a term of art in computer science node– n-o-d-e– just represents this rectangular concept, a chunk of memory that you’re using for interesting purposes It’s sort of a node in a graph if familiar from math But what do I want this know to store? Well, let me go ahead and store a couple of things in it One, a number, and that’s just going to be an int And I’m going to go ahead and call it number And then any guesses as to what the second field should be declared as? I want to call it next just because it’s conventional What should its data type be? Any thoughts? Yeah in back AUDIENCE: A pointer DAVID MALAN: A pointer And a pointer to what would you say? AUDIENCE: The next number DAVID MALAN: A pointer to the next number, and not quite the next number per se, because now we have not numbers only, we now have nodes So those three yellow boxes, 1 2, 3, those are now nodes, I would say So you know what? Let’s go ahead and call this node star But you can’t technically quite do this It turns out that C, recall, takes you super literally And notice, if you read this code top to bottom, left to right, at which point in the story does the word node come into existence? Like not until the very last line That’s where we mentioned person This is where we mentioned node So, unfortunately, you can’t use a node inside of a node, because it literally doesn’t exist in the computer’s mind until two lines later So there’s an alternative here There is a solution It’s a little more verbose Instead of just saying typedef struct, you actually say typedef struct node Just add the word that you want to use And now down here, you can say struct node star next It’s kind of like a work around This is the way C works But this is the exact same idea This means that any node in the data structure, any yellow rectangular box has a number and a pointer And that pointer by definition is a pointer to a node structure We just have to express it more verbosely here, because the shorthand notation node doesn’t exist until the bottom It’s just sort of an annoying reality of the syntax All right, any questions on that definition of struct? Yeah AUDIENCE: Do you have to put node twice? DAVID MALAN: Do you have to put node twice? You don’t have to put node twice You can actually use any word here you want You can call this x or y or z But then you’re going to have to make this be struct x or struct y or struct z By convention, I would just reuse the same term So this is the formal name for this structure It is a struct node This is the nickname for the structure, more succinctly, node And that’s what typedef does It gives you an alias from struct node to just node, just because it’s easier to type elsewhere in the program Other questions? All right, so what can we do now with this structure? Well, let’s go ahead and build something up here All right, so this is about as scary as the code gets today We’ll focus primarily on pictures and concepts hereafter But let’s take a tour through one implementation of this same idea of a linked list How would we go about representing a linked list initially? Well, initially the list is empty And if you want to represent something that’s empty, we minimally need something So let me draw it as this empty box And this is just a pointer to a node, I claim So how do I implement the notion of a linked list that has no numbers in it yet? Well, why don’t we just use this, which I can implement as follows node star, and I’m going to call it list, but then set it equal to NULL Right, if there’s no numbers available– there’s no 1, there’s no 2, there’s no three– I should at least have a variable that connotes there is no list And the easiest way to do that is in the absence of a value, store 0, which has this new nickname as of last week and this, called null So this variable here represents this picture here And notice, there’s no numbers, because the list is empty But we do initialize it to NULL so that we don’t think that there’s an arrow pointing to some specific chunk of memory yet Because there isn’t yet Now, suppose I want to go ahead and insert a number into this list

Suppose I want to insert the number 2 I can’t just allocate space for 2 now I have to allocate space for 2 and that pointer, otherwise known together as a node, per the previous slide So how do I go about doing this? Well, in code, I can borrow the same technique that we’ve used a couple times now, even though it’s uglier than some past approaches, malloc then an integer How many bytes do you want? I don’t know how big a node is I could probably do the math and add up the integer and then the pointer But, you know what, size of node is just going to answer that question for me So this returns that chunk of memory that’s big enough to store a node And I’m going to store that just for temporarily in a variable called n, n for node, and that’s going to just be a temporary variable, if you will So, again, even though there’s some new stuff going on here, this is just like before Previously, I wanted to allocate an integer Now, I want more than an integer I want an actual node And malloc returns an address, which means I must assign it to a variable That’s an address on the left hand side All right, what should I always do? Slight spoiler because I clicked ahead a moment ago– actually, we’re going to forge ahead here This is the ugliest thing we’ll see What is this second line of code doing here? What’s going on here do you think? Yeah, what do you think? AUDIENCE: It’s setting the number of that node to 2 DAVID MALAN: It is It’s setting the number of that node to 2 But why this crazy syntax, which we’ve never used before? Well, star n, we did see last week That just means go there The parentheses are just necessary for order of operations so that the compiler knows, OK, first go there And then once you’re there, what do you want to get access to? The number field So use the same dot notation So it’s super ugly But it’s just doing two different things that we’ve seen in isolation Go to the address in n, which is that chunk of memory And then access the number field and set it equal to 2 Fortunately, C has some syntactic sugar, just an easier, prettier way of doing this And it happens to look wonderfully like the actual thing we keep drawing– this arrow notation So if you ever see and you ever write this notation in C– and I’m pretty sure this is the last new syntax we’ll see– this arrow, this very sort of hackish era where you hit a hyphen and then a greater than sign, this means the exact same thing as this This is just annoying to type It’s ugly to look at This is just slightly more pretty And frankly, it’s reminiscent of the pictures we’ve been drawing with the arrows pointing left and right What’s the next thing I want to do? After allocating this new node for the number 2, what do I want to put as well in that node? AUDIENCE: Put in the address DAVID MALAN: Sorry, a little louder AUDIENCE: The next address DAVID MALAN: The address of the next node But there is no next node yet So what value could I use as a placeholder? AUDIENCE: Null DAVID MALAN: Null And so indeed, I’m going to do this arrow notation as well You never need to do the star and then the dots and the parentheses Everyone just writes the code like this in the real world So n arrow next gets null That now gives me that picture we were drawing But, again, sanity check, if you ever use malloc, you should always check the return value So just to be super precise, let me go ahead and add a couple more lines of code that just check if n is not null, go ahead and do the following Conversely, I could check if n is null and then just exit or return depending on where I’m using this code But you don’t want to touch n and use this arrow notation unless you’re sure n is not null So what have I just done? My picture now looks like this But this, of course, is not a linked list, because there’s no linkage going on I really need to do the equivalent of pointing an arrow from this pointer to this structure I need to implement an arrow that looks like this So how can we go about implementing that in code? Well, let me propose that this is what it ultimately looks like We just need to draw that arrow How do I do that? Well, it’s as simple as this If list is a variable, and it’s previously initialized to null– it’s just a place holder– and n is my temporary variable storing the new node, it suffices to say, well, lists should not be null anymore It should literally equal the address of that chunk of memory I just allocated And that’s how we get this picture now inside of the computer Now, let me do a couple of more operations Suppose I want to add to the list the number 4 How do I add the number 4? Well, the number 4 is inside of its own node So I have to go back to code like this I need to allocate another node that installs the number 4 there But that’s not all You don’t want to just create the node, because it’s otherwise out there in no man’s land, so to speak We now need to add the arrow But now it gets a little non-obvious how you update the arrows, right, because I don’t want to update list to point at 4, because that’s going to sort of orphan, so to speak, number 2 And it just kind of float away conceptually I really want to update 2’s pointer to 4 So how can I do that? Well, you know what I can do is I can kind of follow these breadcrumbs If I declare a temporary pointer– and I’ll do it using a little extravagantly last week like this little pointer notation– if I’m a variable called temp, TMP, I can go ahead and point at the same thing that list is pointing at And I’m going to check is this next value null? If it is, I found the end of the list So really I can follow that arrow

Now, I know that I’m at a null pointer So now, I just want to draw this number up here And I accidentally advanced the screen I want to actually draw this arrow up here So how do we go about doing that? Well, the code there might look like this So if all I want to point at a node, as I just did with the big fuzzy hand, I can just initialize this pointer to equal whatever the list itself is pointing at Then, I can do like a while loop And it’s a little weird looking, because I’m using some of my new syntax But this is just asking the question, while the next field I’m pointing at is not NULL, go ahead and follow it So again, this is as complicated as the syntax today will get But this is just saying, whatever I’m pointing at point specifically at the next field While it is not NULL, go ahead and update yourself to point at whatever it is pointing at So if I advance to the next slide here, this is like I’m initially pointing at 2 I see an arrow I’m going to follow that arrow I’m going to follow that arrow So however big the list is I just keep moving my temporary pointer to follow these breadcrumbs until I hit NULL So here in the story let me propose that we add another number, 5 And 5, of course, if we keep it sorted, it’s got to go over here And again, they’re all over my computer’s memory They’re not in a perfectly straight line, because who knows where there’s available space But now that I found this, I want to go ahead and create one more pointer using code very similar to what we just saw But now lastly, let’s do one more here at the beginning and then one more in the middle and see what can go wrong What is worrisome about 1 if we actually want to store this list in sorted order? What might I be mindful of now if the goal is to insert 1 into this linked list? Any thoughts? What do I want to do first? Well, you know what, let me go ahead and just point– you know what, it’s obviously got to go to the start of the list if I want to keep it sorted, so that the arrows eventually go from left to right So let me go ahead and just use code like this to allocate the new node And let me go ahead and just move that arrow like this This is wrong even though we’ve not seen the code for it But why is this wrong? Yeah AUDIENCE: You’re orphaning 2, 4, and 5 DAVID MALAN: I’m orphaning 2, 4, 5 In what sense? I mean literally in my program, the only variables and the variables I have are those you see on the board here So if nothing is pointing at 2 anymore, it doesn’t matter that 2 is pointing at 4 and 4 is pointing at 5, we have orphaned 2 and transitively 4 and 5 So those are just lost That is a memory leak If you recall using Valgrind and getting yelled at by Valgrind because you’re leaking memory, it might be because, yes, you’ve forgotten to free memory Or worse, you might have completely forgotten where the memory is that you were using And by definition of your own code, you can never access that memory again You’ve asked the computer for it, but you’re never able to give it back because you have no variable remembering where it is So we don’t want to do that We instead want to do this probably Let’s point 1 to 2 first, which is kind of redundant, right? Now, we have sort of conflicting beginnings of the list But once 1 is pointing to 2, what can your next update? AUDIENCE: The list DAVID MALAN: List to point at 1 And you can do this in code if you’d like really with just two steps You can update the next field of your new node, which is the one representing 1 that I just allocated, and you can initialize it to point at whatever list is pointing at So if you want this thing to point at the same thing that this thing was pointing at, you literally just say in code n arrow x equals whatever list is pointing at and then you say the list should equal n itself And again, you’ll see in section this week and in the upcoming problem set actual opportunities to apply these kinds of lines of code But those are the kinds of thought processes that you should be mindful of Now, 3 is the only one that’s particularly annoying And we won’t look at the code for this If we actually want to put something in sorted order in the middle of the list, let’s just consider conceptually what’s got to happen We’ve got to allocate memory for the node We then need to update what? We probably don’t want to point 2 at 3 for the exact same reason you identified We then orphan 4 and 5 So what should we update first conceptually? AUDIENCE: 3 to 4 DAVID MALAN: Update 3 to 4, so it is going to look like this And now we can update 2 to 3 And I’m going to wave my hand at the code for this only because there’s multiple steps now You have to probably have some kind of loop that iterates over the existing list, finds the appropriate location using less than or greater than, trying to find the right spot And then you have to manipulate the pointers to do that You won’t need to do something as complicated as that for the upcoming problem set 5 But it is just boiling down to some loops, some inequality checks, and then some updates of the pointers But it’s easier generally to add stuff at the end and even easier to add things at the beginning, especially if you don’t care about maintaining any kind of sorted order Phew Any questions on that? Yeah AUDIENCE: Back to the beginning, like the code you had,

what’s the difference between node with star and like a pointer n of type node? DAVID MALAN: A pointer n of type– let me just scroll back to the code, here? AUDIENCE: Yeah DAVID MALAN: OK, so this is malloc is going to give us a chunk of memory that’s big enough to store node Node star n gives us a pointer that is the address of a node And therefore we’re going to assign the return value of malloc to that variable, so that n effectively represents a chunk of memory that’s big enough to store a node AUDIENCE: So n is node, not a pointer? DAVID MALAN: n is a pointer to a node n is a node star, or a pointer to a node And what does that mean? n is the address of a node And that should make sense, because malloc returns an address But this is why we’re now using arrow notation n is not a node You can’t do n dot number and n dot next You have to do the star thing and then the dot Or more succinctly now, you do an arrow number and arrow next Good question All right, let’s see if we can’t make this a little more real with maybe one demonstration here Let me go ahead and put on the screen the end point that we want to get to, which is that here with everything in order, I think for this we need maybe five volunteers if we could Let me go a little farther in back OK, one over there Maybe two over there Now the hands are– OK, 3 over here if we can go in back up over there OK, 4 being volunteered by your friend And 5 being volunteered by your friends Do you want to come on up? All right, come on up Come on up Brian’s going to help me run this demonstration If all five of you could come on over, come on over here where we have some space All right, let me get you some microphones and introductions OK, thank you Two of them were bravely volunteered by the people sitting next to them So props to both of you You want to say hello and a little something about yourself AUDIENCE: Hello My name [? Siobhana. ?] DAVID MALAN: [? Siobhana. ?] And year or– AUDIENCE: I’m a sophomore DAVID MALAN: Sophomore OK Nice to meet you AUDIENCE: Hi I’m a senior DAVID MALAN: It’s nice to have you too Yeah AUDIENCE: Hi, I’m Athena a sophomore in FOHO DAVID MALAN: Athena AUDIENCE: Hi I’m Anurag I’m a first year at Matthews AUDIENCE: I’m Ethan I’m a first year at Weld DAVID MALAN: OK, and– AUDIENCE: I’m Sarika I’m first year in Thayer DAVID MALAN: Wonderful All right, thank you all for volunteering Let’s go ahead and do this You for the moment represent a heap of memory, if you will So if you could maybe all back up over here just to where we have some available space We’re going to need one of you to represent the list Siobhan was it? AUDIENCE: [? Siobhana. ?] DAVID MALAN: [? Siobhana, ?] come on up. [? Siobhana, ?] do you want to go ahead and represent list And to represent our actual list, we have– or Brian– yeah, we have a name tag, hello, my name is list So you’re going to represent the rectangle on the left that represents the linked list itself And now initially we’re going to go ahead initialize you to null So you can just go ahead and put that behind your back So you’re not pointing at anything But you represent list And there’s nothing in the list, no numbers in the list What was the next step? If the goal at hand is to insert 2, 4, 5, 1, 3, we want to do what first, what lower level operation to get 2 in there? What was the first line of code? AUDIENCE: malloc DAVID MALAN: malloc So we want to malloc a node for 2 So let’s go ahead and malloc OK, come on up So malloc And what’s your name again? AUDIENCE: Ethan DAVID MALAN: Ethan OK And what do we need to give Ethan? Ethan has two values or two fields The first one is the number 2 Thank you The next one is a pointer called next Now, you’re not pointing at anyone else So you’ll put it behind your back And now what do we want to do with? [? Siobhana, ?] what do we have to do? AUDIENCE: Point to– DAVID MALAN: Point to? AUDIENCE: 2 DAVID MALAN: Him, yes, number 2 OK, so this now represents the picture where we have list here, 2 here, but the null pointer as well All right next we wanted to add 4 to the list How do we go ahead and do this? Well, with 4, we’re going to go ahead and malloc malloc, all right And now, Brian has a lovely number 4 for you and a pointer What do we want to do with your pointer? AUDIENCE: Not point it DAVID MALAN: Not point at anything Now, it’s a little more work And I need a temporary variable So I’ll play this role I’m going to go ahead and point at wherever Siobhana is pointing at in sort of unnaturally this way That’s OK We couldn’t get hands that point the other way physically So we’re going to point at the same thing here You’re both pointing at 2 And what am I looking for in order to decide where to put 4? AUDIENCE: If it’s greater than DAVID MALAN: If it’s greater than some value So I’m going to check Well, 4 is greater than 2 So I’m going to keep going And your name was Eric? AUDIENCE: Ethan DAVID MALAN: Ethan, sorry So, Ethan, what are you pointing at? Nothing So that’s an opportunity There’s nothing to his right So let me go ahead and have Ethan point at– what’s you’re name again? AUDIENCE: Athena DAVID MALAN: Athena Also, unnaturally, but that’s fine And so now does Athena need to update her pointer? No, she’s good She represents the end of the list So her pointer can stay behind her back All right, let’s go ahead and malloc 5 You want to be our 5? So now we need a 5 So we need to hand you the number 5

And what’s your name again? AUDIENCE: Sarika DAVID MALAN: Sarika All right, so Sarika’s holding the number 5 She also is going to get a pointer called next What should Sarika be pointing at? AUDIENCE: Nothing DAVID MALAN: Nothing And now how to do I insert her into the right place? Well, I have to do the same thing So I’m going to get involved again and be a temporary variable I’m going to point at the same thing [? Siobhana ?] is pointing out, which is Ethan I’m going to follow this and see, ooh, wait a minute, he’s actually pointing at someone else So I’m going to follow that It’s still number 4 So I want to keep going Oh, wait a minute Athena is not pointing at anyone This is an opportunity now to have Athena point at 5 and voila But are you going to change your pointer yet? No Now things get a little more interesting Could we go ahead and malloc 1? And what’s your name again? AUDIENCE: Emma DAVID MALAN: Emma OK, Emma, we have the number 1 for you from Brian You have a pointer, which should be initialized as well to null And now we have a couple of steps involved here What do we want to do first? What’s your proposal? AUDIENCE: Temporary pointer DAVID MALAN: Temporary pointer So I’m going to point at the same things [? Siobhana ?] is pointing at, which is Ethan here But I see that 2 is greater than 1 So what do I actually want to do? Well, let me incorrectly for a moment– [? Siobhana, ?] could you point at number 1? What have we just done wrong? We’ve orphaned everyone else And even more visibly now, no one is pointing at Ethan or beyond, which means we’ve just leaked that memory, never to be recovered or free So we don’t want to do that Undo, Control-Z What do we want to do instead? What’s your name again? AUDIENCE: Emma DAVID MALAN: Emma What do you want to point at? AUDIENCE: I want to point at [? Siobhana. ?] DAVID MALAN: At that the same thing, [? Siobhana ?] is pointing at, which is equivalent then to Ethan So go ahead and do that with your– OK, sort of like Twister now That’s OK And then [? Siobhana, ?] what I do want to point at? Perfect So again a bunch of steps involved, but it really is just two or three steps depending on which pointers we want to update And then lastly, let’s go head to malloc 3 And your name was again? AUDIENCE: Anurag DAVID MALAN: Anurag So then we have a 3 for you from Brian We have a pointer for you It’s initialized initially to null So you can put that behind your back I’m going to point at the same thing as [? Siobhana. ?] And here we go 1 is smaller 2 is smaller 4 is larger So let’s get this right And who do we want to point at whom first? AUDIENCE: 3 points at 4 DAVID MALAN: 3 should point at 4 So go ahead and do that And you can step a little forward just so it looks a little less awkward And then lastly, big finale, Ethan, who do you want to point at? Number 3 And thankfully, all these steps later, we have a linked list We have wonderful souvenirs for each of you We just need the numbers back Thank you to our volunteers here if we could OK, you can keep that You can put the numbers on the desk here So as these folks head off the stage, let me point out now one of the shortcomings of the approach– thank you all so much– of what we’ve just done here Even though you all had the luxury of seeing 1 and 2 and 3 and 4 and 5 and you could even sort of immediately figure out where things go, even these linked lists are implemented with chunks of memory And just like with arrays, or equivalently lockers, the computer can only look at one piece of memory at a time, which is to say that to a computer, that same linked list kind of looks like this It’s sort of blind to the specific numbers in that linked list until it opens each of those doors of memory So to find where 3 goes or to find where 5 goes or to find where 1 goes, all of those doors maximally might need to be opened one at a time to find that value So with linked lists we have gained a feature We have gained the ability to add dynamically to the list, right I just kept malloc-ing, malloc-ing, malloc-ing additional students and additional numbers So the list can grow as big as we wanted it to And that case, we had five We could have done it 50 times or 500 to add more and more numbers That’s an upside, because we don’t have to waste time inserting into an array by resizing it and moving all of the original contents None of our volunteers had to move, technically speaking, just to insert a number 5 or number 3 They just had to point at someone else in memory or someone else on stage So if your data structure now is a linked list that looks like this, we’ve paid a price for that dynamism We’ve paid a price for the ability to resize our list without moving everything around that’s already there What is a downside that you might perceive of a linked list? What have we lost or given up here? AUDIENCE: We lost random access DAVID MALAN: Sorry Say again AUDIENCE: We lost random access DAVID MALAN: We lost random access That’s spot on We’ve lost random access Why? Because the way you get from the beginning of the list to the end is by following these pointers, following these arrows, these breadcrumbs, you can’t just jump to the middle elements, even though obviously this one here on the screen to all of us humans, that’s the middle element You don’t know that if you’re the computer If the main variable that’s storing this data structure is the pointer, like [? Siobhana, ?] pointing to the first elements of the list, you’re going to have to follow all of these arrows, frankly counting them up and then retroactively realizing, oh, there was 5 I passed the middle one earlier You’ve glossed random access And what algorithm have we used wonderfully in the past

when we do have random access? AUDIENCE: Binary search DAVID MALAN: Binary search So we’ve lost the ability now to do binary search as efficiently as we once were able to And so if we consider now the running time of linked lists, unfortunately, we’ve paid that price Searching now has gone back up to linear We no longer have logarithmic running time because of the fact that we’re stitching together this data structure And the only way to find the end of the list, the middle of the list is to follow all of these arrows You can’t just jump to one location Meanwhile inserting into the list, in the worst case, big O of n is going to be linear as well, because you have to walk through the whole list to actually find a spot for the given number, especially if you’re trying to keep it sorted So it would seem that even though we’ve gained this feature of much more dynamic insertion and we’re building up something more interesting in memory, and you can imagine this just taking much less time overall, because you have to keep moving everything around like we did with realloc, it’s unfortunately something we’re paying a price for But that was a lot Let’s go ahead and take our 5-minute break Fruit awaits outside And we’ll be back All right, so during break, I whipped up one final example of our list program This one uses all of those building blocks And let’s see if we can’t follow along pictorially and code-wise what it is we just built with all of these humans on stage So here is list list3.c It’s available online So you can follow along at home afterward if you’d like And let’s just walk through the lines that are written for us in advance One, I’m using standard I/O for printf And I’m using stdlib for malloc and free, our new friends that give us dynamic memory Here is the definition of a node that again has a number inside of it and a pointer, specifically a pointer to another node structure So that’s what each of our humans represented, this time now in C What is my main program going to do? Just for the sake of demonstration, the goal at hand is just to write a program that initializes a linked list to nothing initially, then adds a node with 1, then adds a node with 2, then add a node with 3 We’ll keep it simple and not add 4 or 5 this time So how am I going to do this? Well, on line 17, I propose that we create a variable called list and have it be the address of a node So if I were to draw this now pictorially, it’s going to be just like our demonstration a bit ago, where I have a rectangle here called list And initially, it’s not pointing to anything So I’m just going to leave the box blank to represent NULL So that’s that line 17 right here Now, let me go ahead and do the following Add a number to the list as follows Line 20 just gives me enough memory for a node And it stores that memory’s address in a variable called n Lines 21 through 24 are just a safety check Did anything go wrong? If so, just return 1 and stop the program We ran out of memory for some reason But these two lines now should look a little more familiar This now is going to go ahead and install 1 and NULL into that structure as follows So let’s recap This line here 20 is the same thing as allocating really a node that looks like this in memory that has two halves One of those fields is called number, which I’ll write there The other field is called next And then if we go back to the code, these two lines are all about just installing values in that structure So if I go ahead to number and put the number 1, I’m not going to bother drawing anything for next, because I’m going to leave it implicitly as NULL So that’s what’s going on now What do I next want to do? Well, the last line of code here under this comment that says add number to list, I set list equal to n where n again is pointing at this new node So that’s the same thing as saying, well, list is going to go ahead and point at that new node So after those lines of code, I’ve created a picture in memory that effectively looks like this Now, let’s go ahead and add the number 2 to the list It’s almost the same So here’s the chunk of code that’s going to go and add a second node to the list, this time containing 2 Let’s do it step by step Line 30, I’m going to reuse n as my temporary variable So I don’t have to re-declare it It’s the same n as before, but it’s now going to get a different address of memory thanks to malloc So that gives me another box like this that I’m going to go ahead and draw like that with nothing in it initially I’m going to make sure per lines 31 to 34 that nothing went wrong But that’s just as before And now in lines 35 and 36, I’m going to put 2 in there and NULL So let me go over there and let me go ahead and put 2 in there And I’m going to leave NULL blank implicitly That’s the end of the list But now I, of course, conceptually have to link the node for 1 to the node for 2 And here’s where C syntax, even though it’s new, kind of finally makes sense Notice here, I’m saying list arrow next equals NULL That maps perfectly to the picture List arrow x equals what? n Well, n is this thing over here So I just draw the arrow there And so the code actually finally lines up even though it’s new for today So now I’ve drawn the picture as follows with 1 and 2 Let’s go ahead and add a third and final node This one containing the number 3, using these lines here So line 40 gives me a new node with malloc So that’s going to give me a new node I’ll draw it as a rectangle over here I’m drawing it left to right, but these things

could be all over the place in memory It doesn’t matter where they end up I’m going to go ahead and check as before that’s it’s not NULL, just to be safe Then I’m going to go ahead and install the number 3 and NULL in there just as before So that means let’s go ahead and draw 3 I’m going to leave that blank because it’s going to be NULL And then the last line, you wouldn’t typically hard code this or write this explicitly in a program This is a bit more verbose than you need to Let me propose that you would probably use some kind of loop instead and walk through the data structure step by step as I proposed earlier But if we really want to do this just for demonstration’s sake, notice, start at list, follow an arrow and go to next Follow another arrow and go to next We can literally do that with our picture So here we go Let me start at list and follow an arrow and go to next Follow an arrow, go to next And now this is NULL So what I want to update is exactly this, as with line 47, which said follow two arrows, look at two next fields interchangeably and then set it equal to n All right, so what remains here? Well, this program’s whole purpose in life was just to print a list out Here’s a way where you can actually use a for loop to iterate over a linked list It’s kind of funky because we don’t have i and ints and i++ and so forth But a for loop doesn’t need to use integers or i’s Remember that before the first semicolon, you have initialization In between the semicolons, you have a condition And then you have an update that happens over here So you’ll get more experience with this with Problem Set 5 ultimately But for today’s purposes, high level, notice that this gives me a temporary pointer, like my big red hand earlier That’s a node star pointer And that’s why I was able to point with the big fuzzy hand And I set that equal to list So whatever the list was pointing at so was my temporary fuzzy hand I’m going to follow the following loop and so long as temp does not equal NULL So earlier when I was wearing the big fuzzy hand, I kept pointing, pointing, pointing And I stopped once it equaled NULL So this is saying keep doing the following until it equals NULL What do I want to do? I want to just print out the integer that’s inside of whatever I’m pointing at inside of it’s number field So go to whatever I’m pointing at, follow the arrow, and go to the number field That’s how we get at the data inside Once I’ve printed that out, for loops say that you just update a variable So what is that variable temp equals temp arrow next So if my fuzzy hand is pointing at someone and I need to update it to point at temp arrow next, that means go to whatever I’m pointing at, follow the arrow There’s the next field and point at whatever the next field was pointing at So you just keep updating what you’re pointing at That prints out the list And then– and we’ll defer this ultimately to Problem Set 5– we will need to free this memory And actually you have to be a little clever about how you free memory, but I’m going to use a while loop there, which turns out to be a little cleaner, a little easier, ultimately to free all of this mess I made in my computer’s memory I kind of need to do the equivalent of freeing things, but I need to free what’s behind me, not what’s in front of me Once you free memory, you should not touch it, traverse it, and so forth But again more on that final note in P Set 5 All right, any questions on a high level on the code? It’s fine if it looks quite new We make it available so that you have a starting point when it comes to using this kind of code yourself Any questions? All right, so someone came up during break and noted that this actually seems to be a regression in that arrays gave us the ability to resize, even though it was a little expensive because you got to copy everything from one place to another But we had random access and therefore binary search and therefore logarithmic running time for things like searching assorted lists We seem to have given that up Linked lists give us dynamism where we and shrink things without wasting time by moving things around But we’ve lost random access But you know what? Now, that we have this ability using pointers and data structures to kind of stitch things together in memory, connect things with arrows, you know what, we can build fancier things Most of you are probably familiar with the idea of a family tree, which is this hierarchical two dimensional structure And indeed, that’s our inspiration here What if we don’t just keep making one dimensional data structures, arrays that go left and right, linked lists that kind of go left to right? What if we actually use a vertical notion too and lay things out more interestingly What can we gain from this? Well, let me propose that anytime we’ve seen an array, we can actually re-implement an array, but get the best of both worlds, the best of arrays, the best of linked lists as follows Here is an array, back from Week 1 or even Week 0 when we were searching behind doors And here, Week 2, when we were searching behind doors, let’s go ahead and note that if we were to do binary search on this looking for some value, as before, many times you look in the middle first And then you decide, do you go left or right? And if you go left or right, you’d look in the middle element over here or the middle element over here And then what do you do? You go left or right, looking at the middle element over here

or over here or over here or over here You know what? Let me just kind of explode this picture because all of this is happening in one dimension We can actually think of this is happening really in two dimensions Let me draw my same array, 1, 2, 3, 4, 5, 6, 7, but let me represent it on different levels that’s indicative of what’s happening I start in the middle And I go left or I go right I then go ahead and look at this element And then I go left or I go right So it’s the same thing, but it’s a two-dimensional rendition of what we’ve been doing for a few weeks whenever we’ve done binary search Well, you know what this kind of looks like? It kind of looks like a linked list, albeit without the arrows But you know what, I don’t think I want to stitch this together from 1 to 2 to 3 to 4 to 5 to 6 to 7, because that’s just going to be a linked list But what if I use my new-found familiarity with pointers, use a few more of them? So I spend more space and stitch this data structure together in two dimensions conceptually Every node represented here is a rectangle It doesn’t have to have just one pointer There’s nothing stopping me from creating a new struct, a new definition of node that has two pointers Maybe it’s called left Maybe it’s called right Previously, we had just one we called it next But there’s nothing stopping us from creating a fancier structure that actually has two And so we might make it look not like this as before for a linked list, but let’s get rid of the next pointer Let’s make a little more room And let’s actually give myself two pointers, left and right And I claim that this structure now in C could be used to implement the tree that I just described, the family-like tree, more properly called a binary search tree, in the following way This is a binary search tree One, because every node in the tree has at most two children, hence the bi in binary, meaning maximally two It has zero children, as like these down here Or it has maximally two children Hence, the bi in binary search tree It’s a search tree in the sense that I have taken care with this data to sort things properly Notice the following definition For any node in the tree, every element to the left is smaller than it And every element to the right is greater than it That’s a recursive definition, because watch, look at this node Everything to the left of it is smaller Everything to the right of it is larger Let’s look at 6 Everything to the left of it is smaller Everything to the right of it is larger So it’s recursive in the sense that no matter what node you look at, no matter what rectangle you look at, what I just said correctly is true of both the left child or subtree and the right child or subtree So this is to say if you have a list of numbers, for instance, or a list of anything and you actually store them using nodes that look like this, but conceptually what you’re really doing is stitching them together two dimensionally like this, guess what feature we just gain back? What have we just improved? I heard some murmuring over here AUDIENCE: Binary search DAVID MALAN: We’ve gotten back binary search So we still have dynamism, like a linked list We’re still using pointers And suppose we want to add the number 0 or the number 8, you could imagine 0 going over here, 8 going over here So we could still just plug them in without having to move everything around like we would for an array But because you’re stitching things together with additional arrows wherever they are in memory, so long as you keep track of this data structure, called a tree, with one pointer to the so-called root– the root being upside down in this world of computer science– this is the root of this binary search tree, guess what you do if you’re looking for the number 7? Well, you see 4 You know it’s greater than 4 So what do you do? You move to the right, thereby ignoring the other half of this tree, just like the other half of the phone book in Week 0 Once you get to 6, you consider, I’m looking for 7 What do I know? It’s got to be to the right And so you go The height of this tree happens to be logarithmic, for those familiar, log base 2 of n, which is to say I have 8 elements or 7 elements in this tree But it only takes me 1, 2, 3 steps to find the value It does not take big O of n, or a linear number of steps And if you want your mind really to be blown here, it turns out this is actually the best application for recursion, which might have felt a little forced previously when we built Mario’s pyramid with recursion where you did factorial or product or sum or something like that in section recursively It turns out that now that we have data structures that exist conceptually in two dimensions that are recursively defined– and by recursively defined, I mean for any given node, left is smaller, right is bigger, and you can make that statement about any node in the tree– watch what we can do in terms of implementing binary search If I have here a function called search, whose purpose in life is to return true or false if the number 50 is in the tree How do you search a tree? Well, it takes as input the tree More specifically, it takes the address of the tree More specifically, it takes the address of the root of the tree That is when you want to search a tree, you literally just hand it

the address of the very first tip top node called the root And from there, you can get everywhere else Just like with the list, we just need the beginning of the list So how do I go about searching a tree? Well, let’s consider the easy case first Suppose the address you’re handed is null, what should you do if you’re looking for 50, but you’re handed the empty address, zeros? AUDIENCE: Return false DAVID MALAN: Probably return false, right If I hand you no tree and I say it’s 50 in here, it’s an easy answer No, there’s no 50, because there’s no tree So that’s our base case, if you recall that nomenclature from our discussion of recursion You hard code You type in manually one explicit case that just gets you out of the program Next case, if 50 is less than the tree, follow the arrow to the number field, then what do know? 50 is less than the node you’re looking at What direction do you want to go conceptually? AUDIENCE: To the left DAVID MALAN: You want to go to the left So this line here searches the tree’s left child, so to speak, in the family tree sense, the left subtree So if we go back to the picture a moment ago, if I’m looking for 50 in that story– or let’s make it more real, if I’m looking for 1 in the current story, I see that 1 is less than the current node So I go ahead and just search the left subtree And notice, this is a tree But so is this if you look at it in isolation And so is this And therein lies the recursive opportunity So again here, if 50 is less than the tree’s number, then go ahead and search the left Else if 50 is greater than the tree’s current number, search the right Else logically what must be the case if the tree exists and it’s not less than and it’s not greater than the number you’re looking at? It must equal the number you’re looking for, 50, in which case we can return true But you recall perhaps from scratch, we don’t really need that explicit case We can just call it else instead Any questions then on this use of code? We won’t actually run this code But this is how you can implement recursively an algorithm that is reminiscent of Week 0 searching for Mike Smith in the phone book, this time now searching a data structure that itself is recursive All right, so what do we gain back in terms of running time, in terms of searching a binary search tree To be clear, what’s an upper bound on the running time? We’re back to log n, which was the goal And what about inserting into a binary search tree? This one we’re going to defer to a higher level CS class, because it turns out you don’t want to just go ahead and put 0 over there, and 8 over there, because if you keep doing that, putting smaller and smaller numbers or bigger and bigger numbers, you could imagine your tree getting very lanky, like very tall over here or maybe very tall over here and therefore not nearly as balanced as the tree we drew And so it turns out there are algorithms that let you keep a binary search tree balanced So even as you add elements to it, you kind of shift things around You don’t remove them in memory You just update the pointer, so that the data structure itself does not get terribly high But that too is log n, which means we had arrays, which gave us binary search capabilities in logarithmic time We then introduced the linked list, which gave us dynamism, the ability to grow and, if we want, shrink But we sacrifice binary search But if we spend a little more space and use not one pointer for every node, but two, we can actually tip the scales again, spend more space and save time by searching the data structure, this time using something logarithmic All right, so what would the ideal, though, be? Every time we talk about running time, it feels like we want to be low on this list and not high n squared was slow Big O of 1 is constant time That’s fast Wouldn’t it be nice throughout this story if we actually found our way to a data structure that gave us constant time? Like, my god, if we could just insert something into a data structure with one step and find something in a data structure with one step, that’s sort of the holy grail, so to speak, because you don’t have to worry about big O of n or big O of log n You just jump immediately to the value you want Well, it turns out, theoretically there’s something that allows you to achieve that called a hash table But how you implement that is not necessarily obvious And it takes some expertise And indeed, in Problem Set 5 among the goals at hand is to implement exactly this notion of a hash table that lets you spell check a document super fast A word processing program would be so slow if every time you wanted to check a word for whether it’s spelled correctly or incorrectly, if you had to search linearly or even longer rhythmically a big dictionary file, it might actually be really slow to spell check a file But using a hash table, we can probably do much better A hash table is a combination of an array and linked lists inside of it So I’m going to go ahead and just for convenience draw my array, this time vertically instead of horizontally But it’s the same thing And it’s just an artist’s rendition anyway And suppose the goal at hand is to keep track efficiently of like a name tags So maybe we’re holding a big event We’ve made some name tags in advance, which we indeed have And we want people to be able to pick up these name tags super efficiently It would be really annoying and pretty dumb if we just made a big stack and name tags, even if it’s alphabetical, A to Z, then had everyone in the room line up

and look through all of the darn name tags looking for their name That’s not a very inefficient system Fortunately, we’ve come prepared with some buckets, all of which are labeled, because wouldn’t it be nice if you’re looking for your name tag, you don’t look through the whole darn list of name tags or stack? You actually just go to your bucket And you jump instantly to your name, where hopefully you’re the only person with a name that starts with some letter And then you can just reach in and get it Well, how do we implement this conceptually? Well, it’s very common with a hash table if the inputs are things like words or names to look at the characters in those words to decide where to put those names or those name tags, if you will So here’s an array of size 26, from 0 to 25 But, you know what, It’s convenient to think of this array as maybe being indexed from A through Z. So still 26 buckets, but this array is really just of size 26, 0 through 25 ultimately And suppose the goal at hand now is to go ahead and store these name tags in advance So this is what the staff and I would do in advance And, Brian, if you wouldn’t mind helping out with this The goal at hand is quite simply to get the name tags ready for students to pick up And so where do I want to go ahead and put the first one? So Albus is the first one whose name tag we made I’m going to go ahead and jump immediately to bucket 0 and put Albus’s name right there in one step Meanwhile I’ve got Zacharias, and so even though it’s taking me a bunch of steps to go over here, if this is an array, I have random access, as a human, and so I can immediately, instantly put Zacharias over there It’s a little laborious for my feet, but a computer could just jump to 0 or 25 or anything in between All right, so Hermione– maybe you’re noticing the pattern– so Hermione is going to be H, or which is 7, which is going to be over here Ginny is 6, which is over here Ron is 17, which is over here So think of each of my multiple steps taking actually one step Fred is going to go over here As an aside, the staff and I discussed this morning how we probably should’ve put the buckets closer together But that’s OK Severus is going to go over here Petunia is going to go over here Draco is way over here, but doesn’t matter, constant time, bracket 3 James is bracket 9 Cedric is bracket 2 Perhaps play this part in 2x speed Luna is bucket 11 Neville bucket 13 Kingsley bucket 10 Kingsley, there we go Minerva bucket 12 Vernon– ironically, we don’t actually need this many names to make the point we’re trying to make But Vernon– we got a little carried away with the names we recognized And now, the list is pretty full All right, so that’s a whole bunch of names I filled up most of the buckets with a name tag But– why am I out of breath? But what’s really convenient now is that if Cedric or Albus or Draco or Fred or Ginny come into the room, they can index instantly, randomly, to their pocket, get their name tag, and go Nothing linear They don’t have to flip through the whole stack of name tags with which I actually began the story But there’s a problem ahead We very deliberately ordered the name tags thus far in such a way that we don’t create a problem for ourselves But among the more famous characters we’ve not heard from yet is Harry So Harry’s name tag is still here Where does this go? Well, Harry is going to go in bucket 7 But wait a minute, there’s already someone there So what do I do? If I were only using an array, Harry’s kind of out of luck Like Hermione is already in that location in the array And we would have to decide, either Hermione goes there or Harry, but we can’t just put them both But if we implement this new data structure called a hash table using an array that’s conceptually vertical, but that horizontally is a linked list, you know what, that’s fine We’re just going to go ahead and link Hermione’s and Harry’s together So, yes, it’s going to take both of them or one of them at least two steps to find their name tag But it’s not going to take big O of n steps to find their name tag, at least if there’s only two in this bucket All right, Hagrid, dammit, so he came in the door too So now that linked list is getting a little longer We now have a chain, if you will, a linked list of size 3 Sirius is going to go over here in bucket 18 But Severus is already there too Awkward Remus is 17 Remus is going to go and link together with Ron there George is going to go into bucket 6, which is over here Lily is also going to collide, so to speak with Luna And this is a collision in computer science Anytime you have a value that you’re trying to put in one place but there’s something there, you need to resolve the collision somehow So I’m proposing that we actually just link these together Or as we’re doing here, to bucketize values in computer science conceptually means to throw the value into a bucket, or physically as we’ve done here Lucius finally is going to go in bucket 11 too And lastly, Lavender goes in that same bucket Phew So thank you to Brian for helping choreograph that So this structure that you’re looking at is what is called a hash table It is an array that you index into using what’s called a hash function A hash function is like any function that we’ve seen thus far, any program we’ve seen thus far–

something that takes input and produces output So if we consider our original picture from Week 0 of what computer science in itself is when it comes to solving problems, hash function for today’s purpose it’s just this function, this process, this algorithm in between that decides, given a name tag, what bucket to put that name tag in And quite obviously in the real world, what algorithm was I using to bucketize a name tag upon reading the name? AUDIENCE: First letters DAVID MALAN: Looking at the first letter Why? It’s simple It’s pretty efficient It means I can store a relatively small array of size 26 and just immediately put the name tags there So in this case, we might have fed in Albus to that hash function And it might return 0, representing A, if we’re 0 indexing the array Or for someone like Zacharias, we might get out 25 just because the first letter of his name is z But this is kind of simplistic, right And we’ve seen a problem What is the problem with just looking, of course, at the users first letter of their name? What problem arose? Yeah AUDIENCE: There might be more than one name with the first letter DAVID MALAN: There might be more than one name with the first letter And you know in the extreme– and computer scientists and software engineers often think about the extreme What is the corner case? What could go wrong? What if by chance there’s just a lot of characters in this universe whose names start with h or l, and maybe all of their names just happened to start with h or l? It doesn’t matter how fancy your hash table is, it’s pretty stupid if all of the name tags are stacked up in a bucket So in that sense, a hash table, even though this feels like it’s pretty efficient, in the worst case, big O of n, when it comes to inserting and searching, because you could just get unlucky and get a huge stack of names that by nature of the class just all start with the same letter So how can we mitigate this? How could we mitigate this? Well, you know what, rather than naively only looking at the first name, let’s leverage some probabilities here Why don’t we look not at just the first letter, but maybe the first two letters? I bet if we look at the first two letters we’re not going to get as many collisions as many people belonging to the same bucket So Hermione, Harry, and Hagrid was a problem we identified earlier, not to mention a few other names But that was because we were looking only at h for the hash function, only at the first letter in their name What if instead we look at the first two, so we have a bucket for HA, HB, HC, HD, HE, HF? And so Hermione now goes in this bucket specifically So we’re going to need more buckets And they’re not pictured on the screen And they’re also not pictured here on stage We need more than 26 buckets Frankly, if we’re looking at two letters, we need 26 times 26, like 676 buckets now So more space, but we’re hopefully going to decrease the probability of collisions Why? Well, the next name I might put in here is Harry He’s going to end up in a different bucket this time That’s great, because it would seem that now I can get access to his name tag in constant time Unfortunately, Hagrid is still in the story And so we’re going to have a collision with HA So even looking at the first two letters is not ideal So even though we have 676 buckets in this story, 26 times 26, which is a lot of buckets, we’re still going to get collisions So what would maybe the next evolution of this idea be? Well, don’t look at the first letter, don’t look at the first two letters Why don’t we look at the first three letters Surely, that’s going to drive down the probability Unfortunately, that’s going to drive up the number of cells in the array and buckets on stage to 10,000 plus buckets this time around So that’s a lot of buckets But suppose we use not HA, but maybe HAA, HAB, HAC, HAD, HAE, HAF, HAG, dot dot dot, HAQ, HAR, HAS, dot dot dot, HEQ, HER, HES So we have a lot of buckets and even more in between not pictured Now we can go ahead and hash on Harry’s name, Hagrid’s name, Hermione’s name And this time, by design, they’re going to end up in different buckets, which seems to be an improvement And indeed, it is, because now if I go searching for Hermione or Hagrid or Harry’s name tag, or they do themselves, they’re going to be able to find it in constant time But that’s assuming there’s not a lot of other kids with the name starting with H And so a hash table still technically is big O of n because you could just get unlucky and have a big pile up of similar inputs, all of which produce the same output, even if you’re using a fancier hash function like this And there’s a trade off too My god, we’re using like almost 20,000 buckets now just to store these names to speed things up At some point, you know, it’s probably cheaper to just let Harry and Hermione and Hagrid form a line and find their name tag more slowly So there’s this trade-off of time and space But if you have what’s called an ideal hash function and you figure out some magical algorithm written in code that ensures uniqueness that no name tag will end up colliding with another, then you can achieve this holy grail of big O of one time, constant time for a hash function So it’s this sort of tension between how much space do you want to spend? How much effort do you want to spend figuring out what that ideal hash function is? So in the real world, and we’ll see this in Python, most computer systems give you a best effort, such

that a hash table is not big O of n usually It’s actually, on average much much, much faster, even though there’s a theoretical risk that it can be slow And more on that too in a higher level CS course where you explore data structures and algorithms more formally So technically speaking, it feels like search could get down to big O of 1, constant time, if every name tag ends up in a unique bucket But you could still get unlucky if there’s a lot of H names or L names or the like So technically speaking, a hash table is big O of n But, frankly, three names in a bucket, like Hermione, Hagrid, and Harry, is much better than n names in the same bucket So even in the real world if you get rid of this asymptotic hand waviness, that’s faster That’s much faster than putting everything in a linked list or an array itself All right, so from there, I bet we can try one other approach here There’s another data structure we want to present to, not in code, but in pictures This one’s called a trie Short for retrieval, even though it’s pronounced differently A trie is a data structure that actually is pretty amazing And it follows this pattern of spending one resource to save on another A trie is going to use a lot more memory, but it is going to give us actual constant time lookup for things like names or words being inserted into the structure So what does it look like? It’s a little strong, because we need to leave room for ourselves on the board with lots of memory A trie is a tree, each of whose nodes is essentially an array So notice the pattern here Computer scientists over time have been kind of clever taking this idea, this idea, mashing them together and creating some monster data structure, but that gives you some savings of time or space So this array at the very top represents the roots of this trie, which again is a tree whose nodes are arrays And notice that the array is of size 26, for the sake of discussion, A through Z, or 0 through 25 A trie does this If you want to store a name in a trie, what you do, in this case, is look at every letter in the word in question So for Harry’ it would be H-a-r-r-y We’re not just looking at the first, the second, and third We’re looking at all of them And what we do is this Suppose the first letter in the person’s name or their name tag or the word more generally is an H. You go ahead and go to that index And if there’s no child node, there’s no tree yet below it, another branch, if you will, you allocate another node And another node just means another array And so we’ve drawn two arrays on the board This now has the letter A highlighted All of the letters are technically there, because it’s of course 0 through 25 But we’re only highlighting the letters we care about for the sake of this example Here is H-a-g So it looks like the first name tag I’m trying to install into this data structure is Hagrid Notice now that g is inside of that array I want to go now to r for Hagrid That gives me another array Now i, now d. d is the end of his name So I’m going to just color in green, or I can use like a Boolean flag in C code that just says someone’s name ends here So notice, I’ve implicitly stored Hagrid name now in this data structure by storing one node, that is one array, for every letter in his name But there’s this slight efficiency here because there’s other people in this story besides Hagrid whose names are prefixes or share common prefixes So, for instance, suppose I want to install Harry into this data structure He is H-a-r-r-y And so that gives me a couple of more nodes And if I go ahead now and install Hermione in this, notice now I have even more nodes in the tree But some of them are shared If you start at the very top and look at the H, notice that both Hagrid and Harry and Hermione at least share at least one node in common Now, what’s cool about this ultimately? So what is the running time of searching for someone in this data structure if there’s n people already in it? Right now n equals 3 because there’s three people in it, even though there’s a lot of nodes But what’s the running time for searching this data structure to see has Harry picked up his name tag already? Has Hermione picked up hers? Has Hagrid picked up his? Well, how many steps does it take to find Harry or Hermione or Hagrid in this data structure? For Harry, it’s H-a-r-r-y So it’s five steps maximally For Hagrid it’s H-a-g-r-i-d It’s six steps maximally And H-e-r-m-i-o-n-e, 8 steps total And it’s probably the case that if we read through the books, there is going to be some upper bound on the length of someone’s name I don’t know what it is It’s probably 20 characters Maybe 30 if it’s crazy long But there is some fixed value Anytime you have a fixed value, that’s what you by definition in CS and in math call a constant If it’s 20, it’s 30, it doesn’t matter But it’s fixed People’s names aren’t growing every year in length There’s some hard upper bound And so technically, if it only takes you five steps or six steps or eight steps to find Harry or Hagrid or Harry or Hermione, that is technically constant time or, as we’ve said, Big O of 1

So we can actually then achieve, truly for searching this data structure, for inserting this data structure, truly what we call big O of k where k is some constant But a constant is the same thing, asymptotically, per our discussion in Week 3, of big O of 1 These are effectively constant time, because to find Harry, you look only at H-a-r-r-y It doesn’t matter if there’s 1 million other characters in that trie already It doesn’t matter if there’s Hermione and Hagrid and everyone else from the seven books in the data structure, because the only nodes you’re looking at are the ones representing H-a-r-r-y And that’s a powerful thing Every other algorithm we’ve discussed thus far, certainly for searching and sorting, has somehow been slowed down by how many other names or numbers are in the data structure That is not the case for this one here However, there is a price being paid What appears to be the price that we’re paying to gain that really low running time? AUDIENCE: Memory DAVID MALAN: Memory I mean, my god, it barely fits on the slide And this is just three names You’re spending 26 amount of the memory to store one character Now there’s some optimizations Over time, if you insert a lot of names, some of these nodes will be shared But this is a very wide, very dense data structure, so to speak, because it’s using so much memory to give you that super amazing running time of theoretically constant time So again this theme of trade-offs is going to persist in the remaining weeks of the semester where to gain one resource, we’re going to have to spend another So that there is a trie So it turns out that now that we have arrays and linked lists and trees and tries and hash tables and yet other data structures out there, we can actually implement what are called abstract data structures, using any of those as building blocks What we’ve kind of done today verbally and pictorially is invent more of those pink puzzle pieces in Scratch, those custom puzzle pieces Now we have as building blocks arrays and linked lists and trees and hash tables that we can use to solve other problems And one of the problems out there in the real world is something called a queue A queue and certainly in certain cultures immediately comes to mind, what’s a queue in the real world or an example thereof? AUDIENCE: A line DAVID MALAN: So a line, right, lining up at a store or a restaurant or a takeout place So a queue actually has a technical meaning and computer science too It’s a data structure that is FIFO, first in, first out A queue, by definition should have people hopefully pleasantly lining up one person in front of the other And it maintains this FIFO property, first in, first out, such that if I’m at the front of the line I am going to be served my food first and then the person behind me and then the person behind them It’d be really obnoxious if you walked up to Tasty Burger, placed your order, and whoever showed up most recently got their food first That would be an opposite data structure That’s LIFO Last in, first out Not fair in the real world So you might hope then that the software that companies like Tasty Burger use when they type in your order to the system actually send those orders to the team working in back cooking the food in a queue fashion, because it’d be pretty obnoxious too if people behind you were getting their food first So hopefully in software, you’re implementing that real world notion of a queue as well Printing, if you still print on campus sometimes, papers and whatnot on printers, they’re often shared printers on campus And so they have what are called printer queues You might go to Command-P or Control-P print, but then hopefully, in fairness, if there’s 10 people who are trying to print to the same Harvard printer, they are printed in the order in which they were requested It would be pretty obnoxious, again, if the order were flipped Well, it turns out with queues in computer science, there’s two fundamental operations, even though we humans don’t really think in these terms, enqueue and dequeue To enqueue means to get in line To dequeue means to get out of line, hopefully once you’ve been served your printouts or your food or whatnot Using today’s principles, arrays, linked lists, you could probably imagine conceptually using them as building blocks to implement this notion of a queue The software that Tasty Burger or any fast food place uses probably has implemented in code some lines that are using an array that’s maybe being dynamically resized or better yet a linked list that’s growing and shrinking as people are placing orders and getting orders So there’s this one-to-one mapping between some of today’s ideas and even the real world as well There’s kind of the opposite data structure that I referred to a moment ago And these are generally known as stacks Stacks in the real world might be in the dining hall right Like here is of trays And they have this fundamentally different properties, such that if the staff go ahead and clean the trays and put them right here, it would be pretty obnoxious if to get your you had to go through a FIFO fashion and get the first they put down and take that out first No one does that, realistically If you’ve got a big stack of trays in the dining hall, you probably enforce a LIFO order, last in, first out So if this was the most recently installed or clean tray, you’re probably, as the human, just going to take the top one, even though that’s not really fair to the below But it doesn’t matter in this particular case So a stack gives you the opposite property And where else might you see these? Well, your Gmail inbox If you use Gmail, your inbox, most likely by default is configured as a stack Where do your most recent emails end up?

AUDIENCE: At the top DAVID MALAN: At the top Now, this is wonderful because it’s a feature in that you always see your newest mail Kind of a downside, though, to your friends who’ve emailed you an hour ago and whose emails are now down here are on page 2 of your email So there’s trade-offs here too Stacks might have desirable properties, like just get your tray quickly, see your most recent email But if you’re like me, as soon as email falls on the page 2, you might never get back to it if the stack of trays never actually gets exhausted Frankly, there might be in some dining hall on campus some way down here that has never been used in years, because they keep refilling the stack, and we keep taking from the top So that would be a bad property for a lot of context, but not necessarily all So it turns out there’s another data structure too– oh, and the operations there are not called enqueue and dequeue By convention they’re called push and pop, where this means pushing an element onto the stack Even if it’s very gentle, that’s pushing Popping means removing the top element So it’s just nomenclature meaning adding and removing elements But there’s one other data structure we’ll give mentioned to today And that’s known as a dictionary And we’ll see this again in a couple of weeks when we look at Python A dictionary is the abstraction that you can get on top of a hash table This hash table literally involved physical buckets and in code would involve arrays and linked lists That’s like low level plumbing A dictionary more generally in computer science is a data structure that has keys and values, words and values, words and page numbers, anything that maps one thing to another Physical dictionaries in the human world, like an English Dictionary, has lots of words And if a word is correctly spelled in your document, it will be in that dictionary And if you have a typo, a misspelling, in your document, it will not be in that dictionary So wouldn’t it be nice if you could actually implement a dictionary using maybe a hash table, but a smart hash table that has plenty of buckets, so that you can answer a question, is this a word, is this a word, super fast without having a whole stack of name tags or, in this case, English words all in the same bucket And, in fact, that’s the challenge for Problem Set 5 We’re going to give you a big text file with 140,000 plus English words And the goal for you is to implement a hash table with your choice of number of buckets, your choice of hash functions, and implement this notion of an array with linked lists that stores those 140,000 plus words Dictionaries, though, do exist in the real world And taken last night at like 9:00 PM before Sweetgreen closed in Harvard Square was this photo If you’ve ever ordered a salad at Sweetgreen, they have a pretty clever optimized system so as to pick up your salad If you order on their app in advance, they go ahead and put your salad under D for David, for instance, or B for Brian and so forth So that when you go into the store, you don’t have to look through big O of n other salads You can jump immediately to the B section, the D section, or any other section and get your salad Now, in the extreme case, maybe Harry and Hermione and Hagrid all order at the same time So there’s just a big stack at the H’s So it’s technically still big O of n But if you assume a nice uniform distribution of names, this probably does work out pretty well, especially if the salads aren’t there by design very long But let’s use our final minutes together to take a look at one final visual, one made by some of our other friends online who put together an animation that tells the story of the differences between stacks and queues personified as follows [VIDEO PLAYBACK] NARRATOR: Once upon a time, there was a guy named Jack When it came to making friends, Jack did not have the knack So Jack went to talk to the most popular guy he knew He went up to Lou and asked, what do I do? Lou saw that his friend was really distressed Well, Lou began, just look how you’re dressed Don’t you have any with clothes with a different look? Yes, said Jack, I sure do Come to my house and I’ll show them to you So they went off to Jack’s And Jack showed Lou the box where he kept all his shirts and his pants and his socks Lou said, I see you have all your clothes in a pile Why don’t you wear some others once in a while? Jack said, well, when I remove and socks, I wash them and put them away in the box Then comes the next morning and up I hop I go to the box and get my off the top Lou quickly realized the problem with Jack He kept clothes, CDs, and books in a stack When he reached for something to read or to wear, he chose the top book or underwear Then when he was done, he would put it right back Back it would go, on top of the stack I know the solution, said a triumphant Lou You need to learn to start using a queue Lou took Jack’s and hung them in a closet And when he had emptied the box, he just tossed it Then he said, now Jack, at the end of day, put your clothes on the left when you put them away Then tomorrow morning, when you see the sun shine, get your clothes from the right, from the end of the line Don’t you see, said Lou, it will be so nice You’ll wear everything once before you wear something twice And with everything in queues in his closet and shelf, Jack started to feel quite sure of himself, all thanks to Lou and his wonderful queue [END PLAYBACK] DAVID MALAN: All right that’s it for CS50 We’ll see you next time