Week 7, continued

[Week 7, Continued] [David J. Malan, Harvard University] [This is CS50.] [CS50.TV] All right. Welcome back. This is CS50, and this is the end of week 7. So, one of these stupid little things that goes around the Internet, and we slurped up, and it should now make a little bit of geeky sense to you. Well, it was funnier to this guy than it was to you guys. Speaking of, well, guys, today is Nate’s birthday. To give you a sense of just how good Nate and I are at web development, based on Monday’s class and based now on this, I thought I’d pull up Nate’s home page, if you haven’t seen it yet. This here is Nate’s HTML. So see his source code if you’d like to see how to do this, and Nate, if we could embarrass you just briefly, the staff got you a little something if you’d like to share some dessert with some of the kids in the class here, if you’d like to come on down. You all applaud and are very nice, but no one is sitting anywhere near Nate, for some reason, in that back zone. So perhaps you can find some folks to enjoy these with. Happy birthday, Nate. Additional hellos: we showed a couple clips from our CS50x students. If you would like to see who else it is in the world that’s following along, you can head to this URL, where Joseph, one of our TFs, has put together a montage of sorts of everyone who has been submitting these videos, among them Rick Astley. And if you scroll through these, it’s really quite inspiring to see the diversity of countries and cities from which people are hailing. So if you’d like to take a look at that, that will be up through the end of the semester. Today we continue our look at the Web, web programming, HTML and the like, and we also have lunch coming up this Friday if you would like, and particularly if you have not done so before. This Friday’s theme will be Nate’s birthday, so if you would like to have birthday lunch with Nate and others, some of our friends from industry, please head to that URL there. Space, as always, is limited.
Also, if you’ve forgotten, realize that next week is the deadline for problem set 4’s scavenger hunt, whereby after recovering all of those JPEGS from card.raw, you and your section mates, if you would like, can try photographing as many of the computer scientists from that memory card as possible, and you and your section will then win a fabulous prize Refer back to pset 4’s specification as to what to submit and by when Also, if you would like to have your handiwork immortalized on the course’s website and its history of apparel, know that you are welcome now to start submitting designs for this year’s T-shirts and sweatshirts and the like We’ll do our best to include as many as we can, but we’ll have some members of the staff review all of the designs to make sure they’re consistent with the specifications, and we then pick generally a handful of them to be exhibited So if you are the design type, just know that the requirements for graphics are PNG, at least 200 DPI, they shouldn’t be more than 4000 x 4000 pixels, and no more than 10 MB, but you’re welcome to use things like Photoshop or GIMP or various graphic’s programs, whatever you have at your disposal Also on the horizon is the final project. 
The final project really is the climax of 50, whereby of all the assignments in the course, it’s your opportunity really to do your own thing And that can be simply to do something for fun, it can be to solve some pressing problem your student group has, for some new website, some new collection mechanism for data It can be a mobile application for Android, for iOS Really, the sky is the limit, and over the next few weeks, as we transition from C to these higher-level languages like PHP and JavaScript, you’ll find yourself increasingly familiarized with some real-world techniques, some real-world tools, and to supplement that, know that the course has a history of seminars, whereby over the next several weeks, some of the teaching staff and friends of ours from on campus will offer optional seminars which go above and beyond what’s typically done in section to introduce you to things like Android programming, to introduce you to things like iOS programming or more advanced web-development techniques There’s a whole history of these already online If you go to cs50.net/seminars, we’ve been doing this for quite some years, and you’ll see that archived here with PDFs and videos and the like are several dozen videos of seminars Last year, for instance, we had a seminar on acing your technical interviews, if you’re actually looking to go off and do an internship or full-time gig Windows mobile development, Android development, Google Maps, API, CSS, developing for the BlackBerry, Emacs Really, you are welcome to take a look at any of these seminars at your convenience And we’ll be holding some new ones this semester, as well So what is ahead with the final project? Well, first, even though this date is somewhat imminent,

this is really just an opportunity to start thinking about the final project quite realistically We know only the beginnings of some of what we’ll still be covering in the course, HTML, PHP and the like, but you’re all familiar with the Web, and I bias this conversation toward the Web only because most people end up doing Web-based final projects, but that is by no means requisite Using C is fine, objective C, Java, any other language you might know or want to know is quite fine But to get the juices flowing initially, we’ll expect the submission of a preproposal which, per the PDF on the website, which is now at cs50.net, and at the top left you’ll see final project is the specification for the final project, and in there are details on the preproposal and the like It pretty much boils down to an email to your teaching fellow just to strike up a conversation with him or her about what you’re thinking On projects.cs50.net is a repository of ideas from folks on campus if you’re struggling to come up with some idea, and manual.cs50.net/APIs is a repository of links to APIs What, though, is an API? What’s an API? I’ve said it at least twice, according to the transcripts of the past several weeks What’s that? [Student, unintelligible] >>Okay, good. 
So something programming interface. Application programming interface, and this can take several forms, but what this really boils down to is code that someone else has written or data that someone else has collected that is made available to you in some programmatic way. You can write code in C, PHP, Python, Ruby, whatever your language of choice typically is, and you can somehow build upon someone else’s functionality or someone else’s data set. For instance, if I go to this link here, you’ll see a pair of links on the subsequent page whereby we have CS50’s own APIs, which are very Harvard-centric, and then third-party APIs. Among the third-party APIs are really useful things like being able to send SMS text messages to people, being able to receive SMS text messages from people, and things like that that you might have no idea how to implement yourself, but thanks to services, some free and some commercial, you can build atop those and do something of interest to you. Among CS50’s APIs are these campus-centric things like Harvard courses, energy, events, food, maps, news, tweets and Shuttleboy’s own, and these are APIs that look a little something like this. Let me pull up the HarvardFood API. If you’ve ever been to HUDS’s website, you’ve probably been there to just see what’s for dinner or to see what the hours are for some d-hall. Well, it’s not particularly easy to navigate, and so what we did some time ago was we wrote software, it happens to be in PHP, that actually screen scrapes the entirety of HUDS’s website. To screen scrape something means to write a program in a language like PHP that pretends to be a browser, even though you might run it at a command prompt, connects to a website, downloads its HTML, the language in which it’s written, and then reads it, or more specifically, parses it top to bottom, left to right. And what we did was we wrote our code in such a way that any time we saw something in that HTML that looked like something on the menu, like hamburger, we would then import that into our own database. And any time we saw nutritional content, we would import that into our own database.
And what we did was leverage the fact that HUDS’s website, even though it might be a bit of a challenge for us humans to navigate, underneath the hood all of the HTML is generated by their own computer programs. So all of their HTML, even though it might look messy, like most websites underneath the hood, it follows a pattern. So we just spent a couple hours figuring out that pattern so that in the end, we throw away all of the messy HTML, all of the aesthetics of boldfacing and italics and the like, and what we are then able to do is expose that same data, for instance, in this way. So we, according to the documentation here, have informed the world that if you request a URL that looks like this, food.cs50.net/ something, and you provide certain parameters, which we’ll talk about today, like end-date time, start-date time, meal and so forth, what our servers will return to you, for instance, is a CSV file, comma-separated values, like an Excel file, containing everything for breakfast on this particular date in March of last year when I happened to write up this documentation. For those familiar, CSV is not the only file format. There’s another format that’s all the more versatile called JSON, JavaScript Object Notation. The data can come back in that format. So the takeaway here is that whether you dive into this API or any other of CS50’s or anything out there on the Internet, or not at all, realize that the world has increasingly started to standardize how machines intercommunicate. We use standard data formats like CSV or JSON. And what this means for you is you can write the interesting part of a program that lets your user search a dining-hall menu, that lets them create lists of favorites, that lets them get text alerts

when their favorite meal is about to be served in some d-hall by using someone else’s data sets and building on top of their APIs So more on that in the form of seminars and the documentation that you have here online So those, then, are APIs That brings us back to HTML. Quick recap What is HTML? [Student, unintelligible] >>Good. HyperText Markup Language Someone else, what is Hypertext Markup Language? HyperText Markup Language Okay. So HTML, HyperText HyperText just refers to the Web, for the most part Markup means that it’s not actually a programming language, HTML It’s not a language that you can express logic in It doesn’t have loops. It doesn’t have conditions It doesn’t have functions, per se Rather, it has these things called tags, or more properly, elements And those elements have start tags and end tags, or open tags and closed tags, and what those tags generally mean for a browser is, start doing something and then stop doing something, though there are exceptions to that Sometimes it’s just ‘put a line break here,’ for instance And we saw examples of that the other day, between bold facing, line breaks, and then a couple of other tags So HTML is the language in which web pages are written So if I go to something like Google.com and pull up just their home page, recall that if you right click or control click and look at view page source, typically it’s a complete mess these days underneath the hood, but that’s because computers don’t care about white space, so this doesn’t have to look pretty But if we zoom in on parts of it, notice that Chrome, just to be nice, has color coded things Indeed, this is the very first tag that we saw in a web page And again, HTML 5, the latest version of this language, does have this thing at the beginning,

desktop, downloads, dropbox and so forth, but now we start turning our attention to a couple. On many Linux web servers there’s this folder called public_html, but we’re going to skip that one for now and focus on this, vhosts. Anyone know what a vhost is? Just stupid jargon for virtual host, and what this means is that on a typical server you can actually host multiple websites. You can buy a domain name like foo.com, and you can host it on a server. But you can also buy bar.com and host it on the same server. The reason being, browsers are smart enough to inform the server, when a user is requesting some web page, what domain name the user wants the home page for. So what’s nice about this is you don’t need one physical server or one CS50 appliance for every website you might want to create. You can use the same server and develop a hundred different websites. And indeed, if you are a person trying to start a website, whether for fun or for business, typically you’ll go out on the Internet, and you’ll pay someone ten bucks a month, a hundred dollars a month to host your website for you. And the way that works is they are charging other people ten bucks a month or a hundred bucks a month to host other people’s websites on their same server. The reason they can do that is because of this feature called vhosts, but more on that when it comes time for final projects. For now, let’s just dive in there.
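To make the virtual-host idea concrete, here is a minimal Python sketch of how a server might pick a site's folder from the Host header that the browser sends along with each request. The domain names foo.com and bar.com come from the example above; the folder paths and the function name docroot_for are assumptions made up for illustration, and this is not how Apache itself is implemented.

```python
# Hypothetical mapping from a domain name to the folder that holds its files.
VHOSTS = {
    "foo.com": "vhosts/foo.com/html",
    "bar.com": "vhosts/bar.com/html",
    "localhost": "vhosts/localhost/html",
}


def docroot_for(request: str, default: str = "localhost") -> str:
    """Choose a document root based on the Host header of a raw HTTP request."""
    host = default
    for line in request.split("\r\n")[1:]:  # skip the request line itself
        if not line:  # a blank line marks the end of the headers
            break
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            # Drop any ":port" suffix and normalize the case.
            host = value.strip().split(":")[0].lower()
    # Unknown domains fall back to the default site.
    return VHOSTS.get(host, VHOSTS[default])


print(docroot_for("GET / HTTP/1.1\r\nHost: foo.com\r\n\r\n"))
print(docroot_for("GET / HTTP/1.1\r\nHost: bar.com\r\n\r\n"))
```

Both requests ask for the same path, /, yet each is answered from a different folder, which is what lets one machine serve many websites.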
So cd vhosts, and if I type ls now, notice that there’s a folder in there called local host That’s because, by default, the appliance figures you’re ever going to run one website on an appliance This isn’t really the real world; it’s not a real-world web server So let me go into local host, and now we’ll see in there one last directory called HTML So it’s a little deep, the hierarchy, but if and when you decide to start developing multiple websites over the next n months or years, this kind of folder structure tends to be helpful Now let’s go into HTML as I just did, type ls, and nothing is there So now let’s go ahead and do this. Let me open up Chrome inside of the appliance, and let me go to http://localhost So literally the name for my appliance, enter, and I get index of / This isn’t really showing me anything of interest, but it turns out that what we’re seeing is that folder, HTML There’s nothing inside that folder right now, so instead, what I’m going to have to do is first create a file Create an HTML file like we did on Monday, but this time put it inside of the appliance For those of you who are trying to follow along with laptops now, let me do one aside that’ll be covered in the web-based pset, but in order to get this to work for the very first time, you’re going to have to run this command: sudo service httpd start And this, again, will be repeated in the last pset, but if you’re playing along at home now, the web server is turned off in the appliance, and that’s so that it doesn’t sap up RAM and memory for 7 weeks out of the semester when we don’t need it So you need to run this command once, and you’ll get an output like that Then you should be able to play along here Now let’s go back into this folder This folder is empty, so let me start creating a file, gedit hello.html All right. Gedit is open, as usual. Let me do doctype, html, html, let me get ahead of myself and start closing my tags in advance Now I have the head. 
Let me go ahead and close the head, let me now do the title of the page, hello world like last time, close title, now let me do a body In here I’ll say hello, world with some exclams to make clear that it’s a different string Close body, and now let me go ahead and file save Let me go back to my terminal window, and if I type ls, I should, presumably, see hello.html. And I do So now let’s go back to my browser, click reload, and you can see we are indeed inside of this HTML folder I’m not seeing a web page yet; this is Apache, the web server, just showing me the list contents of this directory Just like Mac OS or Windows would typically do on your own local hard drive So if I want to see this web page, I can click this little link here, hello.html, and indeed, that’s what I was expecting to see Now, again, this is not a URL that any of you can visit right now, because for you, local host, if you have a laptop here, it is referring to your own instance of the appliance This is on my own personal appliance, but this is kind of dumb for me to have, to have a user like myself click on hello.html to actually see the contents of this page It turns out that web servers like Apache let you have a default file for any web server Notice here we have hello.html What’s the command in Linux to rename a file? MV, for move. So let me do that, and let me rename hello.html to index.html Let me type ls to confirm it’s now been renamed Now this is going to–if I go back to local host,

notice now that I’m automatically seeing that web page This is identical to my actually doing /index.html, but the nice thing now is that the web server’s figuring, oh, if you have a file that, by human conventions, is called index.html, let me show the user that file by default rather than some stupid directory listing which is not at all user-friendly Indeed, most websites you visit on the Internet don’t have a list of files to click on, they just show you the content. So that’s how we can do that, index.html So this is all fun and good, but this is a pretty simple web page Let me go ahead and open up index.html in my vhosts, local hosts, html directory, and let’s add something of greater interest So there’s hello world; let’s instead say ‘This is CS50, Harvard College’s . . .’ So the beginning of the course catalog description of some sort there Now if I reload, I should see this in my home page Okay, and I do see that, but suppose that I want to now list some more content in this file I could go down here and say, prerequisites none, although some of you are probably like, ‘Ha ha ha, no prerequisites.’ But–officially. So reload, and now we have the same quirk that we saw last time But why is that? It was a simple fix Why is this page broken? 
[Student, unintelligible] >>Yeah, we’ve solved this before by explicitly telling the browser ‘put a line break here.’ And that’s because, again, a browser’s only going to do explicitly what the markup language tells it to do, so even though you might have hit enter once or twice or even ten times, it’s going to combine that all into a single space, just by convention So if you really want a line break, you have to use the br tag, and now notice, like Monday, I put the / inside of this tag, only because this just doesn’t feel right to start a line break then stop it with nothing in between So the convention in HTML is to open and close a tag simultaneously As an aside, you’ll see a lot of websites in books not doing that It is correct to do or not to do it, but we would argue that design-wise and stylistically, this is just better because then every tag is both opened and closed somehow So now let’s save and reload. Go back to the browser, okay Now we’re making some progress, but it’s not quite enough Let’s go ahead and start typing in some longer body of text So let’s say, ‘A quick brown fox jumps over a lazy dog.’ And now let me just copy and paste this a few times so that we have a paragraph of text Let me go back over here. So it’s not looking very good I do have a line break, so it’s okay, but now, once we’re getting to the point of having a web page that has lots of content and not just single lines to demonstrate HTML, we can start to think of these things as actual paragraphs And we can start to structure our web page a little more cleanly And indeed, what I can do is go up here inside of my body tag, and you know what, if ‘This is CS50. . 
.’ really demarks the beginning of a paragraph, well, let’s tag it as such Let me indent the text, just by convention, let me say that this paragraph ends here, and then rather than do this line break, let me just say that this belongs there and as a new paragraph, and I’ll just quickly indent by just clobbering all of this stuff So now we have an indented paragraph there, and now our markup is starting to get a little more semantically consistent with what we’re trying to do We have a paragraph, so let’s call it a paragraph with the p tag We have a second paragraph, so let’s call it a paragraph with the p tag And now, what the browser will typically do is just like in an English book or essay, where you typically see some line breaks between paragraphs Browsers will do that for you automatically So now we have two paragraphs and we can continue this But, of course, on the Web, when you have bodies of text it’s not typically just huge blobs of text There are often hyperlinks in there So if we want to, for instance, include some links there, suppose what might be of interest in whatever web page I’m creating here is– let me go to Google.com, and let me search for a quick brown fox Go to Google images, and, how about–this is cute We’ll go with this. So here we have a quick brown fox jumping over a lazy dog So what I’m going to do here, just for the sake of demonstration, is suppose that this image was on my server, and I had been creating these images What I just did was right click or control click on the image, and what you’ll see in most browsers is a little menu– stop doing that–a little menu that allows you to choose copy link location or copy URL So let me go back now to my HTML, and suppose that I want to hyperlink this to another web page What was the tag called for that? [Student, unintelligible] >>Yeah. So a href for hyper reference

Let me go ahead and paste that in. It’s a pretty long URL, so let me zoom back out. Close brackets, so now notice I’m way over here because that URL happened to be pretty long. Let me scroll over here to the end of quick brown fox, and then let me close this tag with </a>. So everything at the top in blue is just a comment.

This is my doctype declaration, which again, you can just copy and paste on faith, for now. This just tells the browser, ‘Here comes some HTML 5.’ Below that, on line 14, is the first of my actual tags, and this just says, as before, here comes some HTML, here comes the head of my page, here comes the title, and then, conversely, that’s it for the title, that’s it for the head. Here now comes the body of my page. So a couple new tags now: h1 stands for heading 1. There’s a tradition in HTML from many years back of having different sizes of text. And back in the day, h1 meant, generally, just big and bold. But there’s also h2, which is big but not quite as big and bold. There’s h3, which is kind of big but not nearly as big and bold, and so forth, all the way down to h6. These days, though, h1, h2 and h3 are really meant to have more semantic meaning to them, whereby h1 is really a heading: the heading of a web page, the heading of a column or something like that of text. So I’ve deliberately said <h1>CS50 Search</h1> to specify that this is really the heading, the title of my page. Not the title in the title-bar sense, but the title that you actually see in the web page itself, in the body. Now this, you can probably guess what it is, even though we have a few new pieces of syntax. This is a form.
So the web really gets interesting when websites take input from users In this class, in the problem set on web programming, we’re not going to make a website, per se, with static content that shows photographs that you’ve taken, or this is my resume, and things about me, because those things are relatively easy to put together It’s hard to make things beautiful on the Web, but at least putting up content is pretty trivial But things get really interesting when someone can visit your website and provide input and can fill out forms, can check off checkboxes and can interact with your website And indeed, probably every website you care about these days, in any detail, is somehow interactive Facebook, Google, and the like, that take user input and produce customized output So let’s start to do that now. Let’s transition now from just using HTML for markup of static content as instead a delivery mechanism for dynamic content And toward that end, let’s implement our own search engine Let’s do it as follows. Here’s the form tag The action attribute specifies that when the user fills out this form with their keyboard, it will be submitted to this URL here So I’m kind of cheating. 
It’s going to take us a little longer than one class to implement the whole search engine, so we’ll just do the front end, so to speak We’ll do the part that lets the user search, and we’ll sort of punt to Google the hard part of finding search results, but, specifically, I’m going to talk to Google’s web server using one of two very popular methods One being get, another, that we’ll eventually see, being post, although there are others that are less often used So get just conjures up the idea of, I want to get some content, get some search results This, you can perhaps guess what this does This is some kind of input, it’s, in fact, going to look like a text field, and the name of that input, the name of that variable, so to speak, is going to be q for query by convention And again, the type of this input is not going to be a checkbox; it’s not going to be a menu; it’s going to be a text field as denoted by this attribute here, and this text box, like a line break, is either there or not So we have an empty element with the slash inside that tag Then I’m going to put a line break, and you can, perhaps, guess what this is going to do This is another sort of form input This one’s going to be used for submitting the form So this is going to be the big button that the user can click to submit the form, and the label on that button is going to be ‘CS50 Search.’ Close form, close body, close HTML Let’s see what we have in the form of this web page So let me go to my browser, let me go, still, to local host This is still index.html, so if I want to see this file called search0, I can simply do /search0.html, enter– and the first of my mistakes What’s going on? 
I clearly don’t have permission to access this file, for some reason But that’s because, unlike the work we’ve done thus far in C, where the programs you write are assumed to be runable by you, executable by you, that’s not really the case on the Web, whereby sometimes you might want to create files on a server, but you don’t want the whole world to be able to see them Rather, you want the world to see some files but not others, just for privacy’s sake So it’s more of an opt-in basis when you’re doing things on the Web And so let me actually type ls here, and you see the files I have, but recall that if I do ls -l for long, I’ll get a longer listing that gives me some more details about these files that are now, really, for the first time relevant to us Notice that on the far right are the names of my files, and then the time at which they were last modified or copied This number here is what? Do you recall? The size in bytes, how big the file is So I seem to have some kind of logo in here that’s bigger than all the other files This is who I am, this is what I am and what group I’m in

But then, over here on the left is a bit of cryptic sequence, and we talked, I think, briefly about this in the past, but this has to do with permissions And even if that’s a little hazy, RW probably means read & write So it turns out that these dashes denote different sets of permissions for different people And the pattern is, essentially, as follows When you see a sequence of dashes here, they look as follows There’s a dash, then there’s three more dashes, then there’s another three, then there’s another three The first one is either a dash or it’s a d for directory So that one’s pretty easy. If it’s a folder, it says d, otherwise it’s a hyphen There’s a couple other cases, but for now we’ll just care about files and directories These next three dashes–and I’ve artificially inserted the spaces They were, obviously, not there when we saw them a moment ago These are the file owner’s permissions, and recall from a second ago that it was read & write That was because I, as the person who created this file a moment ago, I, just by default, on a Linux computer, have the ability to continue reading and writing that file So the operating system just gives me RW automatically The middle ones relate to my group, that of students, which is sort of meaningless on the appliance because I’m the only person using the appliance So let me just wave my hands at that for now But the last ones are most important for the Web This is everyone else in the world, and the fact that that is – – – means that no one else in the world has any permissions to this file Clearly a problem, so I need to fix this by somehow giving the world what? Read & write? That’s probably dumb, right? I don’t want anyone on the Web to go to visit my page and somehow change that file, even though they really couldn’t with an HTML file, but just in principle, probably just want them to be able to read it What does it mean to read it? 
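The layout of that permission string can be sketched in Python. The standard-library function stat.filemode produces the same ten-character string that ls -l prints; the helper name describe and the dictionary keys are made up for illustration, and the sketch only distinguishes files from directories, as the lecture does.

```python
import stat


def describe(mode_string: str) -> dict:
    """Split an ls -l permission string such as '-rw-------' into its parts."""
    # First character: 'd' for a directory; anything else is treated as a file here.
    kind = "directory" if mode_string[0] == "d" else "file"
    return {
        "type": kind,
        "owner": mode_string[1:4],  # the file owner's permissions
        "group": mode_string[4:7],  # the group's permissions
        "world": mode_string[7:10],  # everyone else's permissions
    }


# A regular file that only its owner can read and write (octal mode 600).
listing = stat.filemode(0o100600)
print(listing)
print(describe(listing))
```

When the "world" slice is ---, nobody other than the owner and the group can read the file, which is why the web server refused to serve it.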
It doesn’t mean they’re going to care about the actual HTML, but the browser needs to be able to parse that markup language, top to bottom, left to right. So someone on the Web needs to be able to read it, so I minimally need to give it r. I can do this in a few different ways, but perhaps the simplest is to run this command here: chmod, change mode, then a+r, so all, everyone in the world, plus read, and then the name of the file, search0.html. Now if I do ls -l again, notice that that file has changed, and indeed, I’ve turned on r for everyone. I’ve also turned it on for my group, but that’s fine, because if I turned it on for everyone, my group is a subset of that. So that’s fine too. This just means the computer has now made it readable. Now let me go back to my browser, click reload. Ah-ha. We now have CS50 Search. I’ve zoomed in a little artificially; pretty hideous search engine. But let’s see if it actually works. First, let me do a quick sanity check; let me control click and view page source. Notice that within Chrome we’re now seeing the same HTML that I myself created. Don’t get confused here, though. I can’t start changing the code here, because the browser has a read-only view of this code. The browser has just asked localhost for a file called search0.html. It is now pure coincidence that the appliance happens to be on the same computer as my browser. I could just, equivalently, have typed in www.facebook.com/search0.html, and if Facebook had a file called that, I would then be seeing their HTML. And, of course, I can’t change the file that comes back from Facebook, either. So now we’re sort of blurring the lines. The appliance is both a server, serving up web pages, but it’s also a client in the sense that I’m using a browser to actually talk to that server. So let’s see if my Google search engine works. Let me go ahead and search for quick brown fox, enter. And voila, I now have my own search engine. But how does this work?
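As an aside on the chmod a+r step, the same change can be sketched in Python with the standard os and stat modules. The file here is a throwaway copy created in a temporary folder, not the real search0.html from the lecture, and the function name add_read_for_all is an assumption for illustration; the sketch assumes a Linux or similar system.

```python
import os
import stat
import tempfile


def add_read_for_all(path: str) -> str:
    """Equivalent in spirit to chmod a+r: turn on read for owner, group and world."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return stat.filemode(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "search0.html")  # stand-in file for the demo
    with open(path, "w") as handle:
        handle.write("<!DOCTYPE html>")
    os.chmod(path, 0o600)  # start as rw for the owner only
    before = stat.filemode(os.stat(path).st_mode)
    after = add_read_for_all(path)
    print(before, "->", after)
```

Read is switched on for the group as well as for everyone else, matching what ls -l showed after the command was run.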
Bit of a stretch, but, and now you can’t see, precisely, the part that’s of interest. Notice what happens. Notice the URL. It turns out that that method, called get, is super-simple. When you specify in a form that you want to ‘get’ results from some server, what it’s going to do is take whatever you typed into the form and put it in the URL. It’s going to standardize how it gets put into the URL as follows. Notice that this is the URL that was the value of my action attribute. That’s where I wanted the form to end up. But then notice this question mark. This is a convention on the Web whereby to provide user input to a website, you append to the URL a question mark, and then you have a whole bunch of key-value pairs. The name of a key, otherwise known as a parameter in the Web, then you have an equal sign, then you have the value of that parameter. So it’s essentially a variable name and a variable value, but those variable names and values came from the HTML form. Why are the pluses there, do you think? Because I did not type + in between my words. [Student, unintelligible]

>>Yeah, it’s just for spacing. Odds are, whenever you’ve seen a URL, there’s never any spaces in it, if only because if there were, you couldn’t really copy and paste it into an IM or into an email because it would break You want the whole thing to be one contiguous string of characters So the browser is smart enough to realize, uh-uh Don’t just put a space there. Let me encode the space in some standard way One of the conventions for doing so is to have the browser automatically put a + where you would otherwise have a space So now, notice Google has been kind of user-friendly I certainly did not create this web page, but they have prepopulated their own text field with what, precisely, I typed in Suppose I want to search for something else, like a lazy dog I can just type this here, re-search Notice that the URL changes up here, but notice then that I can actually search for anything I want just by understanding how URLs work I could do lazy cat, enter, and notice now I’m getting a very lazy–should we? I feel like we should I get a very lazy cat All right. This is one of the stupidest things we’ve done But that is a lazy cat Anyhow, what’s the key takeaway here? Now we’re sort of playing in the world of HTTP HTML is just this markup language, open tag, close tag, that tells a browser how to render content on a web page But when you start transmitting data across the Internet between web browser and server, that’s where this protocol known as HyperText Transfer Protocol takes over This is the sort of human convention; when Sam and I shook hands on Monday, starting a connection and then closing a connection, same idea here How are Google’s results coming back to me? How is my form submission going to Google? 
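The question mark, the key=value pair, and the + standing in for a space can all be reproduced with Python's standard urllib.parse module. This is only a sketch of what the browser does on submit: the parameter name q comes from the lecture, while the exact action URL below and the two function names are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(action, query):
    """Append ?q=... to the form's action URL, the way a GET form submission does."""
    return action + "?" + urlencode({"q": query})  # urlencode turns spaces into +


def read_query(url):
    """Go the other way: pull the human-friendly value of q back out of a URL."""
    return parse_qs(urlsplit(url).query)["q"][0]


# Assumed action URL for the demo; the form in class pointed at Google's /search.
url = build_search_url("http://www.google.com/search", "quick brown fox")
print(url)              # http://www.google.com/search?q=quick+brown+fox
print(read_query(url))  # quick brown fox
```

Because the whole convention lives in the URL, editing the text after q= by hand is equivalent to typing into the form and submitting it.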
Well, recall from the other day that what’s really going on underneath the hood when you request a web page is, your browser is sending a somewhat-cryptic message like GET / HTTP/1.1 for the default home page Or, in this case, because I specifically requested earlier search0.html, this then would be the somewhat-cryptic message that my browser sends to the appliance Or, in this case of Google, what’s actually sent is a request to /search, and then ?q=lazy cat, with a plus there So this message, that I, the human, am never typing, but is being sent by my browser, this is how HTTP happens This is the equivalent of our having shaken hands This is the request, and the server’s about to send a response So let’s take a look at this underneath the hood As before, we can open up this special field in a browser View page, inspect elements So under inspect element, notice that what’s happened in Chrome, and IE and Firefox have similar mechanisms, we have these developer tools accessible to us Normal people do not use these tabs But we, now, are interested in what’s going on underneath the hood at the network level So if I pull up the network level here, let me go ahead and expand this window, open up this entry here, and look at the headers So what happens when I request a file from a web server is my browser sends a whole bunch of things And let me view source. So under request headers, and this is just Chrome showing me some diagnostic output, sort of like a debugger of some sort, notice that what I’ve highlighted here is precisely what Chrome is sending to the server in order to request a file called search0.html It is telling the server what it thinks its name is, thanks to this host colon field, then there’s some pretty esoteric stuff in here, like something to do with dates and times, something to do with the languages that the browser understands, but the really important lines are these first two here What does the server respond with? 
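The somewhat-cryptic message described above is just text, so it can be sketched in a few lines of Python. This builds only the first two lines that were called out as the important ones, the request line and the Host header; a real browser adds more headers, and the function name here is made up for illustration.

```python
def build_get_request(host, path="/"):
    """Return a minimal HTTP/1.1 GET request as the text a browser would send."""
    lines = [
        f"GET {path} HTTP/1.1",  # method, what we want, protocol version
        f"Host: {host}",         # which site on that server we are asking for
        "",                      # a blank line ends the headers
        "",
    ]
    return "\r\n".join(lines)    # HTTP separates lines with carriage return + newline


print(build_get_request("localhost", "/search0.html"))
print(build_get_request("www.google.com", "/search?q=lazy+cat"))
```

The first call mirrors the request to the appliance for search0.html, the second the request that carries the query string to /search.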
Well, if we scroll down here and view source of this thing, notice that the server has responded with a somewhat cryptic message as well, 304 not modified That’s a little strange; let me actually try to fix this Let me hold down shift and click reload up here to force the browser to actually make this request for the first time Then let me zoom in, and we’ll see now that the server’s response, because I held shift, is 200 OK So you’ve probably never seen the number 200 in the context of the Web, but what numbers have you sometimes seen unexpectedly from a server? 404, file not found; 403, forbidden; 500, server error So there are these numeric codes that the world uses in the Web to signify errors, just like C functions can return errors and main can return exit codes 200, though, you rarely see because it means all is well And 304 you probably never see because what is it signifying? That nothing has–let’s see if we can simulate this again–

Oh, now it’s not cooperating. 304 said not modified, so why was the server even responding? Well, for efficiency, a web server automatically for you, if the file hasn’t changed, it won’t retransmit the whole HTML file It’ll just tell the browser it hasn’t changed Just use the copy you already have So there’s this notion of caching on the Web for performance, so that you don’t waste time and waste bandwidth downloading files again and again unnecessarily But this web page, now, was super-simple, and it only showed me the HTML that came back Let’s actually use the network tab now to do a Google search like quick brown fox Let me then click CS50 search, and now, notice in the bottom here a whole bunch of stuff came back because when I visit a real website like Google.com, they have images, they have text, they have a language called JavaScript there So every row in this table down here represents something that Google spit out in response to my single request The one I care about, though, is this first one And if I go to the search, request, click view source here, notice that, indeed, the cryptic message that my browser sent to Google was these two lines here, followed by some arcane information down here which we’ll ignore for now But notice, too, what Chrome is pretty handy with, it’s also showing me the query string that was sent in So rather than show me this, which was literally sent, if I view it decoded, Chrome, just for debugging purposes, for developers like us, it’s just showing me a human-friendly version of– that is not how you spell fox, apparently I’m just noticing this now–but it’s showing you what I, apparently, typed Meanwhile, the response that came back from the server is again 200 OK But included in that response, of course, if we actually view the page’s HTML– sorry, this is a little keyboard shortcut gone awry today I’ll deal with this later. 
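The numeric status codes that came up (200, 304, 403, 404, 500) are standardized, and Python's standard library happens to ship the table. A small sketch, with two made-up helper names, of looking up the phrase for each code and telling the error codes apart from the others:

```python
from http import HTTPStatus


def describe(code):
    """Return the code together with its standard phrase, e.g. '200 OK'."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_error(code):
    """Codes in the 400s blame the request; codes in the 500s blame the server."""
    return code >= 400


for code in (200, 304, 403, 404, 500):
    print(describe(code), "(error)" if is_error(code) else "")
```

Note that 304 is not an error: it is the server saying the cached copy is still good, which is why the browser did not download the file again until the forced reload.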
So if we actually view the page’s source, which I can do down here by clicking response, this is what was actually spit back, in addition to that cryptic 200 OK message from the server A little cryptic, but where is all this coming from? Well, let’s do one other thing here. Another somewhat-cryptic command, but this one’s kind of neat in that it reveals to us exactly what’s going on underneath the hood So I’m back on my Mac here, I have connected via a program called SSH, Secure Shell, to another server because most of Harvard’s computers block the command we’re about to run because there’s this command on some servers called traceroute that allows you to trace the route between points a and b, and thus far we’ve been taking completely for granted that I can type in Google.com and somehow get data back from halfway across the country or halfway across the world With traceroute we can actually dive in a little deeper as to how the Internet works, and see what’s going on underneath the hood So let’s go ahead and arbitrarily trace a route to, say, Stanford.edu, which is across the country, and hit enter This command can be super-fast or super-slow, but what we’re seeing now, line by line, is every one of the steps or hops between us and Palo Alto, or Stanford, where they have their web server So what does each of these lines represent more concretely, though? A piece of jargon from the Internet? [Student, unintelligible] >>What’s that? [Student, unintelligible] >>Oh, so there are times, but what does each row–what do I mean by hop? 
Well, there are these things on the Internet called routers And routers, as the name suggests, route information from point a to point b But there are several points beyond a and b There’s c and d and e and f between row 1, which happens to be my computer’s IP address, or my numeric address, which uniquely identifies my computer, and step 15, which is actually the sixth web server, apparently, which I’m inferring from this, or version 6 of their web server at Stanford But what’s kind of neat is, we can see the path that my 0’s and 1’s are taking from my computer to Stanford So step 1 is my own computer’s address Every computer on the Internet has a unique identifier that looks like this Number.number.number.number Somewhere on this campus, probably in the science center, is a router called Core Gateway 2 -te83, whatever that means, so this is one of Harvard’s big fancy routers that routes a lot of their traffic Here’s another of Harvard’s routers, this one is Border Gateway, border meaning it’s probably on the periphery of campus somewhere Then there’s nox one, row 4, which is Northern Crossroads, which is a big ISP, Internet service provider, that places like Harvard connect up to But then things get a little interesting in line 6 Where are my bits all of a sudden? Kansas The world has a habit of using airport codes in a lot of these things, or at least abbreviations for states or cities, so it looks like, in just 60 ms, a packet of information, 0’s and 1’s from my laptop got all the way to Kansas, and again, in 60 ms

Moreover, after Kansas, they took a tour through Houston, probably, as suggested by the name of this server. So just as a server on the Internet must have a numeric address, it can also, optionally, have a slightly more human-friendly address that humans came up with. Now, in step 8, we don’t know what this is. Sometimes routers just kind of ignore you, and they just don’t answer the questions, so that’s fine. The one after step 8 is apparently where? L.A. Notice in only 78 ms, what takes us humans like 6+ hours to do physically, takes packets of information on the Internet 78 ms to travel that far. Step 10 is in L.A. as well, and step 11 seems to have gone north, up near Stanford. This is their boundary router, or border router. A couple steps at Stanford that are ignoring us, and lastly, we reach the web server in just 87 ms. Now, all of these numbers, as an aside, just tell you how long it takes for data to get from me to each of these routers, and it’s not cumulative. What this program does is, it first sends a message, essentially, to the first router. Then one to the second router; then one to the third router, measuring each time. So in theory, these times will be growing or at least pretty close to one another, and, indeed, the ones that are right here on campus are super-small. As soon as you start going across the country, it takes data a little longer to travel, closer to 100 ms, give or take. But let’s go the other direction now. How about Cambridge University in the UK? Let me instead run traceroute of www.cam for Cambridge, ac for academic, .uk, and hit enter here. That was pretty damn fast. My data literally went to Cambridge, England, in that split second of time. So let’s see the path that it took. Harvard, Harvard, Harvard, Northern Crossroads, which is an ISP, and then this is Northern Crossroads, and then bam. What is in between steps 6 and 7, router 6 and 7? The Atlantic Ocean.
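Since each traceroute time is measured separately from me to that one router, the way to spot a long leg is to compare neighboring hops. Here is a small Python sketch of that reasoning. The hop list is invented sample data that only loosely resembles the output on screen, None stands for a router that ignored us, and the function name is made up.

```python
def biggest_jump(hops):
    """Find the two neighboring answering hops whose times differ the most.

    hops is a list of (hop_number, milliseconds or None); None means the
    router did not answer. Returns (from_hop, to_hop, extra_ms) or None.
    """
    answered = [(n, ms) for n, ms in hops if ms is not None]
    best = None
    for (a, ms_a), (b, ms_b) in zip(answered, answered[1:]):
        gap = ms_b - ms_a
        if best is None or gap > best[2]:
            best = (a, b, gap)
    return best


# Hypothetical times in ms, not the actual traceroute output from class.
SAMPLE_HOPS = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 20), (6, 20),
               (7, 80), (8, None), (9, 88)]

print(biggest_jump(SAMPLE_HOPS))  # (6, 7, 60)
```

With these made-up numbers the big step is between hops 6 and 7, which is the kind of jump that hints at an ocean in between.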
And we’re inferring this from the fact that we go from 20 ms here to 80 ms here So something took 60 ms, give or take, to get over And that was probably a big body of water What goes on after that? Well, here we are in London, just 88 ms later. More London, more London, not sure where this is, but we’ll assume it’s outside of London, Cambridge here, and finally we–literally, University of Cambridge something.net, and then, finally, in line 16, their web server is apparently called Scorpius underneath the hood, even though we know it as www Kind of mind-blowing, I think. The first time I ever did this, it totally blew my mind Unfortunately, Harvard blocks this kind of traffic, typically, on the network So you can’t do it super easily Realize, though, this here is possible All right. Let’s take our 5-minute break here. We’ll come back and dive in deeper So we are back, and we’ve kind of ambled about in a few different directions here So let’s summarize exactly what’s been going on here We started the conversation talking about this language called HTML Again, not a programming language. 
It’s just a markup language that is largely about aesthetics and structuring of content in the form of a webpage But HTML, therefore, needs some kind of mechanism for traveling between web browser and server HTML therefore sort of rides on top of this other language, or more properly, a protocol, known as HTTP And HTTP, as we’ve seen it thus far, is kind of analogous to this human convention of shaking hands When a browser wants to request a page from a server, it sends that “get” request from browser to server, and then the server responds with a number like 200, all is okay, as well as the HTML or some bad number like 404, file not found But meanwhile, HTTP itself isn’t the Internet, per se HTTP is just a service, a feature of the Internet much like G chat is another service, much like email is another service There’s all sorts of things we can do on the Internet HTTP is just one of those applications So on top of–HTTP is on top of something else which we didn’t mention by name, you might have heard of by name, TCP/IP So the story we just told there is all about how data travels from point a to point b And in this case, we saw at a very low level router to router to router to router, how the data is actually being transmitted But along the way, it is going to encounter various impediments Besides these routers, there are things called firewalls on the Internet, and so data, such as that we were just transmitting from me to Stanford, from me to Cambridge, is sent to, at this level, something called an IP address We saw this a moment ago, and an IP address is just a numeric address of the form w.x.y.z,

where each of these is between, give or take, 0 and 255, though you can’t quite use all of those numbers. But each of these place holders is a number between 0 and 255. So an IP address these days is 32 bits. Now, that gives us how many possible IP addresses in the world? Roughly 4 billion, because any time we’re counting in powers of 2 all the way up to 32 of something, that usually gives us 4 billion. So that’s a lot of IP addresses, but you might have read, or you might now notice in the popular press, a push toward a new version of IP called IPv6. Right now we’re using version 4. There really hasn’t been a version 5, we’re just jumping right to 6. Version 6 is going to use 128 bits for IP addresses, which is freaking huge. We should not run out for quite some time now, but we have begun to run out of version 4 IP addresses, because all of us have not only things like laptops and desktops, a lot of us have phones, a lot of us have other devices like TiVo and the like that have IP addresses themselves. Harvard itself has tens of thousands of computers. So the world is genuinely running out of IP addresses, at least of this form. So over the next few years, you are going to see the addresses on your own computers probably slowly change as more and more companies and universities start to support the newer version. But an IP address is not sufficient for computer a to request data from computer b. Because computer b could be a server, and a server, as I mentioned earlier, can do bunches of things. It can host web pages, it can be an email server, it can be a Skype server, it can be a G chat server. All these different services that can be provided on a server could all, physically, be on the same machine. So in addition to IP addresses, the world has things called ports on the Internet. A port is just a number; so there is a unique number for HTTP. Its number is 80. HTTP also uses number 443, but more specifically, for encrypted HTTPS. Whenever you see the s, for secure, that’s using a different number.
There are other numbers, like 25, used for something called SMTP, otherwise known as email. There’s something called 22 for SSH, and there’s a whole bunch of other ports out there. Now, we humans rarely see these numbers. However, when you type in an address like http://www.facebook.com, the browser is secretly inserting 80, because you’re using HTTP. If you, instead, type HTTPS, it’s secretly inserting 443. And we can kind of see this manually if I pull up a browser and go to http://www.facebook.com:80, thereby explicitly citing not just the name of the website but the port that I want to talk to, and hit enter. Notice it disappears, because the browser assumes, oh, 80, I’m not even going to bother showing that to you. But the reason for this is that if I actually wanted to send someone an email, I would really be sending it to them on port 25, that being SMTP. A bit of an oversimplification, but some of you have friends who actually work at Facebook, and they, similarly, have servers that receive email. Any time you send an email, what Gmail is doing for you or Outlook or whatever program you use, it’s sort of secretly inserting that number as well, 25, in that case. It’s this combination of IP address and number that uniquely identifies a computer on the Internet and a specific service on that computer. Now, of course, most of us have probably never typed manually an IP address. Maybe you have in the appliance, but in the real world, not so much. Why do we not type IP addresses into browsers?
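Both ideas from this stretch, the 32-bit address and the port the browser quietly fills in, can be checked with Python's standard library. This is a sketch only: the two function names and the small table of defaults are made up for illustration, and the table covers just the two ports named for the Web.

```python
import ipaddress
from urllib.parse import urlsplit

# The ports a browser assumes when the URL does not name one.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_and_port(url):
    """Return (hostname, port), filling in the default port the way a browser does."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


def ipv4_as_int(dotted):
    """Read w.x.y.z as the single 32-bit number it really is."""
    return int(ipaddress.IPv4Address(dotted))


print(host_and_port("http://www.facebook.com"))     # ('www.facebook.com', 80)
print(host_and_port("https://www.facebook.com"))    # ('www.facebook.com', 443)
print(host_and_port("http://www.facebook.com:80"))  # ('www.facebook.com', 80)
print(ipv4_as_int("255.255.255.255") + 1)           # 4294967296, i.e. 2**32
```

The last line is where "roughly 4 billion" comes from: four numbers of 8 bits each give 32 bits, and 2 to the 32nd is 4,294,967,296. The same arithmetic with 128 bits gives IPv6 its much larger address space.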
It would work, in fact, we can see this; let me show you one other command that should work most anywhere on Harvard’s campus on a Mac or a PC. There’s this command called nslookup, name server lookup. If I look up www.cnn.com, it turns out that CNN has–oh, interesting. CNN has started using Amazon Web Services. You might know of cloud computing, Amazon’s one of the big players in cloud computing. What I just did was, I said, ‘Give me the address of CNN’s web server,’ but it turns out that CNN’s web server is managed by Amazon, Amazon Web Services, this suggests. And the address of that server is this here. So I’m not sure if this will work, because they didn’t use to use Amazon. But let’s try this; http://, IP address, enter, and– is it going to work? Yes. It is going to work. Internet is super-slow today. But, in a moment, you will see some news story. There we go. Bank of America’s being sued. All right. This is because this IP address just happens to be synonymous with www.cnn.com. Of course, it would be horrible marketing to say, visit us on the Web at [IP address]. You’d never remember. So even these days you might recall things like 1-800-COLLECT or mnemonics the world came up with for phone numbers, which, before cell phones, were rather hard to remember until you could just type it in and forget about it.

So the Web, too, has this convention of names and IP addresses, and there are these things out there called DNS servers, Domain Name System servers, that translate IP addresses into names and vice versa. So that’s what’s going on underneath the hood. In the end, we have TCP/IP, which is this very low-level protocol that, really, just gets 0’s and 1’s across the Internet, and it does so by putting them into a virtual envelope, if you will, and writing on the outside of the envelope the IP address of the destination, as well as the numeric port number of the service on that destination that it wants to talk to. Meanwhile, on the envelope there’s also something known as a return address, which is your IP address, so that when CNN gets a packet of information from you, opens this virtual envelope, sees that you want the home page, it knows from the sender part of this virtual envelope whom to send the HTML back to. So let’s take a look at this in a little more detail. This is from a company called Ericsson, from a few years back. And they took some liberties with how the Internet actually works, but it paints a much more visual picture than mere chalk up here. So I give you “A Bit of the Internet.” [Narrator] For the first time in history, people and machinery are working together, realizing a dream. A uniting force that knows no geographical boundaries. Without regard to race, creed, or color. A new era where communication truly brings people together. This is The Dawn of the Net. Want to know how it works? Click here to begin your journey into the Net. Now, exactly what happened when you clicked on that link? You started a flow of information. This information travels down into your own personal mailroom where Mr. 
IP packages it, labels it, and sends it on its way. Each packet is limited in its size. The mail room must decide how to divide the information and how to package it. Now, the package needs a label containing important information such as sender’s address, receiver’s address, and the type of packet it is. Because this particular packet is going out onto the Internet, it also gets an address for the proxy server, which has a special function, as we’ll see later. The packet is now launched onto your local area network, or LAN. This network is used to connect all the local computers, routers, printers, etcetera, for information exchange within the physical walls of the building. The LAN is a pretty uncontrolled place, and, unfortunately, accidents can happen. The highway of the LAN is packed with all types of information. These are IP packets, Novell packets, AppleTalk packets. They’re going against traffic, as usual. The local router reads the address and, if necessary, lifts the packet on to another network. Ah, the router. A symbol of control in a seemingly disorganized world. [Router mumbling and talking to itself]

[Narrator] There he is, systematic, uncaring, methodical, conservative, and sometimes not quite up to speed. But at least he is exact, for the most part. As the packets leave the router, they make their way into the corporate intranet and head for the router switch. A bit more efficient than the router, the router switch plays fast and loose with IP packets, deftly routing them along their way. A digital ‘pinball wizard,’ if you will. [Router switch talking to itself] [Narrator] As packets arrive at their destination, they’re picked up by the network interface, ready to be sent to the next level. In this case, the proxy. The proxy is used by many companies as sort of a middle man in order to lessen the load on the Internet connection and for security reasons, as well. As you can see, the packets are all of various sizes depending upon their content. The proxy opens the packet and looks for the web address or URL. Depending upon whether the address is acceptable, the packet is sent on to the Internet. There are, however, some addresses which do not meet with the approval of the proxy. That is to say, corporate or management guidelines. These are summarily dealt with. We’ll have none of that. For those who make it, it’s on the road again. Next up, the firewall. The corporate firewall serves two purposes. It prevents some rather nasty things from the Internet from coming in to the intranet, and it can also prevent sensitive corporate information from being sent out onto the Internet. Once through the firewall, a router picks up the packet and places it onto a much narrower road, or bandwidth, as we say. Obviously, the road is not broad enough to take them all. Now, you might wonder what happens to all those packets which don’t make it along the way. Well, when Mr. 
IP doesn’t receive an acknowledgement that a packet has been received in due time, he simply sends a replacement packet We are now ready to enter the world of the Internet A spiderweb of interconnected networks which span our entire globe Here, routers and switches establish links between networks Now, the Net is an entirely different environment than you’ll find within the protective walls of your LAN Out here, it’s the Wild West Plenty of space, plenty of opportunities, plenty of things to explore and places to go Thanks to very little control and regulation, new ideas find fertile soil to push the envelope of their possibilities But because of this freedom, certain dangers also lurk You’ll never know when you’ll meet the dreaded ping of death, a special version of a normal request ping, which some idiot thought up to mess up unsuspecting hosts The path our packets take may be via satellite, telephone lines, wireless, or even transoceanic cable They don’t always take the fastest or shortest routes possible, but they will get there eventually Maybe that’s why it’s sometimes called “The World Wide Wait.” But when everything is working smoothly, you can circumvent the globe five times over at the drop of a hat, literally

And all for the cost of a local call or less Near the end of our destination, we’ll find another firewall Depending upon your perspective as a data packet, the firewall could be a bastion of security or a dreaded adversary It all depends on which side you’re on and what your intentions are The firewall is designed to let in only those packets that meet its criteria This firewall is operating on ports 80 and 25 All attempts to enter through other ports are closed for business Port 25 is used for mail packets, while port 80 is the entrance for packets from the Internet to the web server Inside the firewall, packets are screened more thoroughly Some packets make it easily through customs, while others look just a bit dubious Now, the firewall officer is not easily fooled, such as when this ping of death packet tries to disguise itself as a normal ping packet [Firewall officer talking to packets] [Narrator] For those packets lucky enough to make it this far, the journey is almost over It’s just a line up on the interface to be taken up into the web server Nowadays, a web server can run on many things, from a mainframe to a web cam to the computer on your desk Why not your refrigerator? 
With the proper setup, you can find out if you have the makings for Chicken Cacciatore, or if you have to go shopping. Remember, this is the dawn of the Net. Almost anything is possible. One by one, the packets are received, opened, and unpacked. The information they contain, that is, your request for information, is sent on to the web server application. The packet itself is recycled, ready to be used again, and filled with your requested information, addressed, and sent out on its way back to you. Back past the firewall, routers, and on through to the Internet. Back through your corporate firewall and onto your interface, ready to supply your web browser with the information you’ve requested. That is, this film. Pleased with their efforts, and trusting in a better world, our trusty data packets ride off blissfully into the sunset of another day, knowing fully they have served their masters well. Now, isn’t that a happy ending? [Malan] Okay, that’s enough. We’ll see you next week. [CS50.TV]