FREE · Lessons: 48 · Length: 7.6 hours


9.2 Getting Websites With HTTP

A common use for Python is web crawling and scraping. Again, we could do all of this with raw sockets, but Python has existing modules that can handle the heavy lifting for us. In this lesson, we'll see how to use HTTP (Hypertext Transfer Protocol) to retrieve a webpage with only a few simple lines of code and some assistance from another useful module, httplib.


In the last lesson, we started to talk about protocols. Specifically, we were talking about the Network Time Protocol, where we can send requests off to a server somewhere and get back some data that tells us the exact time for our machine. Now, that's all fine and dandy, but when we start talking about protocols it really makes sense to talk about one of the more popular ones: the Hypertext Transfer Protocol, or HTTP, which is really just a fancy way of talking about sending requests to web servers and getting back responses.

Now, here is a very simple example in my web browser, where I went to www.google.com and got back one of the most simplistic-looking webpages on the web, yet probably one of the most used. Interestingly enough, if you know anything about web servers, web applications, HTML, and all that kind of good stuff, you will know that making a very simplistic website like this takes a lot of work. What do I mean by that? Well, if I just open up my web browser and go to this site, it doesn't really look like there's much there, but behind the scenes there's actually a lot going on. So if I right-click here and choose View Page Source, you're gonna see that this is a very large page, and not only that, it is chock-full of things like HTML markup, JavaScript, and just a ton of information.

Now, as you can see, there is really no way you would be able to understand this. If you issued a request to a web server and it just showed you all of this business, you really would have no idea what's going on. But thankfully for us, the web browser knows exactly how to interpret all that information. Why do we care?
Well, say you were to create a website and you wanted to put it out there for end users to find, maybe because you're trying to sell some sort of service or something along those lines. It would be nice to have a mechanism to explain to places like Google what type of information is on your site and where to find it. The way Google handles that is, after you register with Google, it will send a web crawler and a little bit of a scraper out to your website to programmatically traverse it and find things like links, titles, descriptions, and all those types of things on your webpage.

Well, how does it do that if it can't see those things on a page looking like this? The way it does it is it makes requests out there and gets the page source, like I just showed you, and then it churns through that looking for all these specific pieces of information so that it can index them and make them available for other people to find.

Now, some time ago, going on two years ago actually, I did another course that went into detail on how to do the basics of these things, crawling the web with Python. You can go back and review some of that to see exactly how I showed you how to do it. But in this lesson, I'm gonna take you through the process of very simply issuing a programmatic request using HTTP out to a web server somewhere, retrieving this source, and being able to do something with it, whether that's scrape it, crawl it, or whatever you really wanna do. So let's go ahead and take a look at how that would work.

So here I have another file, and this one is gonna be 7-getwebsource.py, and I've already kinda started to put some things together here. What I wanna do is once again import another library that I can use to do this.
Now, once again, I can absolutely do this with sockets, but I would like to wrap up a lot of that functionality and make it easier on myself to issue those requests and get responses. So in order to do that, I'm going to use httplib in Python. To make sure that you have it, you can simply come over to a terminal, go into the interactive shell, and do that import again: import httplib. If it works, great. If it doesn't work, then you're going to need to go ahead and use pip to install it. Once again, using pip, install httplib, just like that. But I already have it installed, so I don't need to worry about it.

So let's head back over to our source code and get down to it. Now, we wanna make this one a little bit more sophisticated, maybe. Let's say we wanted the requests that go out to a server to be a little bit more interactive with the end user, so that I could specify at the command prompt what website I wanna go to, whether it's google.com or what have you. So let's go ahead and give that a try.

I'm going to come into this new function called get_page_source, and I want to create a new instance of the httplib client. So I'm gonna say http_client = httplib.HTTP, and I need to pass the URL that I ultimately wanna get to. In order to do that, I'm going to pass a url in to the function so that I can kinda propagate it through. That way I can take some input from the end user and go to the specific URL that they requested.

Now, before I can issue that request, I need to create it. I need to specify what type of request it is, and I can specify some HTTP headers in my request, things like what type of responses I accept and what the user agent is. All of those things are configurable by you through the httplib client. So the first thing I wanna do is specify what type of request I want to create here. So I'm gonna say http_client.putrequest.
So putrequest is going to create the specific type of request that we want to issue. In this case, I could be creating a GET request, which is what your web browser sends to these websites to get back the source to present within the browser. Or I could create POSTs or PUTs or DELETEs or whatever other HTTP verb you can think of, or that is documented; you can send them using httplib. But in this case I'm only worried about GET.

Then I can also specify what location within that URL I want to get to. So I could specify www.google.com and say just get the root. If I wanna just get the root, which is what happens when you issue a request to www.google.com from your web browser, that is specified by a forward slash. Now, if there are other pages nested within there, I could put them here. If there was a page or a location in there called /foo, I could do that as well. And I will actually show that in an upcoming lesson when we start talking about FTP. But in this case I'm just gonna stick with the root of whatever website or webpage I pass in.

Now, at this point, I could also put in a bunch of different headers. Like I said, I could specify the User-Agent header, the Host header. Some web servers, depending on where they're hosted and how they're configured, will require certain things like that. It's really impossible to know which web servers and locations out on the web are gonna require which headers, but in our case I'm gonna do some simple examples that don't require anything, so that I can just get back some source and go from there.

So now I'm going to actually issue a request and then get a response. HTTP is a request-response based protocol, so in order for me to get a response, I have to issue a request. The way we do that is I'm going to say http_client.getreply.
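Before moving on to the reply, it may help to see roughly what the request described above looks like as text on the wire: a request line with the verb and location, then one header per line, then a blank line that ends the headers. The sketch below is only an illustration. `format_request` is a hypothetical helper, not part of httplib, and the header values are made-up examples.

```python
def format_request(method, host, path="/", headers=None):
    """Return the text of a simple HTTP/1.1 request (illustration only)."""
    # Request line: the verb, the location within the site, and the version.
    lines = ["{} {} HTTP/1.1".format(method, path)]
    # HTTP/1.1 requests carry a Host header naming the site.
    lines.append("Host: {}".format(host))
    # Any extra headers, one "Name: value" pair per line.
    for name, value in (headers or {}).items():
        lines.append("{}: {}".format(name, value))
    # Lines end with CRLF, and a blank line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


# Hypothetical example values: the root of www.google.com with two headers.
request_text = format_request(
    "GET",
    "www.google.com",
    path="/",
    headers={"Accept": "text/html", "User-Agent": "example-client/0.1"},
)
print(request_text)
```

Changing the path argument to "/foo" would request the nested location mentioned above instead of the root.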
Now, getreply is going to send my request, and then it's going to get a response, as well as some other information. And that other information is a series of things. I'm going to get back an error code, I'm gonna get back an error message if something goes wrong, and I'm also going to get back the headers of the response. So with this HTTP protocol there's a lot of things going on. Once again, you could do this using sockets, but there's so much data going back and forth, including headers and error messages and status codes and all sorts of things like that, that it's really nice to be able to wrap all of that up in a single request and just get that information back.

Now, before we are actually able to get any of that information back, we have to get a file, which is really what's gonna be coming back from the web server. So let's see how that works. We're gonna say f is equal to http_client.getfile, like this, and then we can read the file. Really, the file is a representation of the data found in the source code of the particular page that you requested. So if I go to google.com, that initial home page is considered a file. You can think of it as a page, really, if you wanna think of it that way. A lot of the web servers or web applications out there are written dynamically, so they're not actually pages, but the way that HTTP was originally created, everything was typically served up as static files or pages on a web server. The mechanism for that has kinda propagated down through the years, so we're generally requesting a file even though there might not be a file that is a one-to-one mapping for our request. Just a little bit of background there.

So once we get this file, quote unquote, we can print whatever we get, but we have to be able to read that file. So I'm gonna print f.read(), like that.
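The httplib module used in this lesson belongs to the Python 2 standard library. In Python 3 it was renamed http.client, and the httplib.HTTP class with its getreply and getfile methods is no longer available. As a rough sketch of the same steps on Python 3, assuming plain HTTP on the default port, the version below uses http.client.HTTPConnection, where getresponse takes the place of getreply and getfile. The split of the URL into host and path, the Accept header, the timeout, and the UTF-8 decoding are all assumptions made for illustration. Note that the request headers have to be finished with endheaders before a response can be read, a step the lesson comes back to shortly.

```python
import http.client


def get_page_source(host, path="/"):
    """Fetch a page over plain HTTP; return (status, reason, headers, text)."""
    conn = http.client.HTTPConnection(host, timeout=10)
    try:
        conn.putrequest("GET", path)           # request line: GET <path> HTTP/1.1
        conn.putheader("Accept", "text/html")  # optional extra header (assumption)
        conn.endheaders()                      # finish the headers and send the request
        response = conn.getresponse()          # status code, reason, and headers
        body = response.read()                 # the "file": the raw page source as bytes
        # Assumption: decode as UTF-8 and replace any bytes that do not fit.
        text = body.decode("utf-8", errors="replace")
        return response.status, response.reason, response.getheaders(), text
    finally:
        conn.close()
```

Calling get_page_source("www.google.com") needs a network connection. A site may answer with a redirect status such as 301 rather than 200, and this sketch does not follow redirects.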
Now, at this point, I am issuing a request to the URL: I'm issuing a GET request, I'm getting a response, I'm gonna retrieve this source file, and I'm gonna read that and print everything out on the screen. Now, this is not gonna be pretty, but you are gonna be able to see basically what I showed you by viewing the page source through a web browser.

So I'm gonna come down into my main code block here, and I wanna get the URL that's specified at the command line. Once again, in order to do that, I'm gonna import sys, like that. I'm gonna make some assumptions that the user knows how to use this, and I'm going to get the URL that's being passed in, and then I'm gonna say get_page_source and pass in that URL. So let's go ahead and save that. And there we go. Now we have a little application to retrieve the page source from a web location and then do whatever we want with it.

Now, before we actually go back and try to use this, there is one little thing that I forgot. What's actually happening up here, as we are creating this request and putting this request, is that we're going through the process of creating the headers that really explain everything about the request that is going out. In order for this to actually finish and be a complete header, we need to do the putrequest, we can add additional headers, and then we actually have to end the headers. So we're gonna use http_client.endheaders, like this, to say that at this point we are done creating all of the headers that are gonna go with our HTTP request, so that I can now send it, and the end server is going to understand my request and send back whatever it is I'm asking for.

So let's go ahead and save that now. Let's head back over to our terminal and give this a shot. We're gonna say python 7-getwebsource.py, and I'm gonna say www.google.com, just like we did before, and hit Enter. And holy moly.
Look at all of that. That right there is the same source that we saw when we went to our web browser, navigated to google.com, and viewed the page source. Now, you typically wouldn't do this type of thing in a normal application, because just printing the source like this doesn't make much sense on its own, but if you go back and take a look at my web crawling and scraping course, you're gonna see where to go from this point. You can use other libraries to ingest that HTML markup so that you can start to pick things out of it, like tags, URLs, images, and all those types of things, and start to process these websites in a much more programmatic manner.
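As one possible illustration of picking things out of the markup, the sketch below uses html.parser from the Python 3 standard library to collect link targets and image sources. The choice of library is an assumption made here, not necessarily what the crawling and scraping course uses, and `LinkAndImageCollector`, `extract_links_and_images`, and the sample markup are hypothetical names and data.

```python
from html.parser import HTMLParser


class LinkAndImageCollector(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


def extract_links_and_images(html_text):
    """Return (links, images) found in a string of HTML."""
    collector = LinkAndImageCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links, collector.images


# A tiny made-up page standing in for the source a request would return.
sample = '<html><body><a href="/about">About</a><img src="logo.png"></body></html>'
links, images = extract_links_and_images(sample)
print(links)
print(images)
```

The same function could be given the page source returned by an HTTP request instead of the sample string.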
