2.4 Crawling to Other Pages
Scraping the data in the previous lesson is great. But how can we take this whole thing a step further? We can crawl to additional pages in order to get more data! This lesson will show you how to do just that.
All right, so we're moving along pretty well. Now, the next thing we want to do is do something with those links. What we ultimately wanna do is follow them and then extract more data about the apps on the upcoming pages. So first, let's go ahead and restructure this a little bit. In the crawl function, I wanna get the app that's being returned, just so I can keep some semblance of structure here. So I wanna return an app from the function that's being called by crawl. I'm gonna get that, and then from here I'll call self.apps.append and append that app.
All right, so the important thing to remember about crawling is that you want to keep track of all the links that you're going to use. There are many different ways to structure this. I'm gonna show you a few examples of libraries that handle all of this stuff for you, but if you ever want to roll your own to see how it works, you can follow a similar structure or go off and do your own thing. I'm just trying to get you started.
So we have this concept of a depth. That's the farthest we wanna go down this traversal of the different links. We're also gonna want to keep track of our current depth, of exactly where we're at. And we're going to start off at 0. So remember what I said: when we pass in that starting URL and we call crawl, we're starting at depth zero. So if I pass in 0 to my function down here, I'm only going to do this one iteration of crawl and then I'm done. Now, another thing I'm gonna wanna do is keep track of all the links that I find at the different depths. So I'm gonna create another property here called depth_links. And this is gonna be another list.
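As a rough sketch of the state described so far, the crawler might hold something like the following. The class name AppCrawler, the constructor signature, and the example links are assumptions made for illustration; only depth, current_depth, depth_links, and apps come from the lesson.

```python
class AppCrawler:
    """Hypothetical holder for the crawl state named in the lesson."""

    def __init__(self, starting_url, depth):
        self.starting_url = starting_url
        self.depth = depth        # farthest level of links to follow
        self.current_depth = 0    # the starting URL counts as depth 0
        self.depth_links = []     # depth_links[d] holds the links found at depth d
        self.apps = []            # every app scraped so far


# Passing 0 means: scrape the starting page only, follow nothing.
crawler = AppCrawler("/app/start", 0)

# The intended shape of depth_links after a crawl (made-up links):
example_depth_links = [
    ["/app/a", "/app/b", "/app/c"],  # depth 0: links on the starting page
    ["/app/d", "/app/e", "/app/f"],  # depth 1: links from a, b, and c combined
]
```

Indexing the list by depth means the links to visit next are always at depth_links[current_depth].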
At each index within this list, I'm going to store the links that I find at that depth. So at depth zero, I'm going to store the links from the starting URL. At depth one, I'm gonna store all the links that I found at depth one. In this particular instance, I believe there were three apps listed on the starting URL page. I wanna follow all of those, get all the links on those subsequent pages, and combine them into a single list at depth_links index one. So you see how that's going to flow.
So now, once we come in to crawl, we want to get the app that's found at that starting URL. We wanna save that particular app. And then we wanna save the links that we find from that app. So we'll say self.depth_links.append, and we'll pass in app.links. That's going to put that list of links into depth_links. And since append is adding to an empty list, that list of links ends up at index 0. So at index 0 now, we're gonna have the app links found at depth 0.
What we wanna do now is a little while loop. As long as self.current_depth is less than self.depth, we wanna continue to crawl the links that we find at that particular depth. So let's go ahead and create a variable here, current_links, to keep track of all the links that we find as we go through these pages. Then we want to go through the links at our current depth to find the next set of apps that we want to parse. So we'll say for link in self.depth_links, and I wanna find all of them that are at our current_depth, just like that. So now, I'm going to iterate through all of those. And of course, we'll start by grabbing an app.
So that's gonna be equal to self.get_app_from_link, and we're gonna pass into it the link that we're working with. That's gonna give us back an app. And actually, just to keep things a little bit more clear, I'm gonna call this current_app so we don't get confused with any other app that we're throwing around here. Then I'm going to go into my current_links. What I'm getting back is actually a list, so if I start to append these I'm going to get lists of lists, and that's going to confuse things as we go. So at this point I'm going to use extend instead of append. Extend lets me pass in a list, and instead of adding that list at a single index, it adds each item in that list as its own item in my current_links list.
All right, so once I've done that, I can come down and add this new app to my list of apps. We'll say self.apps.append, I want to append a new app, like that, and we'll go ahead and save that. And that's going to end this for loop. Once I finish that, I'm gonna pop back out into my while. At this point I want to increment my current depth, because I've finished at that depth. So I'll say current_depth += 1, and then I'm also going to go into my depth_links and append everything I found in current_links. We'll go ahead and save that.
So now you've gotten to the point where you've started to extract data from the subsequent links. Let's clear what we have here and run this. Now, what we would expect to happen is to get only the first one, because the depth that we passed in is still 0. So we're still getting just that first one.
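To make the append versus extend point concrete, here is a small standalone comparison. The link strings are made up for illustration; only the two list methods matter.

```python
links_from_page = ["/app/d", "/app/e"]  # hypothetical links from one app page

appended = []
appended.append(links_from_page)  # adds the whole list as a single element

extended = []
extended.extend(links_from_page)  # adds each link as its own element

print(appended)  # [['/app/d', '/app/e']]
print(extended)  # ['/app/d', '/app/e']
```

That is why current_links uses extend, which keeps it one flat list of links, while depth_links uses append, so that each depth gets exactly one entry.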
So let's see what would happen now if I come down here and say, all right, I wanna go one layer deep. Let's save that and run it again. Now, this should take a little bit longer because it has to keep going, but I see that we have a little bit of an issue. It seems that we keep hitting the same link, or we keep writing the same data into our apps. So let's see what we've done wrong here. I actually see what the problem is, and I started to allude to it earlier: I typed something incorrectly, so I'm continually adding the same app into our collection. I don't wanna do that. I actually want to use current_app, like that. So once I've retrieved my current_app from a link, I extend current_links with the links that I find on that app, and then I add the current app that I'm iterating through into my collection. So let's save that and see if that takes care of our issue. As I mentioned before, this could take a few extra seconds. But as you can see now, we've retrieved four. We retrieved the app at layer 0, and we've also retrieved information about the three apps that are listed on that page as something else we might be interested in.
So I don't have to stop here. I could obviously keep going. I could put two here, save that, and see what happens. Actually, we'll clear this out first, and then we'll run this one more time.
Now, I have to warn you as we begin to do this. Many sophisticated websites are going to notice if you start executing a bunch of GET requests, or any sort of HTTP requests, against their web servers too quickly. You can very quickly have your IP address blacklisted on their server.
The server will then block your requests and send back error responses and things like that, so you don't wanna do that. Typically, when people write apps like this, they guard against that by putting in a little bit of a wait, or a sleep timer, to space things out so it doesn't get to the point where you actually get blacklisted. We can do that quite easily. There's another library we can import called time. We'll save that. Then we'll come down in here, maybe at the end of this for loop, and say time.sleep. You pass in the number of seconds you wanna wait. I'll just put in five seconds and save that. So now this is gonna take a little bit longer, because after we get the new app from a particular link, we're gonna sleep for five seconds and then go back through again. Let's clear this out and run it one more time.
Ultimately, what you're seeing now is that we're able not only to scrape data from these URLs, but also to crawl on to the pages they link to. So you've seen how to use simple XPath queries to scrape pieces of information off of web sites. And we've also learned a little bit about crawling the web. We start at a particular URL and get the data that we want from it. Then we take a look at the links that we find there. Once we find those links, we determine how deep we want to go. Then we go on to the first child page and extract the app information as well as the links off of that one. Then we go to the next child, and the next one, until we've exhausted all the children at that layer. We save all of those links. And then we can say, all right, do we want to go another layer deeper?
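Pulling the steps just recapped into one place, a minimal sketch of the corrected loop might look like this. It is not the lesson's actual file: FAKE_PAGES, the App class, and the delay parameter are assumptions standing in for the real HTTP request and XPath parsing, so the example runs without touching a web server.

```python
import time

# Hypothetical stand-in for the site: page -> (app name, links on that page).
FAKE_PAGES = {
    "/app/start": ("Start App", ["/app/a", "/app/b", "/app/c"]),
    "/app/a": ("App A", ["/app/d"]),
    "/app/b": ("App B", ["/app/start"]),
    "/app/c": ("App C", []),
    "/app/d": ("App D", []),
}


class App:
    def __init__(self, name, links):
        self.name = name
        self.links = links


class AppCrawler:
    def __init__(self, starting_url, depth, delay=0.0):
        self.starting_url = starting_url
        self.depth = depth
        self.delay = delay  # seconds to wait between pages (the lesson uses 5)
        self.current_depth = 0
        self.depth_links = []
        self.apps = []

    def get_app_from_link(self, link):
        # The real version would fetch the page and run XPath queries on it.
        name, links = FAKE_PAGES[link]
        return App(name, list(links))

    def crawl(self):
        app = self.get_app_from_link(self.starting_url)
        self.apps.append(app)
        self.depth_links.append(app.links)  # index 0: links found at depth 0

        while self.current_depth < self.depth:
            current_links = []
            for link in self.depth_links[self.current_depth]:
                current_app = self.get_app_from_link(link)
                current_links.extend(current_app.links)  # keep one flat list
                self.apps.append(current_app)  # current_app, not app
                time.sleep(self.delay)  # space out requests
            self.current_depth += 1
            self.depth_links.append(current_links)


crawler = AppCrawler("/app/start", 1, delay=0.0)
crawler.crawl()
print([a.name for a in crawler.apps])
```

With a depth of 1 this visits the starting page plus its three linked pages, which matches the four apps seen in the lesson's run. The delay is 0 here only so the sketch finishes instantly.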
So ultimately, you could continue going down this path for quite some time until you get everything you're looking for. Once this program execution finishes, you'll see that we're once again going to get the same output. And now our results have finally come back. You can start to poke through here and see whether there are some interesting apps that you might wanna try out or take a look at.
One thing to note here, and you can build all sorts of functionality and logic into this: the deeper you go down this depth path, the more likely you are to come full circle, as we did here. Candy Crush Saga is now listed as something recommended if you viewed one of these other apps, and that's actually where we started from. So you might want to build in some logic to say, if I get some duplicates, skip those. I'll leave that as an upgrade for you to work on in this script.
So as you can see, with just a few lines of code and a few libraries, some built in and some you might have to download and include yourself, we've built a fairly nice little application. It's not really a general-purpose scraper or crawler, but it does what I'm interested in doing. Hopefully you can take this information, adapt it a little bit to work how you want it to, and go pull out the data you're looking for. Now, in the next lesson, I'm going to introduce a couple of different libraries that can make this much simpler, not only for scraping but also for crawling, just to show you that you have some options.
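For the duplicate-skipping upgrade, one possible approach (not something shown in the lesson) is to remember every link already visited in a set and skip any link that turns up again. The function name crawl_unique and the small pages dictionary below are made up for illustration, with the dictionary standing in for real page fetches.

```python
def crawl_unique(pages, starting_url, depth):
    """Return the pages visited, in order, skipping links already seen."""
    seen = {starting_url}
    visited = [starting_url]
    frontier = list(pages[starting_url])  # links found at depth 0

    for _ in range(depth):
        next_frontier = []
        for link in frontier:
            if link in seen:
                continue  # duplicate: already scraped this page
            seen.add(link)
            visited.append(link)
            next_frontier.extend(pages[link])
        frontier = next_frontier
    return visited


# Hypothetical link graph that loops back to the starting page.
pages = {
    "/app/start": ["/app/a", "/app/b"],
    "/app/a": ["/app/start", "/app/b"],
    "/app/b": ["/app/a"],
}

print(crawl_unique(pages, "/app/start", 2))
# ['/app/start', '/app/a', '/app/b']
```

Even at depth 2, each page is visited once, so the starting page does not show up again when a later page links back to it.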