2.3 Beginning to Extract Data
Now it's time to actually start scraping some data. We will apply the basic concepts we discussed in an earlier lesson to the lxml library for Python in order to start retrieving useful data.
2.3 Beginning to Extract Data
So the next that we want to do is to be able to start to retrieve the data that comes from the URLs that we're passing into it. And ultimately start to work through the XML or the HTML that's coming back. So the first thing let's go ahead and just run this real quick. So in order to run this in it's current form. We would just simply use python from the command shell and then pass in our itunes.py file. And as you can see here, we have a little bit of an issue. This is a nice thing about python is it's pretty good about telling you when you do things wrong and where you've done the wrong things. So as you can see here on line 34, for app in self it even shows you the line text. It doesn't know what self is and that's because, at this point, we're not in any sort of context. There's no self.apps, but what there is, and I've mistyped this before is there is a crawler.apps. Remember, we created this app as a property inside of our AppCrawler class. So, now that we've done that and we've saved it we can go ahead and run that again and it seems to run just fine. It doesn't do anything yet, but it seems to work. So now we want to go ahead and bring in our libraries that we want to use in order to get this whole thing to work. To import those we're going to go to the top of our file. We're going to say from lxml since that's the name of the library. And I want to import it's called HTML. That's what's going to be provided to us from this particular library, so we can start to extract data from the HTML that's coming back from a string. Now we also want to import requests. So the first thing I think that we'll do is we're going to focus on getting the data out of the page. So once we've determined what page we want to go to, which in our case, is that Candy Crush page, we want to extract data out of that. And how exactly do we do that? Well, interestingly enough, it's quite simple. So if I reopen my Candy Crush game here, remember, this is kind of what we were looking at here. And once you go to one of these pages, if you find a piece of data that you want, so in this case, I want to get the title. Or the name of the app. So I'll go ahead and right click on that. And once again in Chrome I can select inspect element. And most other browsers will have something similar. And it's going to take you down into this kind of dom hierarchy tree. To exactly where that piece is at. And as you can see it's right here. It's in an H1 tag. Now the nice thing here is I can see that this is an H1, but it has a very specific attribute here called itemprop*name. So, this is going to help me when I actually start to try to extract this data out of this particular page. So, what I'm going to do is I'm going to use that lxml library to use, in our case you could ultimately iterate through all of the lines in the HTML that comes back. But that's going to be a little laborious. So I want to do is I want to use lxml to use some Xpath to actually go specifically to this line and get this piece of data out. So let's see how that will work. So I'm going to come into my get_app_from_link function here. And I'm going to start by issuing a request to get the data that I want. So I'm going to call this start_page and I want to use my request, and I want to use the get function that's attached to it, and I want to pass in the link. So this is going to issue a get request to the URL that I pass into it and it's going to get back basically a string representation of what's going on on that particular page. And if you wanted to prove that you could simply do a print and you could say start_page. And you want to use the text property that's attached to it. So, if I go ahead and save that and come over here and execute my code, it of course isn't actually working yet because crawl is not doing anything with that. So, I could go ahead and just say self.get_app_from_link and we'll say we want to pass into this self.starting URL. So this is ultimately what we're going to get to eventually. So now we've got an error. It says request is not defined. And that's because it is actually requests. So we'll save that and we'll go ahead and run that. And as you can see here, we've just spit out all of that HTML that came back. Now if you wanted to parse through this line by line and look for stuff, you can. Boy, but what a nightmare and a long time that would take, so that would be rather inefficient. So what we want to do, is we want to use that lxml library to start to pick it apart. I want to ultimately take that information and kind of create a document tree out of that so that I can actually use some xpath to get where I want to go and that's what we're going to do. We're going to create tree and we're going to set that equal to. And we're going to use HTML from the LXML library. And we're going to say fromstring. And we're going to pass in to it that startpage.text. So, all of that text is what's going to get patched in to tree. Now, we can use this tree to actually execute xpath queries in to that document, and pull data out. So the first thing we want to look for is name. If you remember once we went in there we had the name kind of highlighted on our page and as you can see here it was in an H1 tag with an item prop attribute equal to name and I wanted to pull that text out of there. So let's go ahead and see how we would do that. Now I'm not going to go through the specifics of how to do a lot of XPath. I'm going to give you the absolute basics here. So what I want to do. Is I want to get everything from the root. And in order to do that. Everything from the root of the tree that I kind of parsed out here. So to do that I give two slashes. So at this point now I can say all right I want you to give me H1 tags. And I want you to give me the H1 tags that have an attribute named itemprop that is equal to a value of name. So that is going to give me all of the H1 tags that have an itemprop attribute. That has a value of name and then ultimately I want to say, you know what, give me the text from that. So once I've done that, actually this XPath function here is going to return to me a collection or a list of all the things that match this. So in order to actually get the one that I want, I'm fairly certain that there's only one in this instance. So I'm going to say, give me the value that's at index of zero right now. And let's just go ahead and print out name just to see if this works. We'll go ahead and save that. We'll go ahead and run our queries. Honestly, you can see here, we pulled out Candy Crush Saga off the top of that page. Well, let's not stop there. Head back over to our page, now we also want to grab this string here. So we're going to right-click Inspect Element. Oh, look at that, it's an H2. But now at this point. It's not as clear as or as simple as saying, give me an attribute that has a particular or give me an element that has a particular attribute like the previous one. We have to look a little bit deeper. So let's go ahead and do the same thing in XPath again but let's see if we can maybe move up a line here and say give me the DIV that has a class of left and then give me an H2 within there and let's see if that's going to get us where we want to go. So, we'll come back in here again. And this is going to be the developer. So, well say developer=tree.xpath. And we'll say all right, I want to give me a div that has an attribute named class that's equal to left. And then, from within there. I want to then give me the H2 that is within there and then from within that H2 give me the text. And once again give me the first one in the hopes that that's going to be the correct one. So go ahead and save that and now instead of printing name we'll print developer so let's save that. All right, so we got back by King.com Limited, so we can go ahead and address that by, maybe we could strip that out later or something like that, but that's probably not a big deal, we could leave it in there. And then once again we wanted the price, so let's come back over to here and we'll go ahead down to the price area which is right here. And we'll inspect that element. And we have very nice. We have a div with an itemprop attribute of price and we want what's in there. So we can handle that in a very similar way to how we've already done. So let's come back in here and this time we are going to get the price. That's going to be tree.xpath. And this is once again going to be a div. And this is going to have a class, or an attribute, which is itemprop, which is going to be equal to price. And once we've got that far we can once again grab the text. And once again, we want the first one. So we'll save that and we'll go ahead and put the price in here. So let's save that and we'll go ahead and run our code, and there we go, it's free. So now you've retrieved three pieces of information, the name, the developer, and the price, all using fairly simple XPath queries to dig through that information. Now the next thing we want to do is we want to grab those apps that we found down in there. Down at the bottom here to say, Customers Also Bought. So let's go ahead and right-click on one of these and inspect this element. We're going to see this is pretty far down here inside some lookup info, inside a presentation. And the first list item within the UL with a role of presentation, this is going to give me the link. So this is roughly where we're going to need to go to get this information. But what we're going to have to notice about this is that a lot of times, if you just start looking for list items, there's probably going to be a lot on here. And same with the ULs. There's probably going to be a bunch of them in different places around here. So let's kind of come up and take a look a little bit. And I see we're in the center stack. So this seems to be where we're looking here to find these particular links. So the center stack seems to be this whole area in the middle here. So we want to look in the center stack and we want to dig down a little bit further and grab some information that's down here and lockup-info and in these list items, in particular the first one. So let's go ahead and come back in here and we will go ahead and grab the links. And that's going to be equal to, and we're going to use this Xpath function again, and this time I'm actually going to paste some Xpath in here just so you can save the headache of actually watching me type this. So we're going to grab that div that has an attribute of class with a value of center stack. And then we're going to use the double slashes again meaning I want to look at all the descendants of this particular div. And then that's what the * is going to mean. This is the wildcard that says give me everything that's within there. And from all of that data. I want to grab a link. I want to grab an A tag that has a class of name. So, it's like come back in here. You're going to see that this particular anchor tag has a class equal to name. And if I look within here, there's a little bit more information about Cookie Jam here. But I don't necessarily want this particular title. What I want is just this link. I want the value of this href attribute. So if I come back in here you'll see I'm just saying all right get me that far to that name and then give me the value at that href and that's going to give me a list. So at this point if I were to do for link in links and then I could go ahead and just pop this in a little bit and print out the link like that. We'll go ahead and save, and we'll run this. So now you see I've got three links, I've got the link to CookieJam, this 100 pix quiz, and also Fruit Ninja free, and those were the ones that were on that page, so now you've created this piece of information, or you've extracted all the information you want out of that particular page. So, instead of doing this printing, let's go ahead and create an app instance, which is going to be an instance of an app. And then we're going to pass in the name. The developer, the price and the links. So now that we've done that, we've seen how we can now grab all that data out and create a new instance. And at this point, why don't we also say self.app. So we're going to append this particular instance of that app to our apps list. So, now that we've done that, theoretically speaking, of course, if I were to clear this out and go ahead and run. Now you see, I have a name of Candy Crush Saga. Developers is by King.com Limited, and the price is free. Now, I'm not printing out the links, because technically speaking, you're probably not going to want to visualize those, but you're going to want to follow them. And that's exactly what we're going to start to work on in the next lesson.