FREE | Lessons: 7 | Length: 55 minutes


2.1 Where to Find the Data

This lesson will demonstrate a strategy of using the browser developer tools to identify the DOM structures that contain the data we want to find: anchor tags for crawling and text elements for scraping.


So, as I introduced in the previous lesson, what I would like to do is go to this particular site, the iTunes Preview site, and start off with any app. You can navigate around here and find any app you want. I'm gonna start with Candy Crush, and I wanna be able to pull some data about this app off this page. So I wanna get the title, I wanna get the author, I wanna get the price, and I also wanna get maybe the "customers also bought" list. So that's really what I wanna do.

But how are we gonna do that? Well, you have to kind of think back about how all of this web stuff works. If you open up your web browser and start navigating to URLs, ultimately, this is all just sugar. This is all just CSS and all this other stuff that your browser is interpreting and displaying based on what the developers, in this case of the iTunes Preview site, want you to see and how they want you to see it. But this is not what the data looks like as it comes back from the servers once you request this URL.

So if you think in very rudimentary terms, if you start talking about HTTP and HTML, what you're doing once you type this address into your address bar and hit Enter, or Send, or whatever, is just issuing an HTTP GET request to the itunes.apple.com domain, with this other stuff, this other data in the URL, as well. And then the server interprets that and sends back text. All it does is send back text. If you were to right-click on any sort of page and say View Page Source, this is what that text looks like.

Now, if you were just to write this out into, say, Notepad or TextPad or whatever you like, Sublime Text, it's just gonna look like a bunch of gobbledygook, and if you didn't write this, odds are you may not understand it all, but that's okay, you don't really have to. But our job, when we are trying to get some data off here by scraping and crawling, is to get access to this raw data, this raw HTML.
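As a rough illustration of the request described here, the following sketch uses Python's standard-library `urllib.request` to build a GET request and read the response body back as text. The URL, the `User-Agent` value, and the helper names `build_get_request` and `fetch_html` are placeholders made up for this example; they are not taken from the lesson, and the course may use a different library.

```python
from urllib.request import Request, urlopen

# Hypothetical placeholder URL; substitute the address of the app page
# you actually have open in the browser.
APP_URL = "https://itunes.apple.com/us/app/example-app/id000000000"


def build_get_request(url):
    """Build the same kind of GET request the browser issues for a URL."""
    # The User-Agent string is an arbitrary example value.
    headers = {"User-Agent": "scraping-lesson-sketch/0.1"}
    return Request(url, headers=headers, method="GET")


def fetch_html(url, timeout=10):
    """Send the GET request and return the response body as a string."""
    with urlopen(build_get_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    html = fetch_html(APP_URL)
    # Show the start of the raw text, the same text View Page Source shows.
    print(html[:500])
```

The network call only runs when the file is executed directly, so the helpers can be imported and inspected without contacting any server.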
Even though it looks like a mess, this is what we wanna get to. Because there's not a whole lot we can do with the rendered page. Sure, we can do some OCR, and we can do all that sort of stuff, but that's very, very complicated. Much easier would be to just get all this HTML as a string, or maybe as XML or something like that. Then I can start to pick it apart in different ways, and I can find the pieces of information that I want.

So how would you start to do that, even before we start writing some code? Well, using the developer tools in a lot of the major web browsers out there today, it's quite simple actually. So let's start with this title, Candy Crush Saga. If I right-click on this and, in Chrome, select Inspect Element, it's gonna open up a new view, and you're gonna see not only the rendered page here in your browser, but also a nice little hierarchical structure that lets you navigate through that HTML, which is pretty sweet. And once you hover over things here, it's going to highlight them on the page.

So as you can see here, this H1 tag that has an itemprop attribute of name is where that text comes from: Candy Crush Saga. Well, that's pretty nice. Now I know where that data is, and now I just need to be able to pull that out. Well, it's the same thing with the author right below it. It's in this H2 in that same div, so I can get to that data as well. So that's kind of nice. And I can even come over here to the price and Inspect Element. Here, that's in a div with an itemprop attribute again, this time with a value of price. And then here this says Free. Now obviously, if this was a paid app it would have the price in here, but at least I can see where that data is now. And then down at the very bottom here, I can see "customers also bought". I'll inspect that element.
And you're gonna see, okay, that's an H2, and I have a div here, but I bet if I dig down a little deeper, I'm gonna see a number of these divs that have information in them about these particular apps. And eventually I would get down into the presentation here. You're gonna see a number of list items, and in each one I'm gonna see a link to where I can get more information about that particular app, as well as its name, Cookie Jam in this case.

So we saw where within the HTML we can get the title and the author of the app, as well as the price. But now that we start looking down here at the bottom, we've found where we can get the titles of these "customers also bought" apps. And we also have a link here to where we can go get more information about each particular app as well. So that's where the whole concept of crawling is gonna come into play. Now I have these other URLs that I could navigate to. I could crawl to those URLs and pull out the same pieces of data that I'm pulling from this particular page.

So the only question at that point becomes: how deep do you want to go? Because if you were to just keep following link after link after link, you'd probably reach the end of the Internet before long. But that's not really what we want to do; we want to kind of limit this a little bit. So we wanna start on this page and pull out a few pieces of data. And then we wanna maybe go another level deeper, and grab the same information from the "customers also bought" apps. You could nest several layers deep and continue to go from page to page to page, but I think that should be enough to get you started.

So now that we kind of see how we're gonna pull that data out and where that data comes from, let's introduce Python and see how we can start to issue the same requests that we're doing through the browser here, namely by issuing some HTTP GET requests.
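To make the scrape-then-crawl idea concrete, here is a minimal sketch using only Python's standard library. It pulls the title, author, and price out of a page using the element locations found in the developer tools, collects the links inside list items, and follows those links up to a fixed depth. The markup produced by `make_page`, the `/app/...` paths, the author strings, and the names `AppPageParser` and `crawl` are all hypothetical stand-ins; the real page is more complex, and the parsing approach used later in the course may differ.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class AppPageParser(HTMLParser):
    """Collects title, author, price, and list-item links from one page."""

    def __init__(self):
        super().__init__()
        self.fields = {}       # scraped text, keyed by field name
        self.links = []        # href values of anchors inside <li> elements
        self._capture = None   # field whose text is currently being read
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and attrs.get("itemprop") == "name":
            self._capture = "name"
        elif tag == "h2" and "name" in self.fields and "author" not in self.fields:
            # Assumption: the first <h2> after the title holds the author.
            self._capture = "author"
        elif attrs.get("itemprop") == "price":
            self._capture = "price"
        elif tag == "li":
            self._in_li = True
        elif tag == "a" and self._in_li and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        self._capture = None
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.fields[self._capture] = data.strip()


def crawl(start_url, fetch, max_depth=1):
    """Breadth-first crawl that follows links at most max_depth levels deep.

    `fetch` is any callable that takes a URL and returns that page's HTML.
    """
    results = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in results:
            continue  # already scraped; avoids looping between pages
        parser = AppPageParser()
        parser.feed(fetch(url))
        results[url] = parser.fields
        if depth < max_depth:
            for link in parser.links:
                queue.append((urljoin(url, link), depth + 1))
    return results


def make_page(name, author, price, related):
    """Build simplified stand-in markup shaped like the inspected elements."""
    items = "".join(f'<li><a href="{href}">{text}</a></li>' for href, text in related)
    return (
        f'<div><h1 itemprop="name">{name}</h1><h2>{author}</h2></div>'
        f'<div itemprop="price">{price}</div>'
        f"<h2>Customers Also Bought</h2><ul>{items}</ul>"
    )


# In-memory stand-in for the site, so the sketch runs without a network.
PAGES = {
    "/app/candy-crush-saga": make_page(
        "Candy Crush Saga", "By Example Developer", "Free",
        [("/app/cookie-jam", "Cookie Jam")]),
    "/app/cookie-jam": make_page(
        "Cookie Jam", "By Another Developer", "Free",
        [("/app/example-app", "Example App"),
         ("/app/candy-crush-saga", "Candy Crush Saga")]),
    "/app/example-app": make_page("Example App", "By Third Developer", "$0.99", []),
}

if __name__ == "__main__":
    for url, fields in crawl("/app/candy-crush-saga", PAGES.__getitem__).items():
        print(url, fields)
```

Passing the page loader in as the `fetch` argument keeps the crawl logic separate from the HTTP layer, so a function that issues real GET requests can be swapped in later. The `max_depth` argument is the "how deep do you want to go" limit: 0 scrapes only the starting page, and 1 also scrapes the apps it links to.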
