Welcome to Crawling the Web With Python. In this course we will be covering the basics of building a simple web crawler and scraper using Python. Web scraping with Python is easy and fun!
Hello, my name is Derek Jensen and I would like to welcome you to something completely different. In the last several courses I've created, I've focused mostly on mobile development with iOS and Swift, but recently I embarked on a new journey, both personally and professionally: I've started to create a new website that's hopefully going to drive some interesting things in a particular environment, which I will share with you when the time is right. One of the things I had to do in this application was take a lot of different data points from different resources around the web, pull them all together, and centralize them for easy access by end users. Now, typically, when you're doing something like this, the way you'd really like to go is to talk to those different websites and get access to some APIs that you could pull that data from. Well, unfortunately, that's not always possible, so what I had to resort to was a little bit of web crawling and scraping. Now, if you're not familiar with those concepts, I'm going to introduce you to them now and show you exactly what I did. So, what we're going to do is focus on creating a rudimentary web crawler and scraper, and we're going to do it using Python. Like I said, something completely different. And you might think to yourself, well, why Python? Why should I be using that? Well, if you search around the web, you'll quickly find that several years ago most conversations about Python had to do with web crawling and web scraping, typically because of Google. Google's original web crawler was written at least mostly in Python, but it has since been migrated over to C++.
There may still be a little bit of Python in there, but they've kind of gotten away from it. It was definitely their go-to early on, though. So, hey, if it's good enough for Google, it's good enough for me. And at the time I really hadn't done much work in the Python space, so this was definitely an opportunity to get out of my comfort zone, learn something new, and ultimately help you guys do the same thing. So, web crawling and web scraping, what exactly are they? Well, the concept behind crawling is to write some code that navigates around a website, finds different links, and then follows those links to different pages within that site, and possibly even to other websites. That's the crawling part. So, you're crawling through a site like this one, looking for links to the About page, to the Downloads, to the Documentation, and so on. If you ever look at the source of a web page like this, you're gonna find a lot of links very quickly. So, if I just do a quick find and start to look for anchor tags, you're gonna find them all over the place. Now, these are the links that we wanna follow, so we can crawl around on this specific site as well as on other sites. Now scraping, what exactly is that? Well, scraping is the same basic concept: I wanna look through the source, the actual raw HTML of these pages before it's rendered by your browser, and I wanna start to extract data from it. So, if you look at this Python page here, you can see there's some data here about functions defined. Well, what if I wanted to find this paragraph, or these sentences, pull them off that website, and store them somewhere else so that I could use them for running data analytics or something like that? How would I do that?
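To make the crawling half concrete before getting to that answer, here is a minimal sketch of pulling the anchor tags out of a page's raw HTML. It uses only Python's standard library (`html.parser` and `urllib.parse`), which is an assumption made for illustration and not necessarily the tooling used later in the course. The `LinkCollector` class, the `extract_links` helper, the sample markup, and the `example.org` URLs are all made up for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every anchor tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the links a crawler wants to follow.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links such as "/about/" into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return every link found in the given HTML, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical markup standing in for a real page's navigation menu.
SAMPLE_HTML = """
<ul>
  <li><a href="/about/">About</a></li>
  <li><a href="/downloads/">Downloads</a></li>
  <li><a href="https://docs.example.org/">Documentation</a></li>
  <li><a name="top">An anchor with no href is skipped</a></li>
</ul>
"""

links = extract_links(SAMPLE_HTML, "https://www.example.org/")
for link in links:
    print(link)
```

A crawler would then download each of those URLs and repeat the same step on every new page, keeping a set of pages it has already visited so that it doesn't go around in circles.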
Well, in most browsers, and I'm using Chrome right now, you can right-click on this and select Inspect Element. Once you get down into here you'll see, all right, inside this paragraph I've got some text. Well, how do I get to that paragraph? First I start in my div that has a class of header-banner, I find another div, I go into an unordered list, I find the list item, and I get down into this div where I have a paragraph, and then I can extract the text. Now, that's kind of a rudimentary way to do it, but it will definitely work. Now, one caveat that I wanna throw out there before we start digging too deeply into this is that web crawling and web scraping are not an exact science, and they're definitely prone to failure, because the things that I'm going to show you in this course can, and possibly will, break over time. Depending on how you crawl through the site and search for different pieces of data, you could be relying on class names, ID names, or nested element structures that may change on those websites over time. So, if you write your code in a way that expects a certain structure and that website changes, then your crawlers and your scrapers could, and probably will, break. Ultimately, this is always kind of that second, alternative method that you might wanna use if you can't get a hold of an API. So, let's go ahead and start to set up our environment so we can write some code that's ultimately going to go out to a site and extract some data. I'm going to introduce all of that in the next lesson.
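As a rough preview, the walk described above (a div with the class header-banner, then a div, an unordered list, a list item, another div, and finally the paragraph) can be sketched with nothing but the standard library. This is only an illustration under some assumptions: the `ParagraphScraper` class and the sample markup are made up, each step is treated as a direct child of the one before it, and the placeholder text is not the real page content. The environment set up in the next lesson may use different tools.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# The nesting described above, as (tag, required class or None) pairs.
TARGET_PATH = [("div", "header-banner"), ("div", None), ("ul", None),
               ("li", None), ("div", None), ("p", None)]


class ParagraphScraper(HTMLParser):
    """Collects the text of every <p> reached through TARGET_PATH."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags as (tag, classes)
        self.paragraphs = []   # extracted paragraph texts
        self._buffer = None    # text pieces while inside a matching <p>

    def _matches_target(self):
        # The innermost open tags must line up with TARGET_PATH.
        if len(self.stack) < len(TARGET_PATH):
            return False
        tail = self.stack[-len(TARGET_PATH):]
        for (tag, classes), (want_tag, want_class) in zip(tail, TARGET_PATH):
            if tag != want_tag:
                return False
            if want_class is not None and want_class not in classes:
                return False
        return True

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, classes))
        if tag == "p" and self._buffer is None and self._matches_target():
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "p" and self._buffer is not None:
            # Join the pieces and collapse runs of whitespace.
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
            self._buffer = None
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


# Hypothetical markup that mirrors the structure seen in the inspector.
SAMPLE_HTML = """
<div class="header-banner">
  <div>
    <ul>
      <li>
        <div>
          <p>Placeholder text standing in for the real paragraph.</p>
        </div>
      </li>
    </ul>
  </div>
</div>
<p>This paragraph sits outside the banner and is ignored.</p>
"""

scraper = ParagraphScraper()
scraper.feed(SAMPLE_HTML)
print(scraper.paragraphs)
```

The caveat about breakage shows up right away in a sketch like this: if the site renamed the header-banner class or added one more wrapper element, `TARGET_PATH` would stop matching and `paragraphs` would simply come back empty.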