Techniques for Mastering cURL

cURL is a tool for transferring files and data with URL syntax, supporting many protocols including HTTP, FTP, TELNET and more. Initially, cURL was designed to be a command line tool. Lucky for us, the cURL library is also supported by PHP. In this article, we will look at some of the advanced features of cURL, and how we can use them in our PHP scripts.

Why cURL?

It's true that there are other ways of fetching the contents of a web page. Many times, mostly due to laziness, I have just used simple PHP functions instead of cURL:


$content = file_get_contents("http://www.nettuts.com");

// or

$lines = file("http://www.nettuts.com");

// or

readfile("http://www.nettuts.com");

However they have virtually no flexibility and lack sufficient error handling. Also, there are certain tasks that you simply can not do, like dealing with cookies, authentication, form posts, file uploads etc.

cURL is a powerful library that supports many different protocols, options, and provides detailed information about the URL requests.

Basic Structure

Before we move on to more complicated examples, let's review the basic structure of a cURL request in PHP. There are four main steps:

Initialize
Set Options
Execute and Fetch Result
Free up the cURL handle


// 1. initialize
$ch = curl_init();

// 2. set the options, including the url
curl_setopt($ch, CURLOPT_URL, "http://www.nettuts.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);

// 3. execute and fetch the resulting HTML output
$output = curl_exec($ch);

// 4. free up the curl handle
curl_close($ch);

Step #2 (i.e. curl_setopt() calls) is going to be a big part of this article, because that is where all the magic happens. There is a long list of cURL options that can be set, which can configure the URL request in detail. It might be difficult to go through the whole list and digest it all at once. So today, we are just going to use some of the more common and useful options in various code examples.

Checking for Errors

Optionally, you can also add error checking:


// ...

$output = curl_exec($ch);

if ($output === FALSE) {

	echo "cURL Error: " . curl_error($ch);

}

// ...

Please note that we need to use "=== FALSE" for comparison instead of "== FALSE". Because we need to distinguish between empty output vs. the boolean value FALSE, which indicates an error.

Getting Information

Another optional step is to get information about the cURL request, after it has been executed.


// ...

curl_exec($ch);

$info = curl_getinfo($ch);

echo 'Took ' . $info['total_time'] . ' seconds for url ' . $info['url'];

// ...

Following information is included in the returned array:

"url"
"content_type"
"http_code"
"header_size"
"request_size"
"filetime"
"ssl_verify_result"
"redirect_count"
"total_time"
"namelookup_time"
"connect_time"
"pretransfer_time"
"size_upload"
"size_download"
"speed_download"
"speed_upload"
"download_content_length"
"upload_content_length"
"starttransfer_time"
"redirect_time"

Detect Redirection Based on Browser

In this first example, we will write a script that can detect URL redirections based on different browser settings. For example, some websites redirect cellphone browsers, or even surfers from different countries.

We are going to be using the CURLOPT_HTTPHEADER option to set our outgoing HTTP Headers including the user agent string and the accepted languages. Finally we will check to see if these websites are trying to redirect us to different URLs.


// test URLs
$urls = array(
	"http://www.cnn.com",
	"http://www.mozilla.com",
	"http://www.facebook.com"
);
// test browsers
$browsers = array(

	"standard" => array (
		"user_agent" => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 (.NET CLR 3.5.30729)",
		"language" => "en-us,en;q=0.5"
		),

	"iphone" => array (
		"user_agent" => "Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A537a Safari/419.3",
		"language" => "en"
		),

	"french" => array (
		"user_agent" => "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB6; .NET CLR 2.0.50727)",
		"language" => "fr,fr-FR;q=0.5"
		)

);

foreach ($urls as $url) {

	echo "URL: $url\n";

	foreach ($browsers as $test_name => $browser) {

		$ch = curl_init();

		// set url
		curl_setopt($ch, CURLOPT_URL, $url);

		// set browser specific headers
		curl_setopt($ch, CURLOPT_HTTPHEADER, array(
				"User-Agent: {$browser['user_agent']}",
				"Accept-Language: {$browser['language']}"
			));

		// we don't want the page contents
		curl_setopt($ch, CURLOPT_NOBODY, 1);

		// we need the HTTP Header returned
		curl_setopt($ch, CURLOPT_HEADER, 1);

		// return the results instead of outputting it
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

		$output = curl_exec($ch);

		curl_close($ch);

		// was there a redirection HTTP header?
		if (preg_match("!Location: (.*)!", $output, $matches)) {

			echo "$test_name: redirects to $matches[1]\n";

		} else {

			echo "$test_name: no redirection\n";

		}

	}
	echo "\n\n";
}

First we have a set of URLs to test, followed by a set of browser settings to test each of these URLs against. Then we loop through these test cases and make a cURL request for each.

Because of the way setup the cURL options, the returned output will only contain the HTTP headers (saved in $output). With a simple regex, we can see if there was a "Location:" header included.

When you run this script, you should get an output like this:

POSTing to a URL

On a GET request, data can be sent to a URL via the "query string". For example, when you do a search on Google, the search term is located in the query string part of the URL:

1	http://www.google.com/search?q=nettuts

You may not need cURL to simulate this in a web script. You can just be lazy and hit that url with "file_get_contents()" to receive the results.

But some HTML forms are set to the POST method. When these forms are submitted through the browser, the data is sent via the HTTP Request body, rather than the query string. For example, if you do a search on the CodeIgniter forums, you will be POSTing your search query to:

1	http://codeigniter.com/forums/do_search/

We can write a PHP script to simulate this kind of URL request. First let's create a simple file for accepting and displaying the POST data. Let's call it post_output.php:

1
2	print_r($_POST);

Next we create a PHP script to perform a cURL request:


$url = "http://localhost/post_output.php";

$post_data = array (
	"foo" => "bar",
	"query" => "Nettuts",
	"action" => "Submit"
);

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $url);

curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// we are doing a POST request
curl_setopt($ch, CURLOPT_POST, 1);
// adding the post variables to the request
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);

$output = curl_exec($ch);

curl_close($ch);

echo $output;

When you run this script, you should get an output like this:

It sent a POST to the post_output.php script, which dumped the $_POST variable, and we captured that output via cURL.

File Upload

Uploading files works very similarly to the previous POST example, since all file upload forms have the POST method.

First let's create a file for receiving the request and call it upload_output.php:

1
2	print_r($_FILES);

And here is the actual script performing the file upload:


$url = "http://localhost/upload_output.php";

$post_data = array (
	"foo" => "bar",
	// file to be uploaded
	"upload" => "@C:/wamp/www/test.zip"
);

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $url);

curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

curl_setopt($ch, CURLOPT_POST, 1);

curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);

$output = curl_exec($ch);

curl_close($ch);

echo $output;

When you want to upload a file, all you have to do is pass its file path just like a post variable, and put the @ symbol in front of it. Now when you run this script you should get an output like this:

Multi cURL

One of the more advanced features of cURL is the ability to create a "multi" cURL handle. This allows you to open connections to multiple URLs simultaneously and asynchronously.

On a regular cURL request, the script execution stops and waits for the URL request to finish before it can continue. If you intend to hit multiple URLs, this can take a long time, as you can only request one URL at a time. We can overcome this limitation by using the multi handle.

Let's look at this sample code from php.net:


// create both cURL resources
$ch1 = curl_init();
$ch2 = curl_init();

// set URL and other appropriate options
curl_setopt($ch1, CURLOPT_URL, "http://lxr.php.net/");
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch2, CURLOPT_URL, "http://www.php.net/");
curl_setopt($ch2, CURLOPT_HEADER, 0);

//create the multiple cURL handle
$mh = curl_multi_init();

//add the two handles
curl_multi_add_handle($mh,$ch1);
curl_multi_add_handle($mh,$ch2);

$active = null;
//execute the handles
do {
    $mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

while ($active && $mrc == CURLM_OK) {
    if (curl_multi_select($mh) != -1) {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }
}

//close the handles
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);

The idea is that you can open multiple cURL handles and assign them to a single multi handle. Then you can wait for them to finish executing while in a loop.

There are two main loops in this example. The first do-while loop repeatedly calls curl_multi_exec(). This function is non-blocking. It executes as little as possible and returns a status value. As long as the returned value is the constant 'CURLM_CALL_MULTI_PERFORM', it means that there is still more immediate work to do (for example, sending http headers to the URLs.) That's why we keep calling it until the return value is something else.

In the following while loop, we continue as long as the $active variable is 'true'. This was passed as the second argument to the curl_multi_exec() call. It is set to 'true' as long as there are active connections withing the multi handle. Next thing we do is to call curl_multi_select(). This function is 'blocking' until there is any connection activity, such as receiving a response. When that happens, we go into yet another do-while loop to continue executing.

Let's see if we can create a working example ourselves, that has a practical purpose.

Wordpress Link Checker

Imagine a blog with many posts containing links to external websites. Some of these links might end up dead after a while for various reasons. Maybe the page is longer there, or the entire website is gone.

We are going to be building a script that analyzes all the links and finds non-loading websites and 404 pages and returns a report to us.

Note that this is not going to be an actual Wordpress plug-in. It is only a standalone utility script, and it is just for demonstration purposes.

So let's get started. First we need to fetch the links from the database:


// CONFIG
$db_host = 'localhost';
$db_user = 'root';
$db_pass = '';
$db_name = 'wordpress';
$excluded_domains = array(
	'localhost', 'www.mydomain.com');
$max_connections = 10;
// initialize some variables
$url_list = array();
$working_urls = array();
$dead_urls = array();
$not_found_urls = array();
$active = null;

// connect to MySQL
if (!mysql_connect($db_host, $db_user, $db_pass)) {
	die('Could not connect: ' . mysql_error());
}
if (!mysql_select_db($db_name)) {
	die('Could not select db: ' . mysql_error());
}


// get all published posts that have links
$q = "SELECT post_content FROM wp_posts
	WHERE post_content LIKE '%href=%'
	AND post_status = 'publish'
	AND post_type = 'post'";
$r = mysql_query($q) or die(mysql_error());
while ($d = mysql_fetch_assoc($r)) {

	// get all links via regex
	if (preg_match_all("!href=\"(.*?)\"!", $d['post_content'], $matches)) {

		foreach ($matches[1] as $url) {

			// exclude some domains
			$tmp = parse_url($url);
			if (in_array($tmp['host'], $excluded_domains)) {
				continue;
			}

			// store the url
			$url_list []= $url;
		}
	}
}

// remove duplicates
$url_list = array_values(array_unique($url_list));

if (!$url_list) {
	die('No URL to check');
}

First we have some database configuration, followed by an array of domain names we will ignore ($excluded_domains). Also we set a number for maximum simultaneous connections we will be using later ($max_connections). Then we connect to the database, fetch posts that contain links, and collect them into an array ($url_list).

Following code might be a little complex, so I will try to explain it in small steps.


// 1. multi handle
$mh = curl_multi_init();

// 2. add multiple URLs to the multi handle
for ($i = 0; $i < $max_connections; $i++) {
	add_url_to_multi_handle($mh, $url_list);
}

// 3. initial execution
do {
	$mrc = curl_multi_exec($mh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

// 4. main loop
while ($active && $mrc == CURLM_OK) {

	// 5. there is activity
	if (curl_multi_select($mh) != -1) {

		// 6. do work
		do {
			$mrc = curl_multi_exec($mh, $active);
		} while ($mrc == CURLM_CALL_MULTI_PERFORM);

		// 7. is there info?
		if ($mhinfo = curl_multi_info_read($mh)) {
			// this means one of the requests were finished

			// 8. get the info on the curl handle
			$chinfo = curl_getinfo($mhinfo['handle']);

			// 9. dead link?
			if (!$chinfo['http_code']) {
				$dead_urls []= $chinfo['url'];

			// 10. 404?
			} else if ($chinfo['http_code'] == 404) {
				$not_found_urls []= $chinfo['url'];

			// 11. working
			} else {
				$working_urls []= $chinfo['url'];
			}

			// 12. remove the handle
			curl_multi_remove_handle($mh, $mhinfo['handle']);
			curl_close($mhinfo['handle']);

			// 13. add a new url and do work
			if (add_url_to_multi_handle($mh, $url_list)) {

				do {
					$mrc = curl_multi_exec($mh, $active);
				} while ($mrc == CURLM_CALL_MULTI_PERFORM);
			}
		}
	}
}

// 14. finished
curl_multi_close($mh);

echo "==Dead URLs==\n";
echo implode("\n",$dead_urls) . "\n\n";

echo "==404 URLs==\n";
echo implode("\n",$not_found_urls) . "\n\n";

echo "==Working URLs==\n";
echo implode("\n",$working_urls);

// 15. adds a url to the multi handle
function add_url_to_multi_handle($mh, $url_list) {
	static $index = 0;

	// if we have another url to get
	if ($url_list[$index]) {

		// new curl handle
		$ch = curl_init();

		// set the url
		curl_setopt($ch, CURLOPT_URL, $url_list[$index]);
		// to prevent the response from being outputted
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
		// follow redirections
		curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
		// do not need the body. this saves bandwidth and time
		curl_setopt($ch, CURLOPT_NOBODY, 1);

		// add it to the multi handle
		curl_multi_add_handle($mh, $ch);


		// increment so next url is used next time
		$index++;

		return true;
	} else {

		// we are done adding new URLs
		return false;
	}
}

And here is the explanation for the code above. Numbers in the list correspond to the numbers in the code comments.

Created a multi handle.
We will be creating the add_url_to_multi_handle() function later on. Every time it is called, it will add a url to the multi handle. Initially, we add 10 (based on $max_connections) URLs to the multi handle.
We must run curl_multi_exec() for the initial work. As long as it returns CURLM_CALL_MULTI_PERFORM, there is work to do. This is mainly for creating the connections. It does not wait for the full URL response.
This main loop runs as long as there is some activity in the multi handle.
curl_multi_select() waits the script until an activity to happens with any of the URL quests.
Again we must let cURL do some work, mainly for fetching response data.
We check for info. There is an array returned if a URL request was finished.
There is a cURL handle in the returned array. We use that to fetch info on the individual cURL request.
If the link was dead or timed out, there will be no http code.
If the link was a 404 page, the http code will be set to 404.
Otherwise we assume it was a working link. (You may add additional checks for 500 error codes etc...)
We remove the cURL handle from the multi handle since it is no longer needed, and close it.
We can now add another url to the multi handle, and again do the initial work before moving on.
Everything is finished. We can close the multi handle and print a report.
This is the function that adds a new url to the multi handle. The static variable $index is incremented every time this function is called, so we can keep track of where we left off.

I ran the script on my blog (with some broken links added on purpose, for testing), and here is what it looked like:

It took only less than 2 seconds to go through about 40 URLs. The performance gains are significant when dealing with even larger sets of URLs. If you open ten connections at the same time, it can run up to ten times faster. Also you can just utilize the non-blocking nature of the multi curl handle to do URL requests without stalling your web script.

Some Other Useful cURL Options

HTTP Authentication

If there is HTTP based authentication on a URL, you can use this:


$url = "http://www.somesite.com/members/";

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// send the username and password
curl_setopt($ch, CURLOPT_USERPWD, "myusername:mypassword");

// if you allow redirections
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
// this lets cURL keep sending the username and password
// after being redirected
curl_setopt($ch, CURLOPT_UNRESTRICTED_AUTH, 1);

$output = curl_exec($ch);

curl_close($ch);

FTP Upload

PHP does have an FTP library, but you can also use cURL:


// open a file pointer
$file = fopen("/path/to/file", "r");

// the url contains most of the info needed
$url = "ftp://username:password@mydomain.com:21/path/to/new/file";

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// upload related options
curl_setopt($ch, CURLOPT_UPLOAD, 1);
curl_setopt($ch, CURLOPT_INFILE, $fp);
curl_setopt($ch, CURLOPT_INFILESIZE, filesize("/path/to/file"));

// set for ASCII mode (e.g. text files)
curl_setopt($ch, CURLOPT_FTPASCII, 1);

$output = curl_exec($ch);
curl_close($ch);

Using a Proxy

You can perform your URL request through a proxy:


$ch = curl_init();

curl_setopt($ch, CURLOPT_URL,'http://www.example.com');

curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// set the proxy address to use
curl_setopt($ch, CURLOPT_PROXY, '11.11.11.11:8080');

// if the proxy requires a username and password
curl_setopt($ch, CURLOPT_PROXYUSERPWD,'user:pass');

$output = curl_exec($ch);

curl_close ($ch);

Callback Functions

It is possible to have cURL call given callback functions during the URL request, before it is finished. For example, as the contents of the response is being downloaded, you can start using the data, without waiting for the whole download to complete.


$ch = curl_init();

curl_setopt($ch, CURLOPT_URL,'https://code.tutsplus.com');

curl_setopt($ch, CURLOPT_WRITEFUNCTION,"progress_function");

curl_exec($ch);

curl_close ($ch);


function progress_function($ch,$str) {

	echo $str;
	return strlen($str);

}

The callback function MUST return the length of the string, which is a requirement for this to work properly.

As the URL response is being fetched, every time a data packet is received, the callback function is called.

Conclusion

We have explored the power and the flexibility of the cURL library today. I hope you enjoyed and learned from the this article. Next time you need to make a URL request in your web application, consider using cURL.

Thank you and have a great day!

Write a Plus Tutorial

Did you know that you can earn up to $600 for writing a PLUS tutorial and/or screencast for us? We're looking for in depth and well-written tutorials on HTML, CSS, PHP, and JavaScript. If you're of the ability, please contact Jeffrey at nettuts@tutsplus.com.

Please note that actual compensation will be dependent upon the quality of the final tutorial and screencast.