Compressing and Extracting Files in Python

If you have been using computers for some time, you have probably come across files with the .zip extension. They are special files that can hold the compressed content of many other files, folders, and subfolders. This makes them pretty useful for transferring files over the internet. Did you know that you can use Python to compress or extract files?

This tutorial will teach you how to use the zipfile module in Python to compress or extract files, either one at a time or several at once.

Compressing Individual Files

This one is easy and requires very little code. We begin by importing the zipfile module and then open the ZipFile object in write mode by specifying the second parameter as 'w'. The first parameter is the path of the zip archive that we want to create. The file to compress is then passed to the write() method. Here is the code that you need:

import zipfile

with zipfile.ZipFile('C:\\Stories\\Fantasy\\jungle.zip', 'w') as jungle_zip:
    jungle_zip.write('C:\\Stories\\Fantasy\\jungle.pdf', compress_type=zipfile.ZIP_DEFLATED)

Please note that I will specify the path in all the code snippets in a Windows style format; you will need to make appropriate changes if you are on Linux or Mac.
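If you would rather not hard-code a path style at all, one option is to build paths with pathlib. The following is only a minimal sketch: the temporary folder and the jungle.txt sample file are stand-ins I made up so that the snippet runs on any system, and it assumes Python 3.6.2 or later, where zipfile accepts path-like objects.

```python
import tempfile
import zipfile
from pathlib import Path

# A temporary folder stands in for C:\Stories\Fantasy (assumption for the demo).
base = Path(tempfile.mkdtemp())
source = base / "jungle.txt"  # hypothetical sample file
source.write_text("Deep in the jungle...\n" * 50)

archive = base / "jungle.zip"
with zipfile.ZipFile(archive, "w") as jungle_zip:
    # arcname=source.name stores only the file name, not the folder path.
    jungle_zip.write(source, arcname=source.name, compress_type=zipfile.ZIP_DEFLATED)

with zipfile.ZipFile(archive) as jungle_zip:
    names = jungle_zip.namelist()

print(names)
```

The / operator joins path segments with the correct separator for the current operating system, so the same code works on Windows, Linux, and Mac.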

You can specify different compression methods to compress files. If you pass neither a compression argument to ZipFile nor a compress_type to write(), the files are stored without any compression (ZIP_STORED). The newer methods BZIP2 and LZMA were added in Python version 3.3, and some other tools don't support these two compression methods. For this reason, DEFLATED is the safest choice when the archive has to be opened elsewhere. You should still try out the other methods to see the difference in the size of the compressed file.
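A quick way to see the difference is to compress the same data with each method and compare the resulting archive sizes. This is a rough sketch, not a benchmark: the repeated sample sentence is made up, the archives are built in memory with io.BytesIO, and it assumes your Python build includes the bz2 and lzma modules.

```python
import io
import zipfile

# Highly repetitive sample data (made up), so every real method can shrink it.
data = b"The quick brown fox jumps over the lazy dog.\n" * 2000

methods = {
    "ZIP_STORED": zipfile.ZIP_STORED,
    "ZIP_DEFLATED": zipfile.ZIP_DEFLATED,
    "ZIP_BZIP2": zipfile.ZIP_BZIP2,
    "ZIP_LZMA": zipfile.ZIP_LZMA,
}

sizes = {}
for label, method in methods.items():
    buffer = io.BytesIO()  # in-memory archive, nothing is written to disk
    with zipfile.ZipFile(buffer, "w", compression=method) as archive:
        archive.writestr("sample.txt", data)
    sizes[label] = len(buffer.getvalue())

for label, size in sizes.items():
    print(label, size, "bytes")
```

The exact numbers depend on your data, so it is worth running this kind of comparison on files similar to the ones you actually plan to archive.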

Compressing Multiple Files

This is slightly more complex, as you need to iterate over all the files. The code below should compress all files with the .pdf extension in a given folder and its subfolders:

import os
import zipfile

source_dir = 'C:\\Stories\\Fantasy'

# The with statement closes the archive even if an error occurs while writing.
with zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip', 'w') as fantasy_zip:
    for folder, subfolders, files in os.walk(source_dir):
        for file in files:
            if file.endswith('.pdf'):
                file_path = os.path.join(folder, file)
                # Store each file under its path relative to the source folder.
                fantasy_zip.write(file_path, os.path.relpath(file_path, source_dir), compress_type=zipfile.ZIP_DEFLATED)

This time, we have imported the os module and used its walk() function to go over all files and subfolders inside our original folder. I am only compressing the PDF files in the directory. You can also create a separate archive for each file format using if statements.
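Here is one possible way to split files into separate archives by format. To keep the sketch short, it swaps the chain of if statements for a dictionary that maps each extension to its own archive. The folder contents, the archive names pdfs.zip and texts.zip, and the temporary folders are all made up for the demonstration.

```python
import os
import tempfile
import zipfile

# Build a small sample folder (hypothetical files) so the sketch is runnable.
root = tempfile.mkdtemp()
for name in ("a.pdf", "b.pdf", "notes.txt"):
    with open(os.path.join(root, name), "w") as handle:
        handle.write("sample\n")

# The archives live in a separate folder so os.walk() never sees them.
out_dir = tempfile.mkdtemp()
archives = {
    ".pdf": zipfile.ZipFile(os.path.join(out_dir, "pdfs.zip"), "w"),
    ".txt": zipfile.ZipFile(os.path.join(out_dir, "texts.zip"), "w"),
}

try:
    for folder, subfolders, files in os.walk(root):
        for file in files:
            extension = os.path.splitext(file)[1].lower()
            if extension in archives:
                full_path = os.path.join(folder, file)
                archives[extension].write(
                    full_path,
                    os.path.relpath(full_path, root),
                    compress_type=zipfile.ZIP_DEFLATED,
                )
finally:
    # Close every archive, even if something went wrong while writing.
    for archive in archives.values():
        archive.close()

with zipfile.ZipFile(os.path.join(out_dir, "pdfs.zip")) as pdf_zip:
    pdf_names = sorted(pdf_zip.namelist())

print(pdf_names)
```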

If you don't want to preserve the directory structure, you can put all the files together by using the following line:

fantasy_zip.write(os.path.join(folder, file), file, compress_type = zipfile.ZIP_DEFLATED)

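The write() method takes the name of the file that we want to compress as its first parameter. The second parameter, arcname, is optional and allows you to specify a different name for the file inside the archive. If it is omitted, the original path is used, minus any drive letter and leading path separators. The third parameter, compress_type, is also optional and selects the compression method for that file.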
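To make the effect of the second parameter concrete, the sketch below adds the same file twice, once with the default name and once with an explicit one. The file story.txt and the name renamed/story.txt are invented for the example, and a temporary folder is used so the code runs anywhere.

```python
import os
import tempfile
import zipfile

work = tempfile.mkdtemp()
story_path = os.path.join(work, "story.txt")  # hypothetical sample file
with open(story_path, "w") as handle:
    handle.write("Once upon a time...\n")

archive_path = os.path.join(work, "names.zip")
with zipfile.ZipFile(archive_path, "w") as names_zip:
    # No second parameter: the stored name is derived from the full path.
    names_zip.write(story_path)
    # Explicit second parameter: the file is stored under this name instead.
    names_zip.write(story_path, "renamed/story.txt")

with zipfile.ZipFile(archive_path) as names_zip:
    default_name, custom_name = names_zip.namelist()

print(default_name)
print(custom_name)
```

The first name printed contains the folders of the temporary path, which is usually not what you want in an archive. That is why the earlier examples pass a relative path or a bare file name.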

Extracting All Files

You can use the extractall() method to extract all the files and folders from a zip file into the current working directory. You can also pass a folder name to extractall() to extract all files and folders in a specific directory. If the folder that you passed does not exist, this method will create one for you. Here is the code that you can use to extract files:

import zipfile

with zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip') as fantasy_zip:
    fantasy_zip.extractall('C:\\Library\\Stories\\Fantasy')

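If you only want to extract some of the files, you can supply their names as a list through the second parameter of extractall(), which is called members. Each name has to match one of the names returned by namelist().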
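A minimal sketch of extracting only a chosen subset might look like this. The archive is created on the fly in a temporary folder, and the file names one.txt, two.txt, and three.txt are placeholders.

```python
import os
import tempfile
import zipfile

work = tempfile.mkdtemp()
archive_path = os.path.join(work, "archive.zip")

# Create a small sample archive (hypothetical contents).
with zipfile.ZipFile(archive_path, "w") as sample_zip:
    for name in ("one.txt", "two.txt", "three.txt"):
        sample_zip.writestr(name, "content of " + name)

target = os.path.join(work, "extracted")
wanted = ["one.txt", "three.txt"]

with zipfile.ZipFile(archive_path) as sample_zip:
    # Only the listed members are extracted; two.txt stays in the archive.
    sample_zip.extractall(target, members=wanted)

extracted = sorted(os.listdir(target))
print(extracted)
```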

Extracting Individual Files

This is similar to extracting multiple files. The differences are that you use the extract() method instead of extractall(), and that you supply the name of the file first and the folder to extract it to second. Here is a basic code snippet to extract an individual file.

import zipfile

with zipfile.ZipFile('C:\\Stories\\Fantasy\\archive.zip') as fantasy_zip:
    fantasy_zip.extract('Fantasy Jungle.pdf', 'C:\\Stories\\Fantasy')

Getting Information About Files

Consider a scenario where you need to see if a zip archive contains a specific file. With what we have covered so far, your only option would be to extract all the files in the archive. Similarly, you may need to extract only those files which are larger than a specific size. The zipfile module allows us to inquire about the contents of an archive without ever extracting it.

Using the namelist() method of the ZipFile object will return a list of all members of an archive by name. To get information on a specific file in the archive, you can use the getinfo() method of the ZipFile object. This will give you access to information specific to that file, like the compressed and uncompressed size of the file or its last modification time. We will come back to that later.
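As a small illustration, the sketch below checks whether an archive contains a given file and shows what happens when getinfo() is asked about a name that does not exist. The archive is built in memory, and the file names are invented for the example.

```python
import io
import zipfile

# Build a small in-memory archive (hypothetical contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as stories_zip:
    stories_zip.writestr("The Thirsty Crow.txt", "A crow was thirsty.")
    stories_zip.writestr("fables/The Fox.txt", "A fox was hungry.")

with zipfile.ZipFile(buffer) as stories_zip:
    names = stories_zip.namelist()
    has_crow = "The Thirsty Crow.txt" in names

    # getinfo() raises KeyError when the name is not in the archive.
    try:
        stories_zip.getinfo("Missing.txt")
        missing_found = True
    except KeyError:
        missing_found = False

print(names)
print(has_crow, missing_found)
```

Note that files inside folders are listed with their full path within the archive, using forward slashes.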

Calling the getinfo() method one by one on all files can be a tiresome process when there are a lot of files that need to be processed. In this case, you can use the infolist() method to return a list containing a ZipInfo object for every single member in the archive. The objects appear in the same order as the entries in the actual zip file on disk.

You can also directly read the contents of a specific file from the archive using the read(file) method, where file is the name of the file that you intend to read. To do this, the archive must be opened in read or append mode.

To get the compressed size of an individual file from the archive, you can use the compress_size attribute. Similarly, to know the uncompressed size, you can use the file_size attribute.
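The two size attributes can be combined with infolist() to produce a quick report for a whole archive. This is only a sketch with made-up in-memory data. The file names repetitive.txt and tiny.txt are hypothetical.

```python
import io
import zipfile

# Build a small in-memory archive (hypothetical contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as sample_zip:
    sample_zip.writestr("repetitive.txt", "abc" * 10000)
    sample_zip.writestr("tiny.txt", "hi")

report = []
with zipfile.ZipFile(buffer) as sample_zip:
    for info in sample_zip.infolist():
        # file_size is the original size, compress_size the size in the archive.
        report.append((info.filename, info.file_size, info.compress_size))
        print(info.filename, info.file_size, "->", info.compress_size, "bytes")
```

Very small files may not shrink at all, because the compressed data carries a little overhead of its own.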

The following code uses the properties and methods we just discussed to extract only those files that have a size below 1MB.

import zipfile

with zipfile.ZipFile('C:\\Stories\\Funny\\archive.zip') as stories_zip:
    for file in stories_zip.namelist():
        if stories_zip.getinfo(file).file_size < 1024*1024:
            stories_zip.extract(file, 'C:\\Stories\\Short\\Funny')

To know the time and date when a specific file from the archive was last modified, you can use the date_time attribute. This will return a tuple of six values. The values will be the year, month, day of the month, hours, minutes, and seconds, in that specific order. The year will always be greater than or equal to 1980, and hours, minutes, and seconds are zero-based.

import zipfile

with zipfile.ZipFile('C:\\Stories\\Funny\\archive.zip') as stories_zip:
    thirsty_crow_info = stories_zip.getinfo('The Thirsty Crow.pdf')

    print(thirsty_crow_info.date_time)
    print(thirsty_crow_info.compress_size)
    print(thirsty_crow_info.file_size)

This information about the original file size and compressed file size can help you decide whether it is worth compressing a file. I am sure it can be used in some other situations as well.
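Since date_time is a plain tuple, it can be unpacked straight into a datetime object when you need to compare or format it. In this sketch the archive is built in memory, and the modification time is set by hand through a ZipInfo object so that the result is predictable. The file name and the timestamp are made up.

```python
import datetime
import io
import zipfile

# Create an entry with a known modification time (hypothetical values).
buffer = io.BytesIO()
crow_info = zipfile.ZipInfo("The Thirsty Crow.txt", date_time=(2021, 3, 14, 15, 9, 26))
with zipfile.ZipFile(buffer, "w") as stories_zip:
    stories_zip.writestr(crow_info, "A crow was thirsty.")

with zipfile.ZipFile(buffer) as stories_zip:
    stored = stories_zip.getinfo("The Thirsty Crow.txt").date_time

# The six values map directly onto the datetime constructor arguments.
modified = datetime.datetime(*stored)
print(modified.isoformat())
```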

Reading and Writing Content to Files

We were able to get a lot of important information about the files in our archive using their ZipInfo objects. Now, it is time to go a step further and get the actual content of those files. I have taken some text files from the Project Gutenberg website and created an archive with them. We will now read the contents of one of the files in the archive using the read() method. It will return the bytes of the given file as long as the archive containing the file is open for reading. Here is an example:

import zipfile


with zipfile.ZipFile('D:\\tutsplus-tests\\books.zip') as books:
    for file in books.namelist():
        if file == 'Frankenstein.txt':
            contents = books.read(file)
            
            # <class 'bytes'>
            print(type(contents))

            # b'\xef\xbb\xbfThe Project Gutenberg eBook of Frankenstein, by Mary Wollstonecraft
            print(contents)

            # 29
            print(contents.count(b'Frankenstein'))

            contents = contents.replace(b'Frankenstein', b'Crankenstein')

            # b'\xef\xbb\xbfThe Project Gutenberg eBook of Crankenstein, by Mary Wollstonecraft
            print(contents)

As you can see, the read() method returns a bytes object with all the content of the file we are reading. You can do a lot of operations on the contents of the file, like finding the position of any sub-sequence from either end of the data or making replacements as we did above. In our example, we work directly with byte strings. Because the files contain text, the bytes could also be decoded into regular strings.
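If you prefer to work with regular strings, you can decode the bytes yourself. The sketch below assumes the text is UTF-8 with a byte order mark, like the \xef\xbb\xbf prefix visible in the output above, and uses the utf-8-sig codec to strip it. The one-line stand-in for Frankenstein.txt is made up so that the snippet runs without the real book.

```python
import io
import zipfile

# Build an in-memory archive with a short stand-in for the real text file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as books:
    sample = "\ufeffFrankenstein; or, the Modern Prometheus\n".encode("utf-8")
    books.writestr("Frankenstein.txt", sample)

with zipfile.ZipFile(buffer) as books:
    raw = books.read("Frankenstein.txt")

# utf-8-sig decodes UTF-8 and drops the leading byte order mark if present.
text = raw.decode("utf-8-sig")

print(type(text))
print(text.replace("Frankenstein", "Crankenstein"))
```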

The ZipFile object also has a write() method, but it is used to add existing files to the archive and not to write content to those files themselves. One way to write content to a specific file inside the archive is to open it in write mode using the ZipFile object's open() method (supported since Python 3.6) and then call write() on the returned file object.

import zipfile

with zipfile.ZipFile('D:\\tutsplus-tests\\multiples.zip', 'w') as multiples_zip:
    for i in range(1, 101):
        with multiples_zip.open(str(i) + '.txt', 'w') as file:
            for j in range(1, 101):
                line = ' '.join(map(str, [i, 'x', j, '=', i*j ])) + '\n'
                number = bytes(line, 'utf-8')
                file.write(number)

The above code will create 100 text files inside the archive, one for each number from 1 to 100, with the first 100 multiples of that number stored in each file. We convert our string to bytes because the file object returned by open() expects a bytes-like object instead of a regular string.
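When the whole content of a file is already available in memory, the writestr() method of the ZipFile object is a shorter alternative, and it accepts regular strings as well as bytes. The following sketch writes a smaller version of the multiplication tables to an in-memory archive. The reduced ranges are only there to keep the example short.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as multiples_zip:
    for i in range(1, 4):
        lines = [f"{i} x {j} = {i * j}" for j in range(1, 11)]
        # writestr() takes the name inside the archive and the full content.
        multiples_zip.writestr(f"{i}.txt", "\n".join(lines) + "\n")

with zipfile.ZipFile(buffer) as multiples_zip:
    table_of_three = multiples_zip.read("3.txt").decode("utf-8")

print(table_of_three)
```

Opening a member with open() in write mode is still the better fit when the content is produced piece by piece and you do not want to hold all of it in memory at once.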

Final Thoughts

As is evident from this tutorial, using the zipfile module to compress files gives you a lot of flexibility. You can compress different files in a directory to different archives based on their type, name, or size. You also get to decide whether you want to preserve the directory structure or not. Similarly, while extracting the files, you can extract them to the location you want, based on your own criteria like size, etc.

To be honest, it was also pretty exciting for me to compress and extract files by writing my own code. I hope you enjoyed the tutorial, and if you have any questions, please let me know on the Envato forum.
