Aug 04

Introduction to Ghostscript

The Context

Have you ever run into a situation where you’ve got a PDF file and want to easily generate an image from one or more of its pages? Or extract its text content? Or merge multiple PDFs? Or convert an image to various raster formats? Yes, sure, there are many programs that you can install to do any of the above.

The problem is: what happens when you have many different PDFs, each of which has 100 or more pages? What happens when the pet store is closed and you can’t buy a gazillion pet monkeys and train them to do PDF-to-image conversion? What happens when Minions-R-Us has a going-out-of-business sale and you can’t find parking before all the minions are sold out?

If you can personally relate to any of these three questions (and hopefully just the first question, because training monkeys to do your bidding is probably illegal and just so that you know, minions don’t actually exist… sorry…), then the solution to your problems can be found in using Ghostscript.

What is Ghostscript?

Ghostscript is an interpreter for the PostScript programming language, created by Adobe Systems, and for the page description language used in PDF files. PostScript can be used in your program to create vector graphics. Vector graphics are built from geometrical primitives such as points, lines, curves, shapes or polygons that are defined mathematically to represent images in computer graphics. Because they are defined mathematically, they can be enlarged without losing any quality.

Compare vector graphics to rasterised graphics such as BMP, JPEG or PNG images, where the image is made up of millions of pixels – tiny squares of single colours that are arranged so that together they display a full image. If you enlarge the image, you are essentially just enlarging each one of the tiny squares or pixels that the image is made of. Quality is lost as a result, because you start to see the individual pixels and the image no longer appears smooth.

Vector graphic vs raster image at 10x magnification

The content of a PDF page is mostly described using vector graphics, which is why you can zoom into a PDF page without the text and line art losing quality the way rasterised graphics do. To view a PDF, you need a program that can interpret the PDF’s page description language and draw the vector graphics. These programs generally only allow you to read the PDF. If you want to generate a rasterised image of one or more of the PDF’s pages, you generally need a different program that provides that functionality. But using those programs requires a significant amount of manual labor, especially if you have many large PDFs that you want to process.

This is exactly where Ghostscript makes its dramatic entrance.

Ghostscript is written entirely in the C programming language and has been ported to run on a wide variety of operating systems including Microsoft Windows, Apple Mac OS and the many different types of Unix-based operating systems. There are various graphical user interfaces available that make use of Ghostscript to allow you to view a PostScript file or PDF on screen. Some even allow you to generate images from PDF pages, but as mentioned earlier, it can be a very time consuming task, especially if you need to convert hundreds of PDF pages. In this blog post we’re going to look at using Ghostscript in the console or terminal in order to batch process a PDF.

Installing Ghostscript

The most recent release of Ghostscript can be downloaded directly from the Ghostscript website. If you’re running Microsoft Windows, you can download and install the standalone *.exe file. If you’re running Apple Mac OS, you can download the standalone build which is accessible through the Terminal. To build and install Ghostscript from its source, download the cross-platform source archive. Once you’ve downloaded the source zip archive, extract it to a folder on your computer. If you’re running Apple Mac OS, open the Terminal app, browse to the extracted folder and run the configure script, e.g.

    cd /path/to/ghostscript-#.#
    ./configure

This will run several scripts in order to generate a makefile. Once the makefile has been generated, you can build and install Ghostscript by entering the following command (depending on where it installs to, you may need to prefix it with sudo):

    make install

The system will then run all the necessary scripts to build Ghostscript.

What you can expect to see while Ghostscript builds and installs. All I see is blonde... brunette... red-head...

Once it completes, you can check that it installed correctly by printing the installed version with the following command:

    gs --version

If Ghostscript was installed successfully, the current version will be displayed e.g. 9.14

How to use Ghostscript

Following are some basic examples of how to use Ghostscript for various tasks. For a detailed guide on how to use Ghostscript, you can view the official documentation online. These examples assume that you built Ghostscript from the source and are running Apple Mac OS, but you should be able to easily use the standalone builds as well. You will just need to move the standalone builds into your working folder.

Convert PDF Pages to Images

Open the Terminal and navigate to the folder that contains the PDF whose pages you want to convert to images. If you enter the following command, Ghostscript will convert each PDF page into a PNG image that fits within 768×1024 pixels:

    gs -dNOPAUSE -sDEVICE=png16m -sOutputFile=image%i.png -dPDFFitPage -g768x1024 -q example.pdf -c quit

Following is an explanation of each of the flags:

  • -dNOPAUSE: do not pause and await user input before continuing on to the next page
  • -sDEVICE: select the output device, which determines the output file format, in this case a PNG with 16 million colours (other devices include pnggray, pngalpha, jpeg and jpeggray)
  • -sOutputFile: the filename of the output image(s). Use %i or %d as a placeholder to automatically create images with the page number
  • -dPDFFitPage: scale the PDF page to fit into the specified dimensions. Aspect ratio is retained
  • -g: the dimensions in pixels of the generated image
  • -q: quiet mode does not output progress to the Terminal window while the PDF is being processed
  • -c quit: after processing is complete, quit Ghostscript

After Ghostscript completes, there are images named ‘image#.png’ for each page of the PDF in the same folder.

The quality of the generated PNG images is excellent, since PNG uses lossless compression. To reduce the filesize of the generated images, consider using JPEG as the output format, which uses lossy compression. You can then specify the output quality of the image to trade quality for filesize:

    gs -dNOPAUSE -sDEVICE=jpeg -sOutputFile=image%i.jpg -dJPEGQ=90 -dPDFFitPage -g768x1024 -q example.pdf -c quit

The filesize will be significantly smaller, but the quality will take a hit. If you only want to convert certain pages, you can specify the start and end page numbers. To improve quality without really affecting filesize, you can enable anti-aliasing by setting how many bits should be used when rendering text (-dTextAlphaBits) and graphics (-dGraphicsAlphaBits). To improve performance, you can set the number of rendering threads to be used.

    gs -dNumRenderingThreads=4 -dNOPAUSE -sDEVICE=jpeg -dFirstPage=1 -dLastPage=5 -sOutputFile=image%i.jpg -dJPEGQ=90 -dPDFFitPage -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -g768x1024 -q example.pdf -c quit
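
The commands above handle one PDF at a time. For the "many different PDFs" situation from the start of this post, one option is to wrap the command in a small shell loop. The following is only a rough sketch: pdfs_to_png is a made-up helper name (it is not part of Ghostscript), and the one-sub-folder-per-PDF layout is just an assumption about how you might want the output organised.

```shell
# Convert every page of every PDF in a folder to PNG images.
# pdfs_to_png is a hypothetical helper, not something Ghostscript provides.
pdfs_to_png() {
    src_dir=$1   # folder containing the PDFs
    out_dir=$2   # folder that receives one sub-folder per PDF

    for pdf in "$src_dir"/*.pdf; do
        # If the folder has no PDFs the pattern stays unexpanded; skip it.
        [ -e "$pdf" ] || continue

        name=$(basename "$pdf" .pdf)
        mkdir -p "$out_dir/$name"

        # Same flags as the single-file PNG example earlier in this section.
        gs -dNOPAUSE -sDEVICE=png16m \
           -sOutputFile="$out_dir/$name/image%i.png" \
           -dPDFFitPage -g768x1024 -q "$pdf" -c quit
    done
}
```

Calling pdfs_to_png ~/pdfs ~/images should then leave the pages of ~/pdfs/report.pdf in ~/images/report/, and so on for each PDF in the folder.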

Get the Total Number of Pages

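To print the total number of pages in a PDF, use the following command. Recent Ghostscript releases run in a safer mode by default that restricts file access from PostScript code, so if the command fails with a file access error you may need to add -dNOSAFER: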

    gs -q -dNODISPLAY -c "(example.pdf) (r) file runpdfbegin pdfpagecount = quit"
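
If you need the page count for more than one file, the command can be wrapped in a shell function. This is a minimal sketch: pdf_page_count and list_page_counts are hypothetical names, and it assumes the file path contains no parentheses or backslashes, since those would need escaping inside the PostScript string.

```shell
# Print the number of pages in one PDF.
# pdf_page_count is a hypothetical wrapper around the command shown above.
pdf_page_count() {
    # $1: path to the PDF. Assumed to contain no ( ) or \ characters.
    # Newer Ghostscript releases may also need a flag that permits reading
    # the file (for example -dNOSAFER).
    gs -q -dNODISPLAY -c "($1) (r) file runpdfbegin pdfpagecount = quit"
}

# Print "<page count><TAB><file>" for every PDF in a folder.
list_page_counts() {
    for pdf in "$1"/*.pdf; do
        [ -e "$pdf" ] || continue   # no PDFs in the folder
        printf '%s\t%s\n' "$(pdf_page_count "$pdf")" "$pdf"
    done
}
```

For example, list_page_counts . prints one line per PDF in the current folder.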

Extract PDF Page Text

To extract a PDF’s page text content, enter the following command:

    gs -dNOPAUSE -sDEVICE=txtwrite -dFirstPage=1 -dLastPage=10 -sOutputFile=output.txt -q example.pdf -c quit

This will extract the text content of pages 1 to 10 and write it to a text file named ‘output.txt’.

A text file containing the extracted text content is generated in the same folder.
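
If you would rather have one text file per page, one way is to run the same command once per page with -dFirstPage and -dLastPage set to the same number. A rough sketch, where extract_pages_text and the page#.txt naming are made up for illustration:

```shell
# Write the text of each page in a range to its own file.
# extract_pages_text is a hypothetical helper, not part of Ghostscript.
extract_pages_text() {
    pdf=$1       # input PDF
    first=$2     # first page to extract
    last=$3      # last page to extract
    out_dir=$4   # folder that receives page<N>.txt files

    mkdir -p "$out_dir"
    page=$first
    while [ "$page" -le "$last" ]; do
        # Restrict Ghostscript to a single page per run.
        gs -dNOPAUSE -sDEVICE=txtwrite \
           -dFirstPage="$page" -dLastPage="$page" \
           -sOutputFile="$out_dir/page$page.txt" -q "$pdf" -c quit
        page=$((page + 1))
    done
}
```

Calling extract_pages_text example.pdf 1 10 text would then produce text/page1.txt through text/page10.txt. Starting Ghostscript once per page is slower than a single run, so this is a trade of speed for convenience.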

If you want to output the text content to the Terminal window instead, use a dash as the output filename:

    gs -dNOPAUSE -sDEVICE=txtwrite -dFirstPage=1 -dLastPage=10 -sOutputFile=- -q example.pdf -c quit
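
Because the text now goes to standard output, it can be piped into other command-line tools. As a small sketch, the hypothetical pdf_grep function below (the name is an assumption, not a Ghostscript feature) searches the whole PDF for a pattern, ignoring case:

```shell
# Print every line of a PDF's extracted text that matches a pattern.
# pdf_grep is a hypothetical helper built on the command shown above.
pdf_grep() {
    pattern=$1   # text or regular expression to search for
    pdf=$2       # input PDF

    # txtwrite sends the text to stdout; grep filters it line by line.
    gs -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=- -q "$pdf" -c quit |
        grep -i -- "$pattern"
}
```

For example, pdf_grep invoice example.pdf prints the lines that mention "invoice". How the text is split into lines depends on how txtwrite reconstructs the page layout, so matches across line breaks can be missed.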

Using Ghostscript with PHP

The examples shown above are all console commands entered into the Terminal window. How would you then be able to use Ghostscript in a PHP script? By passing the command to PHP’s shell_exec() function:

    $output = shell_exec('gs -dNOPAUSE -sDEVICE=png16m -sOutputFile=/path/to/output/image%i.png -dPDFFitPage -g768x1024 -q /path/to/input/example.pdf -c quit');

This runs the same conversion as the first example, and anything Ghostscript prints to the Terminal is stored in the $output variable. Storing the output in a variable is especially useful if you want to extract a PDF page’s text content:

    $extracted_text = shell_exec('gs -dNOPAUSE -sDEVICE=txtwrite -dFirstPage=1 -dLastPage=10 -sOutputFile=- -q example.pdf -c quit');

To Conclude

As with most things in life, with great power comes great responsibility. Ghostscript is a very powerful tool that can be used for various format conversions such as from PDF page to image and vice versa. It can also be used to interpret a PDF page’s description language in order to extract text content or get the total page count. There are many more powerful ways in which you can use Ghostscript. That is where the responsibility part comes in: it is your responsibility to head on over to the Ghostscript documentation and spend hours upon hours fiddling with commands, tweaking options and discovering absolutely everything you can possibly do with Ghostscript!