Sapiens

Scribus And PDF In Exciting New Symbiosis

1. Scribus and PDF

There have been ongoing discussions about the way Scribus handles imported PDF files. So far, the only way to include another PDF file in a Scribus document is to load it into an image frame. Scribus will then use Ghostscript to rasterize the PDF, and from that moment on, the PDF file is just another raster image. While this is fine for many purposes, there are some reasons why this may not be what you want.

2. The XObject approach

The PDF standard offers the possibility of treating a content stream (like the one describing the PDF page you want to import) as something like a standalone image object that can itself be put onto a page. This type of object is called a "Form XObject" (do not confuse this with something like HTML forms; these things have nothing to do with each other). So the somewhat obvious way to have Scribus import a PDF and retain its information all the way until a complete PDF is exported again, is to take the content stream of the external PDF, make it an XObject in the to-be-exported PDF, and put it wherever you want it placed.
There are quite a few disadvantages to this approach; refer to the Scribus mailing list for more and deeper discussions on this.
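To make the idea of a Form XObject a bit more concrete, here is a minimal Python sketch of how a content stream could be wrapped as one and then painted onto a page. This is only an illustration, not code taken from Sapiens: the function names, the empty default `/Resources` dictionary, and the assumption that the content stream is uncompressed are all mine.

```python
def make_form_xobject(content, bbox, resources=b"<< >>"):
    """Wrap an uncompressed content stream as the body of a Form XObject.

    content   -- the raw content stream bytes of the imported page
    bbox      -- (llx, lly, urx, ury), normally the imported page's MediaBox
    resources -- resource dictionary the content stream refers to (hypothetical default)
    """
    header = b"<< /Type /XObject /Subtype /Form /BBox [%g %g %g %g] /Resources %s /Length %d >>" % (
        *bbox, resources, len(content))
    return header + b"\nstream\n" + content + b"\nendstream"


def draw_form_xobject(name, x, y):
    """Page content snippet that paints the XObject called `name` at position (x, y)."""
    # q/Q save and restore the graphics state, cm moves the origin, Do paints the object.
    return b"q 1 0 0 1 %g %g cm /%s Do Q" % (x, y, name)


if __name__ == "__main__":
    body = make_form_xobject(b"0 0 100 50 re f", (0, 0, 612, 792))
    print(body.decode("ascii"))
    print(draw_form_xobject(b"Fm1", 72, 100).decode("ascii"))
```

The returned body still has to be written as a numbered object (`N 0 obj ... endobj`) and registered under `/XObject` in the page's resources before the `Do` operator can find it.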

3. Sequential updates

The PDF standard introduces a way of changing the contents of a PDF file by just appending things to the end of the file (the PDF specification calls this an "incremental update"). This has the advantage of not having to worry about anything that one doesn't care about. So, we can just put the stuff we need from the imported PDFs at the end of Scribus' PDF and don't have to be concerned about messing up too much. That way, of course, the original image is still in the PDF, but not used any longer. We'll come to this later.
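As a rough sketch of what such an appended update (what I call a sequential update here; the PDF specification says "incremental update") looks like, the following Python function redefines one object by appending it, a one-entry cross-reference section, and a new trailer that points back to the old cross-reference table via `/Prev`. It is not the actual Sapiens code. It assumes a classic `xref` table (not a cross-reference stream), an unencrypted file, and a trailer containing `/Size` and `/Root`; the function name is hypothetical.

```python
import re


def append_incremental_update(pdf_bytes, obj_num, obj_body):
    """Return pdf_bytes followed by an update section that redefines object obj_num."""
    # Offset of the most recent cross-reference table; the new trailer links to it.
    prev_xref = int(re.findall(rb"startxref\s+(\d+)\s+%%EOF", pdf_bytes)[-1])

    # Read /Size and /Root from the last trailer (assumes a classic trailer dictionary).
    trailer = pdf_bytes[pdf_bytes.rindex(b"trailer"):]
    size = int(re.search(rb"/Size\s+(\d+)", trailer).group(1))
    root = re.search(rb"/Root\s+(\d+\s+\d+\s+R)", trailer).group(1)

    out = bytearray(pdf_bytes)
    if not out.endswith(b"\n"):
        out += b"\n"

    # The new definition of the object; readers use the latest definition they find.
    obj_offset = len(out)
    out += b"%d 0 obj\n" % obj_num + obj_body + b"\nendobj\n"

    # A cross-reference section with a single subsection holding a single entry.
    # Each entry is exactly 20 bytes: 10-digit offset, 5-digit generation, "n", EOL.
    xref_offset = len(out)
    out += b"xref\n%d 1\n" % obj_num
    out += b"%010d 00000 n \n" % obj_offset

    out += b"trailer\n<< /Size %d /Root %s /Prev %d >>\n" % (
        max(size, obj_num + 1), root, prev_xref)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Everything that was in the original file stays where it was; only the lookup table at the end decides which definition of the object is the current one.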

4. Sapiens

With all of this in mind, I started creating a script called Sapiens that would take a PDF as exported by Scribus and replace the rasterized PDF with the real thing. The script is written in Python (the first draft was actually a shell script with lots of grep and head and tail and sed and and and... believe me, it wasn't pretty, and I'm not saying that the Python version is good or nice programming). Obviously, since this script only looks at the PDF file, it has to be told which images to look at and which PDF files to replace them with. Therefore, I had to insert one single line of code into the Scribus source file pdflib_core.cpp (line 6398 in my quite recent svn snapshot) as follows:
            ...
            Seite.ImgObjects[ResNam+"I"+QString::number(ResCount)] = ObjCounter-1;
            ResCount++;
        }
        if (extensionIndicatesPDF(ext)) {PutDoc("%SAPIENS:"+fn+"\n");} // <-- This is the new line
        StartObj(ObjCounter);
        ObjCounter++;
        PutDoc("<<\n/Type /XObject\n/Subtype /Image\n");
        ...
This way, Scribus introduces a comment into the PDF right before actually writing the image object; this comment then contains "SAPIENS:" and the path and file name of the PDF file that this image was rasterized from.
Sapiens now scans the PDF for all occurrences of this comment. For each occurrence, it retrieves the first page of the respective PDF file, turns it into a Form XObject, and replaces the Image XObject with this new Form XObject by means of a sequential update. It also retrieves all other objects that this Form XObject needs and adds them, relabeling them so as not to overwrite any objects that are still needed.
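The scanning and relabeling steps could be sketched in Python roughly as follows. This is an illustration rather than the real Sapiens source. It assumes that the object header (`N 0 obj`) follows the comment on the very next line, which is what the patch above suggests, and the renumbering is a naive regular-expression replacement that would also hit text looking like a reference inside strings or streams. All function names are made up for this example.

```python
import re

# The "%SAPIENS:" marker comes from the patched Scribus; the object header on the
# following line is an assumption about what StartObj writes.
MARKER = re.compile(
    rb"^%SAPIENS:(?P<path>[^\r\n]*)\r?\n(?P<num>\d+) (?P<gen>\d+) obj",
    re.MULTILINE,
)

# An indirect reference such as "12 0 R".
REF = re.compile(rb"\b(\d+) (\d+) R\b")


def find_marked_images(pdf_bytes):
    """Return (source_pdf_path, object_number) for every marked image object."""
    return [
        (m.group("path").decode("utf-8", "replace"), int(m.group("num")))
        for m in MARKER.finditer(pdf_bytes)
    ]


def renumber_refs(obj_body, offset):
    """Shift every indirect reference in obj_body by `offset` object numbers.

    With an offset at least as large as the /Size of the target PDF, objects
    copied from the imported file cannot collide with existing ones.
    """
    return REF.sub(
        lambda m: b"%d %s R" % (int(m.group(1)) + offset, m.group(2)),
        obj_body,
    )


if __name__ == "__main__":
    sample = (b"%PDF-1.4\n%SAPIENS:/home/me/logo.pdf\n12 0 obj\n"
              b"<<\n/Type /XObject\n/Subtype /Image\n>>\nendobj\n")
    print(find_marked_images(sample))
    print(renumber_refs(b"<< /Font 5 0 R /Kids [1 0 R 2 0 R] >>", 100))
```

The object number returned for each marker is the one that gets redefined as a Form XObject, so the page content that draws the image does not have to be touched.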

5. What works and what doesn't

Sapiens is by no means a ready-to-use tool. There is very little error checking (and errors are only reported as warnings, except for exceptions, which are left to the Python interpreter and not cared for at all), and there are many things Sapiens does not or cannot do. I hope I have made clear that you shouldn't trust Sapiens to do what you want. If, however, it seems to have worked for you, you will want to run the resulting file through something like ps2pdf to create a PDF file that you can feel pretty safe about (this will also fix my gradient problem mentioned above), and that doesn't contain the unnecessary images anymore.

6. Here it is

So, if you have read and understood the above, if you understand that I have written Sapiens for my personal use and thus it is not a "doing the work for you so you don't have to worry" tool (or as you might say, Sapiens knows as much about alpha releases as a tea leaf knows about the East India Company), if you understand that you have to alter the actual Scribus source code as mentioned above and thus rebuild Scribus (if you use a precompiled version, sorry: not for you), if you acknowledge that Sapiens should not be considered safe in any way or for any purpose and that it might turn your computer into a heap of crap that is as useless as an XP device driver on Vista, and if you promise to still consider me a nice person if that happens – if all that is true, you may click here.

Benjamin Dumke, scribus@guess-the-domain-from-this-pages-location.de
Updated 2008-03-28