Sapiens

Scribus And PDF In Exciting New Symbiosis

1. Scribus and PDF

There have been ongoing discussions about the way Scribus handles imported PDF files. So far, the only way to include another PDF file in a Scribus document is to load it into an image frame. Scribus then uses Ghostscript to rasterize the PDF, and from that moment on, the PDF file is just another raster image. While this is fine for many purposes, there are some reasons why it may not be what you want:

In many cases, this will blow up the file size of the exported PDF, since raster image data (usually) takes up more space than, say, text information.
Sometimes you might not know what image resolution will be needed later, so the choice of the rastering resolution can only be a guess.
It contradicts the understanding of PDF as a device independent format, since the imported PDF is being rendered earlier in the process than it might be.
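
To put a rough number on the first two points, here is a small back-of-the-envelope sketch. The page size (A4, 595 x 842 points), the list of resolutions and the assumption of uncompressed 8-bit RGB data are illustrative choices, not values Scribus is known to use. Image data in a real PDF is usually compressed, so actual sizes will be smaller, but the quadratic growth with resolution stays the same.

```python
def raster_size(width_pt, height_pt, dpi, bytes_per_pixel=3):
    """Return (width_px, height_px, uncompressed_bytes) for a rastered page.

    PDF user space uses 72 points per inch, so pixels = points / 72 * dpi.
    bytes_per_pixel=3 assumes 8-bit RGB without an alpha channel.
    """
    width_px = round(width_pt / 72 * dpi)
    height_px = round(height_pt / 72 * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


if __name__ == "__main__":
    # A4 page at a few plausible resolutions (hypothetical choices).
    for dpi in (72, 150, 300, 600):
        w, h, size = raster_size(595, 842, dpi)
        print(f"{dpi:>4} dpi: {w} x {h} px, about {size / 1e6:.1f} MB uncompressed")
```

Doubling the resolution quadruples the amount of pixel data, which is why guessing "high enough just in case" gets expensive quickly.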

2. The XObject approach

The PDF standard offers the possibility of treating a content stream (like the one describing the PDF page you want to import) as something like a standalone image object that can itself be put onto a page. This type of object is called a "Form XObject" (do not confuse this with something like HTML forms; these things have nothing to do with each other). So the somewhat obvious way to have Scribus import a PDF and retain its information all the way until a complete PDF is exported again, is to take the content stream of the external PDF, make it an XObject in the to-be-exported PDF, and put it wherever you want it placed.
There are quite a few disadvantages to this approach, including the following:

If the imported PDF file does not conform to a certain standard (e.g. PDF/X-3, certain colorspace preferences, PDF version numbers or just plain conforming to the PDF standard itself), most likely the exported PDF won't do so either.
If you want to export to anything other than a PDF file, this approach is useless.

Refer to the Scribus mailing list for more and deeper discussions on this.
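
To make the Form XObject idea a bit more concrete, here is a minimal sketch of what such an object looks like when written out by hand. The function name make_form_xobject, the object numbers and the sample content are all hypothetical; the sketch also assumes an uncompressed content stream, so it writes no /Filter entry. It is not taken from Sapiens.

```python
def make_form_xobject(obj_num, content, bbox, resources_ref):
    """Wrap a page content stream in a Form XObject (illustrative only).

    content       -- raw, unfiltered content stream bytes
    bbox          -- (llx, lly, urx, ury) in points, e.g. the page's MediaBox
    resources_ref -- indirect reference to the resource dictionary, e.g. b"13 0 R"
    """
    header = (
        b"%d 0 obj\n"
        b"<< /Type /XObject\n"
        b"   /Subtype /Form\n"
        b"   /BBox [%d %d %d %d]\n"
        b"   /Resources %b\n"
        b"   /Length %d\n"
        b">>\n"
        b"stream\n"
    ) % (obj_num, *bbox, resources_ref, len(content))
    return header + content + b"\nendstream\nendobj\n"


if __name__ == "__main__":
    # A filled rectangle stands in for a real page description.
    print(make_form_xobject(12, b"0 0 100 100 re f", (0, 0, 595, 842), b"13 0 R").decode("ascii"))
```

A page that wants to show this object lists it in its resource dictionary (for example /XObject << /Fm1 12 0 R >>) and paints it from its own content stream with something like q 1 0 0 1 50 50 cm /Fm1 Do Q.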

3. Sequential updates

The PDF standard provides a way of changing the contents of a PDF file by just appending things to the end of the file (the specification calls this an incremental update). This has the advantage that we don't have to worry about any part of the file we don't care about. So we can just put the stuff we need from the imported PDFs at the end of Scribus' PDF and don't have to be concerned about messing up too much. That way, of course, the original image is still in the PDF, but not used any longer. We'll come to this later.
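
As a rough illustration of what gets appended (the PDF specification calls this an incremental update), the sketch below builds the new or replaced objects, a cross-reference section that covers only those objects, and a trailer whose /Prev entry points back to the original cross-reference table. The function name and parameters are hypothetical, every object is assumed to have generation number 0, and the original file is assumed to end with a newline. It is not the code Sapiens uses.

```python
def build_incremental_update(original_size, prev_startxref, root_ref, size, objects):
    """Return the bytes to append to an existing PDF (illustrative only).

    original_size  -- length of the existing file in bytes
    prev_startxref -- the number after 'startxref' in the existing file
    root_ref       -- reference to the document catalog, e.g. b"1 0 R"
    size           -- highest object number in the combined file, plus one
    objects        -- {object number: object body bytes}; an existing number
                      replaces that object, a new number adds one
    """
    out = bytearray()
    offsets = {}
    for num in sorted(objects):
        # Offsets count from the start of the whole file, not of the update.
        offsets[num] = original_size + len(out)
        out += b"%d 0 obj\n" % num + objects[num] + b"\nendobj\n"

    xref_pos = original_size + len(out)
    out += b"xref\n"
    for num in sorted(offsets):
        # One subsection per object; each entry is exactly 20 bytes long.
        out += b"%d 1\n" % num
        out += b"%010d %05d n \n" % (offsets[num], 0)

    out += b"trailer\n<< /Size %d /Root %b /Prev %d >>\n" % (size, root_ref, prev_startxref)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

A reader that understands such updates starts at the last startxref, uses the newest definition of each object, and follows /Prev for everything else. That is why the old raster image can stay in the file without being displayed.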

4. Sapiens

With all of this in mind, I started creating a script called Sapiens that would take a PDF as exported by Scribus and replace the rastered PDF with the real thing. The script is written in Python (the first draft was actually a shell script with lots of grep and head and tail and sed and and and... believe me, it wasn't pretty, and I'm not saying that the Python version is good or nice programming). Obviously, since this script only looks at the PDF file, it has to be told which images to look at and which PDF files to replace them with. Therefore, I had to insert one single line of code into the Scribus source file pdflib_core.cpp (line 6398 in my quite recent svn snapshot) as follows:

            ...
            Seite.ImgObjects[ResNam+"I"+QString::number(ResCount)] = ObjCounter-1;
            ResCount++;
        }
        if (extensionIndicatesPDF(ext)) {PutDoc("%SAPIENS:"+fn+"\n");} // <-- This is the new line
        StartObj(ObjCounter);
        ObjCounter++;
        PutDoc("<<\n/Type /XObject\n/Subtype /Image\n");
        ...

This way, Scribus introduces a comment into the PDF right before actually writing the image object; this comment contains "SAPIENS:" followed by the path and file name of the PDF file that this image was rastered from.
Sapiens now scans the PDF for all occurrences of this comment. For each occurrence, it retrieves the first page of the respective PDF file, turns it into a Form XObject, and replaces the Image XObject with this new Form XObject by means of a sequential update. It also retrieves all other objects that this Form XObject needs and adds them, relabeling them so as not to overwrite any objects that are still needed.
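
The scanning step could look roughly like the sketch below. It is not the actual Sapiens code: the function name find_markers is made up, and it assumes that the object header written right after the comment has the usual "N G obj" form and that the file name can be decoded as UTF-8. A pattern search like this could in principle also hit bytes inside a binary stream, which the sketch does not guard against.

```python
import re

# "%SAPIENS:<path>" on one line, followed directly by the object header
# of the image object ("<number> <generation> obj").
MARKER = re.compile(rb"%SAPIENS:([^\r\n]*)\r?\n(\d+) (\d+) obj")


def find_markers(pdf_bytes):
    """Return a list of (source PDF path, image object number) pairs."""
    return [
        (match.group(1).decode("utf-8", errors="replace"), int(match.group(2)))
        for match in MARKER.finditer(pdf_bytes)
    ]


if __name__ == "__main__":
    sample = (
        b"%PDF-1.4\n"
        b"%SAPIENS:/home/me/figure.pdf\n"
        b"14 0 obj\n<<\n/Type /XObject\n/Subtype /Image\n>>\nendobj\n"
    )
    print(find_markers(sample))
```

Each pair tells the script which object number to redefine in the appended update and which external PDF to take the replacement from.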

5. What works and what doesn't

Sapiens is by no means a ready-to-use tool. There is very little error checking (and errors are only reported as warnings, except for exceptions, which are left to the Python interpreter and not cared for at all) and there are many things Sapiens does not or cannot do:

So far, it doesn't check version numbers. It just embeds the file, and doesn't care if it's embedding a PDF-1.7 into a PDF-1.3 file.
It always takes the first page of the document to be embedded. This is not too hard to change; however, since Scribus itself only picks the first page, we're consistent.
It doesn't work. At least, something goes wrong: I have used Sapiens and ended up with something that looks just the way it should except for one thing: The color gradients were gone when I looked at it with the Adobe Reader. Ghostscript, however, displayed it just right and did not issue any warning. I don't know if this is an error in the AR or (more likely) an error in Sapiens that Ghostscript generously works around.
The page description in a PDF file can be distributed over several content streams, but the XObject description can't (at least not that I know of). So to embed a PDF page that makes use of this feature, we would have to decode all the content streams, concatenate them into one single stream, choose an appropriate encoder (if any) and re-encode the result (or not, probably at the expense of disk space). This is far from what Sapiens can do, so these PDFs cannot be handled.
I am definitely not a PDF guru. I downloaded the PDF specifications (and the Python docs, btw) just for making Sapiens. I only read the PDF-1.4 specifications (as that is the version of my choice), and even for those, I did not go through the whole 978 pages. So chances are, there is a lot I didn't even think about.
To check which objects need to be inserted, Sapiens follows the references from the applicable resource dictionary and content stream dictionary and just retrieves all objects it encounters this way. We should get all necessary objects this way (though obviously I'm not sure about that), but chances are we actually retrieve way too many (e.g. additional structure information that the Scribus PDF wouldn't care about). The only filtering done is that /Parent references are not followed (and there shouldn't be any of those anyway).
I have only tried this on my very own computer running Fedora 8, so the question whether it works anywhere else is just guesswork.
Don't even get me started on encryption, color management, forms, or anything else that might make life just a little bit more complicated.
One last thing to mention: Sapiens doesn't actually append anything to the PDF file; it just creates a file called "SAPIENS_OUT" in the CWD with the additional data that you have to concatenate with your PDF afterwards by doing something like cat mydoc.pdf SAPIENS_OUT > result.pdf or whatever works for you. This is done on purpose: Since, as I'm not getting tired of mentioning, Sapiens should not be considered a user-ready tool, on the one hand I don't want it to mess around with anyone's files; on the other hand, I only want it to be used by people who know what they're doing. By the way, also watch out: If for whatever reason you already have a file called "SAPIENS_OUT" in the current directory, it will be overwritten!
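
The reference-following item in the list above can be pictured as a simple graph walk. The sketch below is only an approximation of that idea, not the Sapiens implementation: it assumes the objects of the imported PDF have already been split into a dictionary from object number to raw object body (a hypothetical data layout), it finds references with a regular expression rather than a real PDF parser, and it would therefore also react to byte sequences inside binary streams that merely look like references.

```python
import re

# An indirect reference looks like "12 0 R"; the \b keeps operators
# such as "RG" from being mistaken for one.
REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")
PARENT = re.compile(rb"/Parent\s+\d+\s+\d+\s+R\b")


def collect_dependencies(objects, start):
    """Return the set of object numbers reachable from 'start'.

    objects -- {object number: raw object body bytes} (assumed layout)
    /Parent references are dropped before scanning, so the walk never
    climbs back up into the page tree of the imported document.
    """
    needed = set()
    stack = [start]
    while stack:
        num = stack.pop()
        if num in needed or num not in objects:
            continue
        needed.add(num)
        body = PARENT.sub(b"", objects[num])
        for match in REFERENCE.finditer(body):
            stack.append(int(match.group(1)))
    return needed
```

Once the set is known, each collected object gets a fresh number above the highest one used in the Scribus PDF, and every reference inside the copied objects is rewritten to match.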

I hope I have made clear that you shouldn't trust Sapiens to do what you want. If, however, it seems to have worked for you, you will want to run the resulting file through something like ps2pdf to create a PDF file that you can feel pretty safe about (this will also fix my gradient problem mentioned above), and that doesn't contain the unnecessary images anymore.
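
If you would rather script the two manual steps (concatenating, then cleaning up with ps2pdf) than type them, something like the following sketch would do. The function names and file names are placeholders; the second function simply calls ps2pdf with an input and an output file and assumes that ps2pdf is installed and on your PATH.

```python
import shutil
import subprocess


def append_update(pdf_path, update_path, result_path):
    """Write pdf_path followed by update_path to result_path.

    Same effect as: cat mydoc.pdf SAPIENS_OUT > result.pdf
    The input files are left untouched; result_path is overwritten.
    """
    with open(result_path, "wb") as result:
        for part in (pdf_path, update_path):
            with open(part, "rb") as source:
                shutil.copyfileobj(source, result)


def clean_with_ps2pdf(result_path, cleaned_path):
    """Re-distill the combined file; raises CalledProcessError on failure."""
    subprocess.run(["ps2pdf", result_path, cleaned_path], check=True)


if __name__ == "__main__":
    append_update("mydoc.pdf", "SAPIENS_OUT", "result.pdf")
    clean_with_ps2pdf("result.pdf", "result_clean.pdf")
```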

6. Here it is

So, if you have read and understood the above, if you understand that I have written Sapiens for my personal use and thus it is not a "doing the work for you so you don't have to worry" tool (or as you might say, Sapiens knows as much about alpha releases as a tea leaf knows about the East India Company), if you understand that you have to alter the actual Scribus source code as mentioned above and thus rebuild Scribus (if you use a precompiled version, sorry: not for you), if you acknowledge that Sapiens should not be considered safe in any way or for any purpose and that it might turn your computer into a heap of crap that is as useless as an XP device driver on Vista, and if you promise to still consider me a nice person if that happens – if all that is true, you may click here.

Benjamin Dumke, scribus@guess-the-domain-from-this-pages-location.de
Updated 2008-03-28