Dear Lazyweb: convert a PDF to high-rez CBR file?

Warning: obscure technical questions ahead!

Hey, Lazyweb, here's one for ya! I've got a 153-page PDF, made up mostly of high-rez raster images (8.5x11, 300DPI) with some vector text (page numbers, a few blocks of text) here and there. I want to turn this into a .cbr file by bursting the PDF into individual PNGs or JPEGs and then RARing them, using Ubuntu and free tools. That's where you all come in: use the comments below to kibitz -- I've been playing with ImageMagick's "convert" tool all day (using lines like "convert -geometry 4414x3123 -density 300x300 -quality 100 pdf:original.pdf[1-153] converted.png") with no success. Either it churns for hours and nothing happens (is there a verbose mode for "convert"?) or the output is really low-rez and crummy.


Discussion

Take a look at this

I would start with GPL Ghostscript and GSview (http://pages.cs.wisc.edu/~ghost/gsview/index.htm) to perform a conversion of each page by hand, picking one with the vector text to test that it works, first - and then your favorite RAR tool.

Take a look at this

I would look at using pdftoppm as the first conversion, then convert the ppm to whatever image format you need.

#3 posted by rgc , June 6, 2008 10:29 AM

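Install rar tools, ghostscript, poppler-utils (which provides pdftoppm, so we can convert to PPM) and ImageMagick (which provides convert):

sudo apt-get install rar unrar gs-esp poppler-utils imagemagick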

Then for each INFILE:

1. Convert to high-resolution ppm files for each page.

pdftoppm -r 300 INFILE.pdf page

2. Convert each PPM file to PNG.

for A in page*.ppm ; do convert "$A" "${A%.ppm}.png" ; done

3. Rar the whole thing up into a CBR file.

rar a OUTFILE.cbr page-*.png

4. Delete our temporary files.

rm page-*

That should do it.
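The four steps above can be rolled into one function. This is only a sketch: it assumes pdftoppm, ImageMagick's convert and rar are all installed, and the names pdf_to_cbr and png_name are made up for illustration. It passes -r 300 explicitly, since pdftoppm otherwise renders at 150 DPI, and it gives each PNG a plain .png suffix.

```shell
#!/bin/sh
# Sketch only: pdf_to_cbr and png_name are hypothetical helper names.

# Map a PPM file name to its PNG counterpart (page-01.ppm -> page-01.png).
png_name() {
    printf '%s\n' "${1%.ppm}.png"
}

# Usage: pdf_to_cbr INFILE.pdf OUTFILE.cbr
pdf_to_cbr() {
    infile=$1
    outfile=$2
    # 1. Render every page to a PPM at 300 DPI.
    pdftoppm -r 300 "$infile" page || return 1
    # 2. Convert each PPM to PNG, deleting the PPM once it has been converted.
    for ppm in page-*.ppm; do
        convert "$ppm" "$(png_name "$ppm")" && rm "$ppm" || return 1
    done
    # 3. Pack the PNGs into the CBR, then 4. delete them.
    rar a "$outfile" page-*.png && rm page-*.png
}
```

Run it in an empty scratch directory, e.g. pdf_to_cbr INFILE.pdf OUTFILE.cbr, because it deletes anything matching page-*.png when it finishes.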

Take a look at this

i know this was written in english....but i'm more confused than tm crs lkng t pht f nkd wmn.

#5 posted by tomic, June 6, 2008 10:33 AM

Whether this is a one-time project or one you plan on repeating, the first time I'd break it up into separate steps:

* Disassemble what you have (one big PDF?) into one-image/page-per-file. These should be "convenient" where "convenient" means easy to code for. Probably disgustingly large, etc.

* Fiddle the page-images as you need.

* Reassemble into your target, whatever a .cbr file is.


I recently broke a mechanically-scanned 600+ page catalog into that many TIFF images, and made a web page organized with hand-typed meta-data (the table of contents, about 75 lines). It wasn't hard, just iterative. I used a Perl script. I'll mail you if you want.

Mine had no text, just images, but the disassembly script could do that. I use brutally stupid means to keep track of data like that. While you're working on page 99 (say), and come across page components

99-1.tiff
99-2.txt
99-3.wav
99-4.tiff
...

For page 99, all the sub-components, in order, are findable. Brute force is a fine tool here,

for ($page = 1; $page <= $LASTPAGE; ++$page) {   # $LASTPAGE: however many pages you have
    @PAGEITEMS = `ls $page-* | sort`;
    # process PAGEITEMS
}


or whatever. I dunno how else to do this but with scripts.


One, have you broken the big PDF into pieces yet?

#6 posted by Dillo, June 6, 2008 10:38 AM

It's probably going to require a custom script.
I'd start with gs(1) and possibly netpbm tools and go from there.
This thread looks like it might offer some ideas:
http://moourl.com/q6wxn

Take a look at this

I don't have anything to add to this, I just came here for the comments because I knew they'd be crazy nerdy. I'm here to bask in it.

#8 posted by tomic, June 6, 2008 10:41 AM

Shoulda done this in the first post.

** Script that does the work; It's not well commented, sorry.

http://wps.com/temp/bustup

** The hand-typed meta file that the script reads to know how to organize things. It looks daunting, but it's easy and the perl script can then assemble the document nicely.

http://wps.com/temp/ordered-pdf-list

With the process broken into steps, you can run it, check the output, if not what you want, delete the temp files, edit the scripts, and try again. That way you don't have to keep track of complicated edits and good files and bad files -- that all goes into the script.

** This is a README I left for myself, but it seems incomplete. I did get all my hints from the URL there though.

http://wps.com/temp/README
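The run-it, check-it, delete-and-retry workflow described above can be sketched as a tiny stage runner. Everything here is hypothetical (the run_stage name, the stages directory, the marker files) and none of it is taken from the bustup script; it only illustrates the idea of keeping the bookkeeping in the script rather than in your head.

```shell
#!/bin/sh
# Sketch only: run_stage and the stages/ marker directory are made-up names.

STAGE_DIR=${STAGE_DIR:-stages}

# Usage: run_stage NAME COMMAND [ARGS...]
# Runs COMMAND unless stage NAME already completed; a marker file records success.
run_stage() {
    name=$1
    shift
    mkdir -p "$STAGE_DIR" || return 1
    if [ -e "$STAGE_DIR/$name.done" ]; then
        echo "skip $name"
        return 0
    fi
    "$@" && : > "$STAGE_DIR/$name.done"
}

# To redo a stage after editing the script, delete its marker, e.g.:
#   rm stages/disassemble.done
```

A failed command leaves no marker behind, so the next run picks up at the stage that broke.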

#9 posted by Dillo, June 6, 2008 10:47 AM

BTW, if your ghostscript, rar or netpbmtools are not up to date, consider doing a backup before doing an 'apt-get upgrade' or 'apt-get install'. Upgrading gs(1) requires a libc6 and linux-image upgrade which will necessitate a reboot at minimum and any sort of futzing with libc6 is never a terribly safe proposition.

#10 posted by SkipF , June 6, 2008 10:49 AM

http://pdfripimage.sourceforge.net/documentation.php#Download
pdfripimage my.pdf FORMAT=TIFF TIFF_COMPRESSION=LZW

here's a code fragment I use for OCR. you should be able to do other graphics formats besides TIFF, or convert the TIFF to something useful.

I hope this helps!

-Skip

Take a look at this

I'd start with a small PDF file first. Figure out the process you need to follow, then bring in the giant, slow-to-process files.
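If there is no small PDF handy, another way to get the same effect is to render only a few pages of the big one. A minimal sketch using Ghostscript's -dFirstPage and -dLastPage options; the render_sample name and the GS_CMD override are made up for illustration, and the other flags are simply the ones that come up elsewhere in this thread.

```shell
#!/bin/sh
# Sketch only: render_sample is a hypothetical name; GS_CMD can be overridden for testing.

# Usage: render_sample INPUT.pdf FIRST LAST
# Renders only pages FIRST..LAST of INPUT.pdf to numbered sample-NNN.png files.
render_sample() {
    "${GS_CMD:-gs}" -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 \
        -dFirstPage="$2" -dLastPage="$3" \
        -sOutputFile=sample-%03d.png "$1"
}

# Example: try pages 1 to 3 before committing to all 153.
#   render_sample original.pdf 1 3
```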

#12 posted by doug117 , June 6, 2008 11:05 AM

One thing that might help -- no promises --

Go [back] to Acrobat 5.0 and there's a command to dump all the images in the PDF file out to a folder.

Quality seems to depend on how the images were or were not compressed when the PDF was built.

Take a look at this

While I haven't had great success with it for PDF conversion, phpThumb is supposed to handle this with ease. Supports caching, various output formats and compression settings, etc:
http://phpthumb.sourceforge.net/

Take a look at this

GhostScript seems to handle pdf conversion quite well.

I just tried this on a bunch of pdf's here on my work machine and they came out quite nice.

gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -sOutputFile=picname%03d.png inputpdf.pdf

Note that it's set to 300 DPI (Big Images!!) and png 16 million colour. The %03d is a printf to number your output file names.

RAR'em up and away you go.

YMMV

Cheers,
Kier

Take a look at this

I have Adobe Acrobat Professional 8.1.2 and you can go File>Export>Images>PNG and it will export each page as a separate PNG.

Take a look at this

I dunno from Comic Book Reader files, but if you have Acrobat Standard (not Acrobat Reader) you can export from PDF to JPG (and other image formats).

1 page = 1 image automatically.

File --> Export --> Image --> JPEG

I hope this helps.

#17 posted by wetzel , June 6, 2008 11:15 AM

Why don't you just use PhotoShop under File - Automate - Multi-Page PDF to PSD. Then use Actions to batch save them all as JPEGs in a new folder. Then Archive into RAR and then change the extension. Wouldn't that work?

#18 posted by BaS , June 6, 2008 11:17 AM

Indeed, or you could alternately go (in Acro 8 Pro) to Advanced->Document Processing->Export All Images As JPEG/TIFF/PNG/JPEG2000. Just remember to go to the settings in that frame, raise Quality from the standard "Medium", set Resolution to what you want instead of "Determine Automatically", and don't exclude images smaller than whatever.

Take a look at this

well done ppl

#20 posted by wetzel , June 6, 2008 11:18 AM

You could use Photoshop under File - Automate - Multi-Page PDF to PSD. Then create an Action and batch process them all as JPEGs in a new folder. Archive as RAR and then rename the extension as CBR. Wouldn't that work?

Take a look at this

I like #14's answer and wish to subscribe to his/her newsletter.

#22 posted by Jord , June 6, 2008 11:22 AM

@#20 I don't think Photoshop will count as opensource...

Take a look at this

Kiergsmith@14: this seems super-promising, but the output is all v. low-rez -- only 477px wide! I'm shooting for 4,000!

#24 posted by Jord , June 6, 2008 11:47 AM

You could try using pdfimages. It is part of the xpdf-utils and also in the poppler utils.

The man page can be found here: http://linux.die.net/man/1/pdfimages

Take a look at this

C'mon, guys! PDF2CBR!!M (get it from wherever you feel most comfortable!)

It does pretty much what others have suggested on this thread, just sorta automatically. Have fun!

Take a look at this

Ooops!!!

Cory, you got me there..

The 'correct' command line is as follows.

gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 -sOutputFile=picname%03d.png inputpdf.pdf

Note the -r300 to set the resolution to 300DPI.

Umm, sorry? I'm still trying to learn how to cut and paste on this !@@#$% mac.

Good luck
Kier
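To see what a given -r value buys you: pixel size is just page size in inches times DPI. A minimal sketch of that arithmetic, assuming the page really is 8.5x11 inches as described in the post; pixels and dpi_for are made-up helper names, and sizes are passed in tenths of an inch so plain integer math is enough.

```shell
#!/bin/sh
# Sketch only: pixels and dpi_for are hypothetical helper names.
# Page sizes are given in tenths of an inch so integer math works (8.5in -> 85).

# Usage: pixels TENTHS_OF_INCH DPI -> pixel length at that resolution
pixels() {
    echo $(( $1 * $2 / 10 ))
}

# Usage: dpi_for TARGET_PIXELS TENTHS_OF_INCH -> resolution needed, rounded down
dpi_for() {
    echo $(( $1 * 10 / $2 ))
}

pixels 85 300     # width of an 8.5in page at -r300
pixels 110 300    # height of an 11in page at -r300
dpi_for 4000 85   # -r value for a roughly 4000px-wide 8.5in page
```

That works out to 2550 by 3300 pixels at -r300, and about -r470 for a page roughly 4,000 pixels wide. Rendering above the 300DPI of the embedded scans only upsamples them, so it makes bigger files without adding detail.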

#27 posted by ordodk , June 6, 2008 11:54 AM

Try what Kiergsmith wrote with two additional parameters: -r300 -sPAPERSIZE=A4
Exchange A4 with legal if that's what you go for... That should yield the correct size images in 300DPI...

Take a look at this

Okay, oops. I dunno where that M came from... Also pdf2cbr only works on image files. I shoulda paid more attention.

Take a look at this

26,27: The file is ripping now and doing VERY well! Thanks everyone!

Cory

Take a look at this

Ordodk@27

I tried a papersize with my test files, but I got some strange cropping issues with portrait vs landscape layouts. I'm pretty sure that gs will just size the data according to the MediaBox setting that PDFs have.

But then again, maybe not. :)

YMMV

Kier

#31 posted by jwz, June 6, 2008 12:15 PM

Dear everyone saying "use GhostScript instead of ImageMagick":

ImageMagick just runs GhostScript to do stuff to PDFs.

Carry on.

Take a look at this

Awww, Ghostscript made BoingBoing! The comments section anyway. :)

ImageMagick just runs GhostScript to do stuff to PDFs.

Yeah, but it's not the front end you want for this. It's fond of rendering every page into memory before writing anything out, which accounts for the endless churning. ImageMagick is great for images. Books and videos, not so much.
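If you do want to stay with ImageMagick, one way around the everything-in-memory behaviour is to run convert once per page. A rough sketch, assuming the page count is known in advance (153 in the original post); convert_pages and the CONVERT_CMD override are made-up names. Note that ImageMagick's page indexes are zero-based, so page 1 is [0].

```shell
#!/bin/sh
# Sketch only: convert_pages is a hypothetical name; CONVERT_CMD can be overridden for testing.

# Usage: convert_pages INPUT.pdf PAGE_COUNT
# Runs one convert process per page so only a single page is rendered at a time.
convert_pages() {
    i=0
    while [ "$i" -lt "$2" ]; do
        # ImageMagick page indexes are zero-based: [0] is page 1.
        out=$(printf 'page-%03d.png' $((i + 1)))
        "${CONVERT_CMD:-convert}" -density 300 "$1[$i]" "$out" || return 1
        i=$((i + 1))
    done
}

# Example:
#   convert_pages original.pdf 153
```

This starts a new process for every page, so it trades some speed for a memory footprint that should stay at roughly one page.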

#33 posted by wetzel , June 6, 2008 1:02 PM

Can I love open source and love my Mac too?

#34 posted by Andrew , June 6, 2008 1:07 PM

Dear JWZ: That's why to use Ghostscript. Because it actually gets it right.

Take a look at this

Anyone got a good recipe for flan?

#36 posted by danegeld , June 6, 2008 1:42 PM

gs -sDEVICE=jpeg -dNOPAUSE -dBATCH -r300x300 -sOutputFile=page_%d.jpg my_book.pdf -c quit

rar a my_book.cbr page_*.jpg

...

but it looks like someone already got there...
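One caveat with page_%d.jpg: the names are not zero-padded, so anything that orders pages by a plain text sort of the filenames will put page 10 before page 2. Whether a given comic reader sorts that way is an assumption worth checking. The sketch below only demonstrates the ordering and runs no PDF tools; unpadded and padded are made-up names.

```shell
#!/bin/sh
# Sketch only: illustrates filename ordering; no PDF tools are run here.

# Unpadded names, as produced by page_%d.jpg, in plain text sort order.
unpadded() {
    printf 'page_%d.jpg\n' 1 2 10 | LC_ALL=C sort
}

# Zero-padded names, as produced by page_%03d.jpg, in plain text sort order.
padded() {
    printf 'page_%03d.jpg\n' 1 2 10 | LC_ALL=C sort
}

unpadded   # page_1.jpg, page_10.jpg, page_2.jpg
padded     # page_001.jpg, page_002.jpg, page_010.jpg
```

Using -sOutputFile=page_%03d.jpg in the gs command sidesteps the problem for a 153-page book.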

#38 posted by spotrh , June 6, 2008 2:24 PM

Cory, you should be aware that there are no free (as in speech, FSF, etc) tools for RAR operations.

The tools in Ubuntu are under a non-free license (which is why we don't have them in Fedora).

I'd highly encourage you to use CBZ format instead (ZIP instead of RAR).
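For the CBZ route, a minimal sketch using Info-ZIP's zip, assuming it is installed; make_cbz and the ZIP_CMD override are made-up names. It stores the images uncompressed, since PNG and JPEG data is already compressed.

```shell
#!/bin/sh
# Sketch only: make_cbz is a hypothetical name; ZIP_CMD can be overridden for testing.

# Usage: make_cbz OUTFILE.cbz IMAGE...
# -0 stores files without recompressing them (PNG and JPEG are already compressed);
# -j drops directory paths so the images sit at the top level of the archive.
make_cbz() {
    out=$1
    shift
    "${ZIP_CMD:-zip}" -0 -j "$out" "$@"
}

# Example:
#   make_cbz my_book.cbz page_*.jpg
```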

Take a look at this

Offtopic, but for those stuck under Windows, check out PDFCreator (http://sourceforge.net/projects/pdfcreator/) which can print anything to PDF, gif, jpeg, etc. It installs as a printer driver. Just change the output options from the default PDF output to jpg and specify your per-page incrementing suffix. Then print your PDF to image files, and package up with your favorite compression tool (7-zip is free on Windows and handles most major compression formats).

#40 posted by Anonymous , June 6, 2008 3:42 PM

Curiously, I've been working on just such a thing (for a completely different purpose) using Poppler.

A trivial C++ program can be written in under 200 lines. Create a Cairo context, initialize the document, page, and (optionally) the jpeg library, then render the page, write to file. It's surprisingly easy.

I don't think I can release what I have as open source ... yet. But I think I will probably go home and do just such a thing later. I'm surprised it never occurred to me that it would be so useful.

As a side note: I've noticed that Ghostscript can produce some particularly nasty-looking output. Poppler, while being an incomplete PDF renderer, produces better results using the Cairo backend.

-- (greyfade)

Take a look at this

@38, Spotrh,

Who cares as long as it *works*?

#42 posted by rogerben , June 6, 2008 5:13 PM

Another note for Windows users: ComicRack, which is basically iTunes for your comic collection. It reads CBR/CBZ/PDF comics natively, and allows you to convert between them freely. (I convert everything to CBZ.)

Take a look at this

I've got a 153-page PDF, made up mostly of high-rez raster images (8.5x11, 300DPI) with some vector text (page numbers, a few blocks of text) here and there. I want to turn this into a .cbr file by bursting the PDF into individual PNGs or JPEGs and then RARing them, using Ubuntu and free tools.

That's the most opaque text I've read in weeks. I'm reminded of the wah-wah-wah sound of adults in Charlie Brown cartoons.

#44 posted by spotrh , June 6, 2008 7:38 PM

@41, Al.

He explicitly said "free tools", and I didn't want him to inadvertently be misled.

I happen to care, and I know that others do as well. After all, Windows "just works".

#45 posted by dargaud , June 8, 2008 2:43 PM

I tried to do like the original poster some time ago, and tried many of the solutions mentioned here and found them lacking: either the quality sucks after jpg extraction, or the extracted images are huge, much bigger than the originals.

Which brings me to the question: when you create a PDF from JPG images, can you extract the _original_ JPEGs, or are they gone and always converted to PBM/PPM or somesuch?
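For what it's worth, pdfimages (mentioned upthread) has a -j option aimed at this case: images stored in the PDF as DCT (JPEG) streams are written out as .jpg files instead of being converted to PPM. A minimal sketch, assuming the tool that built the PDF embedded the JPEGs unchanged rather than re-encoding them; extract_images and the PDFIMAGES_CMD override are made-up names.

```shell
#!/bin/sh
# Sketch only: extract_images is a hypothetical name; PDFIMAGES_CMD can be overridden for testing.

# Usage: extract_images INPUT.pdf PREFIX
# -j saves DCT (JPEG) encoded images as .jpg files without re-encoding them;
# images stored in other encodings still come out as PBM/PGM/PPM.
extract_images() {
    "${PDFIMAGES_CMD:-pdfimages}" -j "$1" "$2"
}

# Example: writes img-000.jpg, img-001.jpg, ... for JPEG-backed images.
#   extract_images my_book.pdf img
```

This pulls out the embedded images only, so any vector text laid over a page is not part of the result.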

Take a look at this

@44, Spotrh,

He didn't say "Free as in speech", he just said "Free." The tools are free in both circumstances (rar or zip). You're being ideological about it, which is pointless.
