Dear Lazyweb: convert a PDF to high-rez CBR file?
Hey, Lazyweb, here's one for ya! I've got a 153-page PDF, made up mostly of high-rez raster images (8.5x11, 300DPI) with some vector text (page numbers, a few blocks of text) here and there. I want to turn this into a .cbr file by bursting the PDF into individual PNGs or JPEGs and then RARing them, using Ubuntu and free tools. That's where you all come in: use the comments below to kibitz -- I've been playing with ImageMagick's "convert" tool all day (using lines like "convert -geometry 4414x3123 -density 300x300 -quality 100 pdf:original.pdf[1-153] converted.png") with no success. Either it churns for hours and nothing happens (is there a verbose mode for "convert"?) or the output is really low-rez and crummy.











I would start with GPL Ghostscript and GSview (http://pages.cs.wisc.edu/~ghost/gsview/index.htm) to perform a conversion of each page by hand, picking one with the vector text to test that it works, first - and then your favorite RAR tool.
I would look at using pdftoppm as the first conversion, then convert the ppm to whatever image format you need.
Install rar tools and ghostscript (pdftoppm, used below, comes with the xpdf or poppler utilities):
sudo apt-get install rar unrar gs-esp
Then for each INFILE:
1. Convert each page to a high-resolution PPM file (pdftoppm defaults to 150 DPI, so ask for 300 explicitly).
pdftoppm -r 300 INFILE.pdf page
2. Convert each PPM file to PNG.
for A in page*.ppm ; do convert "$A" "${A%.ppm}.png" ; done
3. Rar the whole thing up into a CBR file.
rar a OUTFILE.cbr page-*.png
4. Delete our temporary files.
rm page-*
That should do it.
i know this was written in english....but i'm more confused than tm crs lkng t pht f nkd wmn.
Whether this is a one-time project or one you plan on repeating, the first time I'd break it up into separate steps:
* Disassemble what you have (one big PDF?) into one-image/page-per-file. These should be "convenient" where "convenient" means easy to code for. Probably disgustingly large, etc.
* Fiddle the page-images as you need.
* Reassemble into your target, whatever a .cbr file is.
I recently broke a mechanically-scanned 600+ page catalog into that many TIFF images, and made a web page organized with hand-typed meta-data (the table of contents, about 75 lines). It wasn't hard, just iterative. I used a perl script. I'll mail you if you want.
Mine had no text, just images, but the disassembly script could do that. I use brutally stupid means to keep track of data like that. While you're working on page 99 (say), and come across page components
99-1.tiff
99-2.txt
99-3.wav
99-4.tiff
...
For page 99, all the sub-components, in order, are findable. Brute force is a fine tool here,
for ($page = 1; $page <= $LASTPAGE; ++$page) {  # $LASTPAGE is a stand-in; the original loop bound was lost
    @PAGEITEMS = `ls $page-* | sort`;
    # process PAGEITEMS
}
or whatever. I dunno how else to do this but with scripts.
One, have you broken the big PDF into pieces yet?
It's probably going to require a custom script.
I'd start with gs(1) and possibly netpbm tools and go from there.
This thread looks like it might offer some ideas:
http://moourl.com/q6wxn
I don't have anything to add to this, I just came here for the comments because I knew they'd be crazy nerdy. I'm here to bask in it.
Shoulda done this in the first post.
** Script that does the work; It's not well commented, sorry.
http://wps.com/temp/bustup
** The hand-typed meta file that the script reads to know how to organize things. It looks daunting, but it's easy and the perl script can then assemble the document nicely.
http://wps.com/temp/ordered-pdf-list
With the process broken into steps, you can run it, check the output, if not what you want, delete the temp files, edit the scripts, and try again. That way you don't have to keep track of complicated edits and good files and bad files -- that all goes into the script.
** This is a README I left for myself, but it seems incomplete. I did get all my hints from the URL there though.
http://wps.com/temp/README
BTW, if your ghostscript, rar or netpbmtools are not up to date, consider doing a backup before doing an 'apt-get upgrade' or 'apt-get install'. Upgrading gs(1) requires a libc6 and linux-image upgrade which will necessitate a reboot at minimum and any sort of futzing with libc6 is never a terribly safe proposition.
http://pdfripimage.sourceforge.net/documentation.php#Download
pdfripimage my.pdf FORMAT=TIFF TIFF_COMPRESSION=LZW
here's a code fragment I use for OCR. you should be able to do other graphics formats besides TIFF, or convert the TIFF to something useful.
I hope this helps!
-Skip
I'd start with a small PDF file first. Figure out the process you need to follow, then bring in the giant, slow-to-process files.
One thing that might help -- no promises --
Go [back] to Acrobat 5.0 and there's a command to dump all the images in the PDF file out to a folder.
Quality seems to depend on how the images were or were not compressed when the PDF was built.
While I haven't had great success with it for PDF conversion, phpThumb is supposed to handle this with ease. Supports caching, various output formats and compression settings, etc:
http://phpthumb.sourceforge.net/
GhostScript seems to handle pdf conversion quite well.
I just tried this on a bunch of pdf's here on my work machine and they came out quite nice.
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -sOutputFile=picname%03d.png inputpdf.pdf
Note that it's set to 300 DPI (Big Images!!) and png 16 million colour. The %03d is a printf to number your output file names.
RAR'em up and away you go.
YMMV
Cheers,
Kier
I have Adobe Acrobat Professional 8.1.2 and you can go File>Export>Images>PNG and it will export each page as a separate PNG.
I dunno from Comic Book Reader files, but if you have Acrobat Standard (not Acrobat Reader) you can export from PDF to JPG (and other image formats).
1 page = 1 image automatically.
File --> Export --> Image --> JPEG
I hope this helps.
Why don't you just use PhotoShop under File - Automate - Multi-Page PDF to PSD. Then use Actions to batch save them all as JPEGs in a new folder. Then Archive into RAR and then change the extension. Wouldn't that work?
Indeed, or you could alternately go (in Acro 8 Pro) to Advanced->Document Processing->Export All Images As JPEG/TIFF/PNG/JPEG2000. Just remember to open the settings in that frame, raise the quality from the standard "Quality: Medium", set Resolution to what you want instead of "Determine Automatically", and don't exclude images smaller than whatever.
well done ppl
You could use Photoshop under File - Automate - Multi-Page PDF to PSD. Then create an Action and batch process them all as JPEGs in a new folder. Archive as RAR and then rename the extension as CBR. Wouldn't that work?
I like #14's answer and wish to subscribe to his/her newsletter.
@#20 I don't think Photoshop will count as opensource...
Kiergsmith@14: this seems super-promising, but the output is all v. low-rez -- only 477px wide! I'm shooting for 4,000!
You could try using pdfimages. It is part of the xpdf-utils and also in the poppler utils.
The man page can be found here: http://linux.die.net/man/1/pdfimages
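For anyone trying the pdfimages route, here is a minimal sketch. It assumes the poppler or xpdf version of pdfimages is installed; the function name, file names and default prefix are made up for illustration. The -j flag asks pdfimages to write images that are stored as JPEG (DCT) in the PDF straight out as .jpg files instead of converting them to PPM. Note that pdfimages pulls out the embedded raster images only, so vector text such as page numbers will not show up in the output.

```shell
#!/bin/sh
# Sketch: dump the embedded images from a PDF without re-rendering the pages.
# extract_images is a hypothetical wrapper; "img" is an arbitrary default prefix.

# extract_images PDF [PREFIX]
# Writes PREFIX-000.jpg, PREFIX-001.ppm, ... depending on how each image is stored.
extract_images() {
    pdf=$1
    prefix=${2:-img}
    # -j: save DCT (JPEG) encoded images as .jpg files as-is
    pdfimages -j "$pdf" "$prefix"
}
```

Usage would be something like "extract_images original.pdf page", run in an empty directory so the output files are easy to check.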
C'mon, guys! PDF2CBR!!M (get it from wherever you feel most comfortable!)
It does pretty much what others have suggested on this thread, just sorta automatically. Have fun!
Ooops!!!
Cory, you got me there..
The 'correct' command line is as follows.
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 -sOutputFile=picname%03d.png inputpdf.pdf
Note the -r300 to set the resolution to 300DPI.
Umm, sorry? I'm still trying to learn how to cut and paste on this !@@#$% mac.
Good luck
Kier
Try what Kiergsmith wrote with two additional parameters: -r300 -sPAPERSIZE=A4
Exchange A4 with legal if that's what you go for... That should yield the correct size images in 300DPI...
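A quick way to sanity-check the -r value discussed above is to do the arithmetic: pixels = inches x DPI. The sketch below uses the 8.5x11 page size from the original post; the function names are hypothetical, and it assumes awk is available.

```shell
#!/bin/sh
# Sketch: relate page size, DPI and pixel dimensions for Ghostscript's -r flag.
# pixels and dpi_for are hypothetical helper names.

# pixels INCHES DPI -> pixel count along that edge, rounded to the nearest integer
pixels() {
    awk -v inches="$1" -v dpi="$2" 'BEGIN { printf "%d\n", inches * dpi + 0.5 }'
}

# dpi_for TARGET_PIXELS INCHES -> resolution needed to reach the target, rounded up
dpi_for() {
    awk -v px="$1" -v inches="$2" 'BEGIN { r = px / inches; d = int(r); if (d < r) d++; print d }'
}

pixels 8.5 300     # width of an 8.5 inch page at 300 DPI  -> 2550
pixels 11 300      # height of an 11 inch page at 300 DPI  -> 3300
dpi_for 4000 8.5   # DPI needed for a 4000 pixel wide page -> 471
```

So -r300 on a letter-size page should give roughly 2550x3300 pixels, and a 4,000 pixel wide page would need something closer to -r471. Rendering above the resolution of the embedded scans only upsamples them.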
Okay, oops. I dunno where that M came from... Also pdf2cbr only works on image files. I shoulda paid more attention.
26,27: The file is ripping now and doing VERY well! Thanks everyone!
Cory
Orododk@27
I tried a papersize with my test files, but I got some strange cropping issues with portrait vs landscape layouts. I'm pretty sure that gs will just size the data according to the MediaBox setting that PDFs have.
But then again, maybe not. :)
YMMV
Kier
Dear everyone saying "use GhostScript instead of ImageMagick":
ImageMagick just runs GhostScript to do stuff to PDFs.
Carry on.
Awww, Ghostscript made BoingBoing! The comments section anyway. :)
ImageMagick just runs GhostScript to do stuff to PDFs.
Yeah, but it's not the front end you want for this. It's fond of rendering every page into memory before writing anything out, which accounts for the endless churning. ImageMagick is great for images. Books and videos, not so much.
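If the all-pages-in-memory behaviour is what causes the churning, one possible workaround is to hand ImageMagick a single page per invocation. This is only a sketch: the function name, output naming and page count argument are assumptions, and it has not been timed against the 153-page file. ImageMagick's "file.pdf[N]" page index is zero-based, and -density has to come before the input file to affect how the PDF is rasterized.

```shell
#!/bin/sh
# Sketch: render a PDF one page at a time with ImageMagick's convert.
# render_pages is a hypothetical helper; page_NNN.png is an arbitrary naming scheme.

# render_pages PDF PAGE_COUNT
render_pages() {
    pdf=$1
    pages=$2
    i=0
    while [ "$i" -lt "$pages" ]; do
        # Input index is zero-based; output names are one-based and zero-padded
        out=$(printf 'page_%03d.png' "$((i + 1))")
        convert -density 300 "${pdf}[${i}]" "$out"
        i=$((i + 1))
    done
}
```

Usage would be "render_pages original.pdf 153". Each call starts a fresh convert process, so it trades some startup overhead for a small, predictable memory footprint.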
Can I love open source and love my Mac too?
Dear JWZ: That's why to use Ghostscript. Because it actually gets it right.
Anyone got a good recipe for flan?
gs -sDEVICE=jpeg -dNOPAUSE -dBATCH -r300x300 -sOutputFile=page_%d.jpg my_book.pdf -c quit
rar a my_book.cbr page_*.jpg
...
but it looks like someone already got there...
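One thing to watch with the page_%d.jpg pattern above: page_10.jpg sorts before page_2.jpg in a plain alphabetical listing, and not every comic reader sorts naturally. Using %03d in the gs command avoids this up front. If the files already exist, a rename pass along these lines should fix them. The helper names are hypothetical and the sketch assumes names of the form PREFIX_NUMBER.EXT.

```shell
#!/bin/sh
# Sketch: zero-pad page numbers so page_2.jpg sorts before page_10.jpg.
# padded_name and pad_pages are hypothetical helpers.

# padded_name page_7.jpg -> page_007.jpg
padded_name() {
    base=${1%.*}         # page_7
    ext=${1##*.}         # jpg
    num=${base##*_}      # 7
    prefix=${base%_*}    # page
    # Strip leading zeros so printf does not read "008" as octal
    num=$(printf '%s' "$num" | sed 's/^0*\([0-9]\)/\1/')
    printf '%s_%03d.%s\n' "$prefix" "$num" "$ext"
}

# Rename every page_N.jpg in the current directory that is not already padded
pad_pages() {
    for f in page_*.jpg; do
        [ -e "$f" ] || continue    # no matches: the glob stays literal
        new=$(padded_name "$f")
        [ "$f" = "$new" ] || mv -- "$f" "$new"
    done
}
```

Run pad_pages in the directory holding the images before building the archive.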
Cory, you should be aware that there are no free (as in speech, FSF, etc) tools for RAR operations.
The tools in Ubuntu are under a non-free license (which is why we don't have them in Fedora).
I'd highly encourage you to use CBZ format instead (ZIP instead of RAR).
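For the CBZ route, the packaging step is a plain zip with a different extension. A minimal sketch, assuming the Info-ZIP "zip" command is installed; the function names and file names are made up for illustration. The -0 flag stores files without compressing them, which is reasonable here since PNG and JPEG data is already compressed.

```shell
#!/bin/sh
# Sketch: package page images as a CBZ (a ZIP archive with a .cbz extension).
# cbz_name and make_cbz are hypothetical helpers.

# cbz_name my_book.pdf -> my_book.cbz
cbz_name() {
    printf '%s.cbz\n' "${1%.pdf}"
}

# make_cbz PDF_NAME IMAGE...
# Names the archive after the PDF and stores the images uncompressed (-0).
make_cbz() {
    out=$(cbz_name "$1")
    shift
    zip -0 "$out" "$@"
}
```

Usage would be "make_cbz my_book.pdf page_*.png", with the page files named so that the glob lists them in reading order.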
Offtopic, but for those stuck under Windows, check out PDFCreator (http://sourceforge.net/projects/pdfcreator/) which can print anything to PDF, gif, jpeg, etc. It installs as a printer driver. Just change the output options from the default PDF output to jpg and specify your per-page incrementing suffix. Then print your PDF to image files, and package up with your favorite compression tool (7-zip is free on Windows and handles most major compression formats).
Curiously, I've been working on just such a thing (for a completely different purpose) using Poppler.
A trivial C++ program can be written in under 200 lines. Create a Cairo context, initialize the document, page, and (optionally) the jpeg library, then render the page, write to file. It's surprisingly easy.
I don't think I can release what I have as open source ... yet. But I think I will probably go home and do just such a thing later. I'm surprised it never occurred to me that it would be so useful.
As a side note: I've noticed that Ghostscript can produce some particularly nasty-looking output. Poppler, while being an incomplete PDF renderer, produces better results using the Cairo backend.
-- (greyfade)
@38, Spotrh,
Who cares as long as it *works*?
Another note for Windows users: ComicRack, which is basically iTunes for your comic collection. It reads CBR/CBZ/PDF comics natively, and allows you to convert between them freely. (I convert everything to CBZ.)
I've got a 153-page PDF, made up mostly of high-rez raster images (8.5x11, 300DPI) with some vector text (page numbers, a few blocks of text) here and there. I want to turn this into a .cbr file by bursting the PDF into individual PNGs or JPEGs and then RARing them, using Ubuntu and free tools.
That's the most opaque text I've read in weeks. I'm reminded of the wah-wah-wah sound of adults in Charlie Brown cartoons.
@41, Al.
He explicitly said "free tools", and I didn't want him to inadvertently be misled.
I happen to care, and I know that others do as well. After all, Windows "just works".
I tried to do like the original poster some time ago, and tried many of the solutions mentioned here and found them lacking: either the quality sucks after jpg extraction, or the extracted images are huge, much bigger than the originals.
Which brings me to the questions: when you create a pdf from jpg images, can you extract the _original_ jpegs, or are they gone and always converted to pbm/ppm or somesuch ?
@44, Spotrh,
He didn't say "Free as in speech", he just said "Free." The tools are free in both circumstances (rar or zip). You're being ideological about it, which is pointless.