
[Image: "I wear Pythongs"]
Converting web pages into PDFs that don’t look like a pile of junk is annoying. I doubt that was Adobe’s intended use case for the PDF format. And going from a web page to a PDF was not the intention of any web standard, so maybe this should be expected.
But like a lot of things in life, we are trying anyway.
There are tools for printing web pages to PDF, but if you don't want to use your browser's default print-to-PDF and want the 'good stuff' without adware, you typically need to pay for licensed software.
My beef is not with paying for software; I pay for software all the time, and that's fine. The real beef is that this software is usually not complicated. If you just want a decent-looking PDF with images, it can be written in about 100 lines (or fewer) of Python by someone who doesn't really know Python.
Or, as I like to call it, 'Pythong': garble-gook copy-pasted from Stack Overflow and Chat Jippity.
The Issue
I needed to make a bunch of decent-looking, single-page PDFs from some articles written by a client for their portfolio. Some of the files needed minimal modification; others needed a lot.
They no longer have the original documents, but the articles are still up on their old website in roughly the format they want for the portfolio, which also needs to be portable and offline. As a final constraint, the old website is no longer maintained and may be taken down in the near future. In other words, the client has no control over the site and can't simply take custody of the content.
Tried Some Stuff
I tried downloading the text, HTML, CSS, and JavaScript to my local machine for manual manipulation, then printing that to a PDF. It still looked like crap, and it was taking way too much time to produce several unique documents, each with many pages of content, across several different websites. The websites all had lots of dynamic junk to wade through, and I just didn't have the patience.
So I tried several browser extensions and various other existing products (some worked fine), but they would embed links to their website or product in the PDF, making it virtually useless for a portfolio unless I paid them in green.
I'm not saying a solution doesn't exist. I'm saying that the many solutions I tried ranged from $5+ per month subscriptions to license fees of several hundred dollars, and none of the options (free or not) allowed for all the desired modifications at the same time.
The Solution
As you can imagine, I was pulling out my credit card when I had a stupid idea.
Could I just make this using Pythong? Turns out, yes you can. Not only that, but people do this kind of junk with Pythong all the time.
For my own proof of concept, I used Selenium to take screenshots of the page and then slapped it into a PDF. That’s it. It’s so stupid and so easy, but it works. For the most part.
But after I got my 100 lines of Pythong working, I wanted to go further, so I spun up a Java project (sad). Deep down, I was more interested in using Java to merge images and produce custom files for other projects and prototypes, but I went on anyway.
I also wanted to know if I could embed all the dependencies Selenium needs to run a WebDriver and use jpackage to spit out an MSI installer. Not sure you should, but it turns out, yes you can! (link to repo example coming soon)
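For the curious, the jpackage invocation looks roughly like this; the name, paths, and jar here are placeholders, not my exact build setup:
> jpackage --type msi --name SlapAPdf --input build/libs --main-jar slapapdf.jar --main-class SlapAPdf --win-console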
Then finally, I could run the generated PDF from my junk code through a PDF editor to make final modifications if needed. That way I only needed to pay for ONE SINGULAR PDF EDITOR TOOL FROM A REPUTABLE COMPANY.
The Code
WARNING: I realize the code quality is not great. This is not a product or something anyone should run in a production environment as-is. If you take this approach, please consider that some websites will recognize you are running a tool like Selenium and may ban you.
Tools and their versions used in the project:
implementation 'org.apache.pdfbox:pdfbox:3.0.4'
implementation 'org.seleniumhq.selenium:selenium-java:4.28.1'
implementation 'org.seleniumhq.selenium:selenium-chrome-driver:4.28.1'
Chrome Driver Wrapper
I didn't think I would get the app to work on the first try, so I made a wrapper class with all the junk JavaScript procedure code. But Selenium worked on the first try (Selenium 4.6+ bundles Selenium Manager, which downloads a matching chromedriver automatically, so there's no driver setup here); you could write all of this without a separate class if you find it confusing.
class DriverWrapper implements AutoCloseable {

    // JavaScript snippets for measuring, scrolling, and cleaning up the page
    static final String GET_SCROLLED_Y_OFFSET_JS = "return window.pageYOffset + window.innerHeight;";
    static final String GET_SCROLL_HEIGHT_JS = "return document.body.scrollHeight;";
    static final String SCROLL_PAGE_HEIGHT_JS = "window.scrollTo(window.pageYOffset, window.pageYOffset + window.innerHeight);";
    static final String SCROLL_TOP_JS = "window.scrollTo(0, 0);";
    static final String ZOOM_JS = "document.body.style.zoom = '%s';";
    static final String HIDE_SCROLLBAR_BODY_JS = "document.body.style.overflow = 'hidden';";
    static final String HIDE_SCROLLBAR_ELE_JS = "document.documentElement.style.overflow = 'hidden';";
    static final String HIDE_TAGS_JS = """
            const elements = document.getElementsByTagName("%s");
            for (var i = 0; i < elements.length; i++) {
                elements[i].style.visibility = 'hidden';
                elements[i].style.display = 'none';
                elements[i].style.height = 0;
            };""";
    static final String HIDE_FIXED_JS = """
            const elements = document.querySelectorAll('%s');
            elements.forEach((element) => {
                if (window.getComputedStyle(element).position === 'fixed') {
                    element.style.visibility = 'hidden';
                    element.style.display = 'none';
                    element.style.height = 0;
                }
            });""";

    private final WebDriver driver;
    private final JavascriptExecutor jsExecutor;
    private final TakesScreenshot takesScreenshot;

    DriverWrapper() {
        this(new Dimension(1920, 1080));
    }

    DriverWrapper(Dimension resolution) {
        ChromeOptions options = new ChromeOptions()
                .setExperimentalOption("excludeSwitches", Collections.singletonList("enable-automation"))
                .setExperimentalOption("useAutomationExtension", false)
                .addArguments(
                        "--incognito",
                        "--no-sandbox",
                        "--disable-blink-features=AutomationControlled",
                        "--disable-dev-shm-usage",
                        "--disable-extensions",
                        "--disable-popup-blocking",
                        "--disable-smooth-scrolling",
                        "disable-infobars"
                );
        driver = new ChromeDriver(options);
        driver.manage().timeouts().pageLoadTimeout(Duration.ofSeconds(30));
        driver.manage().window().setSize(resolution);
        jsExecutor = (JavascriptExecutor) driver;
        takesScreenshot = (TakesScreenshot) driver;
    }

    int getPageHeight() {
        Object data = executeJs(GET_SCROLL_HEIGHT_JS);
        return convertToInt(data);
    }

    int getYOffset() {
        Object data = jsExecutor.executeScript(GET_SCROLLED_Y_OFFSET_JS);
        return convertToInt(data);
    }

    DriverWrapper scrollPageDown() {
        executeJs(SCROLL_PAGE_HEIGHT_JS);
        return this;
    }

    Object executeJs(String javaScript) {
        return jsExecutor.executeScript(javaScript);
    }

    File takeScreenshot() {
        return takesScreenshot.getScreenshotAs(OutputType.FILE);
    }

    DriverWrapper implicitWait(Duration duration) {
        driver.manage().timeouts().implicitlyWait(duration);
        return this;
    }

    DriverWrapper loadUrl(String url) {
        driver.get(url);
        return this;
    }

    // executeScript returns a Long for integer JS numbers; -1 is the
    // "no usable value" sentinel the callers check for
    int convertToInt(Object obj) {
        int result = -1;
        try {
            if (obj != null) {
                String resultString = String.valueOf(obj);
                result = Integer.parseInt(resultString);
            }
        } catch (Exception ex) {
            // swallow and fall through to the -1 sentinel
        }
        return result;
    }

    @Override
    public void close() {
        // quit() ends the session and shuts down the chromedriver process;
        // close() alone would only close the window and leave the driver running
        driver.quit();
    }
}
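Used on its own, the wrapper's fluent API reads like this. A throwaway example, not part of the actual app flow; the URL is a placeholder:
// Throwaway usage example to show the fluent flow
class DriverWrapperDemo {
    public static void main(String[] args) {
        try (DriverWrapper wrapper = new DriverWrapper()) {
            java.io.File screenshot = wrapper
                    .loadUrl("https://example.com")
                    .implicitWait(java.time.Duration.ofSeconds(3))
                    .takeScreenshot();
            System.out.println("Screenshot saved to " + screenshot.getAbsolutePath());
        }
    }
}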
The Screenshot Loop of Death
Next, Selenium needs to scroll down the page and take screenshots at the appropriate intervals. I resorted to Thread.sleep() because the duration the app needs to wait depends loosely on your connection speed to the site you are scraping. Just know there are better ways to wait for the right moment before executing the next action.
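For instance, Selenium's explicit waits can poll for a condition instead of sleeping a fixed amount of time. A minimal sketch, assuming you expose the underlying WebDriver from the wrapper above (the WaitUtil class is my own invention):
import java.time.Duration;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.WebDriverWait;

class WaitUtil {
    // Polls the browser until document.readyState reports 'complete',
    // giving up after 10 seconds, instead of a blind Thread.sleep()
    static void waitForPageLoad(WebDriver driver) {
        new WebDriverWait(driver, Duration.ofSeconds(10))
                .until(d -> "complete".equals(
                        ((JavascriptExecutor) d).executeScript("return document.readyState;")));
    }
}
Back to the loop itself: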
class ScreenshotMaker {

    private static final float DEFAULT_SCALE = 1;

    List<BufferedImage> screenshotAll(DriverWrapper driver,
                                      String url) throws IOException, InterruptedException {
        return screenshotAll(driver, url, DEFAULT_SCALE);
    }

    List<BufferedImage> screenshotAll(DriverWrapper driver,
                                      String url,
                                      float scale) throws IOException, InterruptedException {
        driver.loadUrl(url).implicitWait(Duration.ofSeconds(3));
        Thread.sleep(2100);
        formatPageForScreenshots(driver, scale);
        Thread.sleep(1100);
        int pageHeight = driver.getPageHeight();
        int lastYOffset = 0;
        int yOffset = 0;
        int crop = 0;
        List<File> tempImageFiles = new ArrayList<>();
        // Screenshot, scroll one viewport, repeat until the bottom of the page
        while ((pageHeight > 0) && (yOffset >= 0) && (yOffset != pageHeight)) {
            tempImageFiles.add(driver.takeScreenshot());
            driver.scrollPageDown().implicitWait(Duration.ofSeconds(7));
            Thread.sleep(910);
            yOffset = driver.getYOffset();
            if (lastYOffset == yOffset) {
                // The scroll didn't move; assume we're stuck and bail out
                break;
            } else {
                // How far the last scroll actually advanced; used to crop the final image
                crop = yOffset - lastYOffset;
                lastYOffset = yOffset;
            }
            if (yOffset == pageHeight) {
                // We've hit the bottom; grab the last (partial) viewport
                tempImageFiles.add(driver.takeScreenshot());
                Thread.sleep(910);
            }
        }
        return convertFilesToBufferedImages(tempImageFiles, crop);
    }

    List<BufferedImage> convertFilesToBufferedImages(List<File> tempImageFiles,
                                                     int lastImageCrop) throws IOException {
        List<BufferedImage> bufferedImageFiles = new ArrayList<>();
        for (int index = 0; index < tempImageFiles.size(); index++) {
            File tempImageFile = tempImageFiles.get(index);
            BufferedImage image = ImageIO.read(tempImageFile);
            // The last screenshot overlaps the previous one because the final scroll
            // is shorter than a full viewport, so keep only its bottom lastImageCrop pixels
            if ((index == tempImageFiles.size() - 1)) {
                if (!((lastImageCrop > 0) && (image.getHeight() > lastImageCrop))) {
                    System.out.println("Crop failed");
                } else {
                    image = image.getSubimage(
                            0,
                            image.getHeight() - lastImageCrop,
                            image.getWidth(),
                            lastImageCrop);
                }
            }
            bufferedImageFiles.add(image);
            tempImageFile.delete();
        }
        return bufferedImageFiles;
    }

    void formatPageForScreenshots(DriverWrapper driver, double scale) {
        // Hide scrollbars, sticky navs/headers, and fixed overlays so they don't
        // repeat in every screenshot, then apply the zoom and scroll back to the top
        driver.executeJs(DriverWrapper.HIDE_SCROLLBAR_BODY_JS);
        driver.executeJs(DriverWrapper.HIDE_SCROLLBAR_ELE_JS);
        driver.executeJs(String.format(DriverWrapper.HIDE_TAGS_JS, "nav"));
        driver.executeJs(String.format(DriverWrapper.HIDE_TAGS_JS, "header"));
        driver.executeJs(String.format(DriverWrapper.HIDE_FIXED_JS, "div"));
        driver.executeJs(String.format(DriverWrapper.HIDE_FIXED_JS, "section"));
        driver.executeJs(String.format(DriverWrapper.ZOOM_JS, scale));
        driver.implicitWait(Duration.ofSeconds(7));
        driver.executeJs(DriverWrapper.SCROLL_TOP_JS);
        driver.implicitWait(Duration.ofSeconds(5));
    }
}
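To make the cropping concrete with some made-up numbers: suppose the viewport is 1080px tall and the page is 2500px. The first two screenshots cover 0–1080 and 1080–2160, but the last scroll can only advance 340px (to offset 1420), so the final screenshot covers 1420–2500 and overlaps the previous one by 740px. Cropping it to its bottom 340 pixels (crop = 2500 − 2160) keeps only the new content.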
The PDF Maker
Once all the screenshots are taken, we need to put them back together into a PDF file in the right order and at the right scale. I used Apache PDFBox, but there are several other libraries that would work as well, such as iText.
public class PdfMaker {

    public void createPdfFromImages(List<BufferedImage> images, float pageScaleToImage)
            throws IOException {
        try (PDDocument pdfDocument = new PDDocument()) {
            addImages(pdfDocument, pageScaleToImage, images);
            pdfDocument.save(".\\SlapAPdfOut-" + UUID.randomUUID() + ".pdf");
        }
    }

    void addImages(PDDocument pdDocument, float pageScaleToImage, List<BufferedImage> images)
            throws IOException {
        // One tall page sized to hold every screenshot stacked vertically
        final Rectangle imageDimension = getPdfImageDimension(images);
        final float scaledWidth = ((float) imageDimension.getWidth()) * pageScaleToImage;
        final PDRectangle pdRectangle = new PDRectangle(
                scaledWidth, ((float) imageDimension.getHeight()) * pageScaleToImage);
        final File tempImageFile = new File(".\\" + UUID.randomUUID() + ".jpeg");
        final PDPage pdPage = new PDPage(pdRectangle);
        pdDocument.addPage(pdPage);
        // PDF coordinates start at the bottom-left corner, so start at the
        // top of the page and work downward as each image is drawn
        float offsetY = pdPage.getMediaBox().getHeight();
        try (PDPageContentStream pdPageStream = new PDPageContentStream(pdDocument, pdPage)) {
            for (BufferedImage image : images) {
                float dynamicHeight = image.getHeight() * pageScaleToImage;
                if (ImageIO.write(image, "jpeg", tempImageFile)) {
                    PDImageXObject pdImage = PDImageXObject
                            .createFromFileByContent(tempImageFile, pdDocument);
                    pdPageStream.drawImage(
                            pdImage,
                            0,
                            offsetY - dynamicHeight,
                            scaledWidth,
                            dynamicHeight);
                }
                offsetY -= dynamicHeight;
            }
        } finally {
            if (!tempImageFile.delete()) {
                System.out.println("Unable to delete temp image file");
            }
        }
    }

    Rectangle getPdfImageDimension(List<BufferedImage> images) {
        // Total height is the sum of all screenshots; width is the widest one
        int height = 0;
        int width = 0;
        for (BufferedImage image : images) {
            height += image.getHeight();
            if (width < image.getWidth()) {
                width = image.getWidth();
            }
        }
        return new Rectangle(width, height);
    }
}
Demo Usage
The main demo class lets you pass in a list of URLs to be scanned. With jpackage, this class let me build a simple installable command-line application that accepts multiple URLs at once with commands like
> slapapdf https://sitetoscrap1.com https://sitetoscrap2.com ...
Using better code and/or a third-party command-line parsing library, you could easily override the hardcoded parameters with flags; a sketch of that follows the demo class below.
public class SlapAPdf {

    private static final Dimension DEFAULT_RESOLUTION_OVERRIDE = new Dimension(1200, 1080);
    private static final float DEFAULT_SCALE_OVERRIDE = 1.4f;
    private static final float DEFAULT_PDF_SCALE_OVERRIDE = 0.4f;

    // TODO: this is just a DEMO
    public static void main(String... urls) {
        if (urls == null || urls.length == 0 || urls[0] == null) {
            System.out.println("empty arguments");
            return;
        }
        try (final DriverWrapper driver = new DriverWrapper(DEFAULT_RESOLUTION_OVERRIDE)) {
            final ScreenshotMaker screenshotMaker = new ScreenshotMaker();
            final PdfMaker pdfMaker = new PdfMaker();
            System.out.println("Started processing, don't touch browser while scanning. Or modify this app to run headless");
            for (String url : urls) {
                if (!isValid(url)) continue;
                System.out.println("Scanning " + url);
                List<BufferedImage> images = screenshotMaker
                        .screenshotAll(driver, url, DEFAULT_SCALE_OVERRIDE);
                if (images == null || images.isEmpty()) {
                    System.out.println("No valid data taken from page for PDF file creation");
                    continue;
                }
                pdfMaker.createPdfFromImages(images, DEFAULT_PDF_SCALE_OVERRIDE);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        System.out.println("DONE");
    }

    // A URL is valid if it's non-blank and java.net.URL can parse it
    static boolean isValid(String url) {
        boolean isValid = true;
        try {
            if (url == null || url.isBlank()) isValid = false;
            else new URL(url);
        } catch (Exception ex) {
            isValid = false;
        }
        if (!isValid) System.out.println("Invalid URL " + url);
        return isValid;
    }
}
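And here is a minimal sketch of the flag idea mentioned above, using picocli (assuming you add the picocli dependency). The SlapAPdfCli class, flag names, and defaults are my own invention for illustration, not part of the original app:
import java.util.List;
import java.util.concurrent.Callable;
import picocli.CommandLine;
import picocli.CommandLine.Command;
import picocli.CommandLine.Option;
import picocli.CommandLine.Parameters;

// Hypothetical flag-based front end for the demo
@Command(name = "slapapdf", mixinStandardHelpOptions = true)
class SlapAPdfCli implements Callable<Integer> {

    @Parameters(description = "URLs to scan")
    List<String> urls;

    @Option(names = "--zoom", description = "Browser zoom applied before screenshots")
    float zoom = 1.4f;

    @Option(names = "--pdf-scale", description = "Scale of screenshots on the PDF page")
    float pdfScale = 0.4f;

    @Override
    public Integer call() {
        // A real version would pass zoom/pdfScale into ScreenshotMaker and
        // PdfMaker instead of relying on the hardcoded DEFAULT_*_OVERRIDE constants
        System.out.printf("Would scan %s at zoom %.2f, PDF scale %.2f%n", urls, zoom, pdfScale);
        return 0;
    }

    public static void main(String[] args) {
        System.exit(new CommandLine(new SlapAPdfCli()).execute(args));
    }
}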
Final Thoughts
I have the code in a GitHub repo here if you want to look at it further. Maybe someday Microsoft will web scrape my crappy scraper code and I’ll see it show up in someone else’s code base so I can laugh at them.
I wish I had the time (and cared enough) to make a better app, but I just don't. I typically use Java and Pythong to make prototypes. Then, if the result makes sense, I flesh out something more substantial.
That’s not going to happen here.
But it was a decent learning project for understanding how to manipulate images in Java, trying out some different PDF libraries for Java such as PDFBox, and doing some basic web scraping with Selenium.