{"id":64271,"date":"2024-09-09T07:27:20","date_gmt":"2024-09-09T01:57:20","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=64271"},"modified":"2024-09-09T10:51:05","modified_gmt":"2024-09-09T05:21:05","slug":"pdf-image-extraction-and-validation-with-playwright-and-sharp","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/pdf-image-extraction-and-validation-with-playwright-and-sharp\/","title":{"rendered":"PDF Image Extraction and Validation with Playwright and sharp"},"content":{"rendered":"<p>PDFs store text and images while preserving their formatting across platforms and devices. In software testing, it is often critical to verify that a PDF contains the right images and that their quality is maintained. Verifying these images manually is slow and error-prone, especially with large PDFs or many test cases. Automating the check reduces manual effort and lowers the chance of missing a discrepancy.<\/p>\n<p>The extraction and validation of images from PDFs can be automated with the help of sharp, a Node.js image processing library. Coupled with Playwright, an end-to-end testing framework, it can be used to extract images from PDFs and compare them to expected images.<\/p>\n<h2><strong><br \/>\nStep 1: Set Up the Environment<\/strong><\/h2>\n<p><strong>Prerequisite<\/strong>:<\/p>\n<p>Node.js must be installed on the system before installing the required libraries. 
Install the following libraries after installing Node.js:<\/p>\n<p><strong>sharp-pdf<\/strong>: A library that extracts images from PDFs as sharp objects (it provides the PDF.sharpsFromPdf function used below).<br \/>\n<strong>sharp<\/strong>: An image processing library used for image manipulation.<\/p>\n<p>Install them with the following command in the terminal (fs and path are built into Node.js and do not need to be installed):<br \/>\nnpm install sharp-pdf sharp<\/p>\n<h2><strong><br \/>\nStep 2: Extracting Images from the PDF<\/strong><\/h2>\n<p>Once the libraries are installed, we can use sharp-pdf to parse the PDF and work with its content. The code below extracts the images from a PDF file.<\/p>\n<pre>const fs = require('fs');\r\nconst PDF = require('sharp-pdf');\r\nconst sharp = require('sharp');\r\nconst path = require('path');<\/pre>\n<pre>async function extractImageFromPdf(pdfPath) {\r\n\u00a0 const pathToExtract = \".\/extractedImg\/\"; \/\/ folder in the project where all images from the PDF are written\r\n\u00a0 try {\r\n\u00a0 \u00a0 fs.mkdirSync(pathToExtract, { recursive: true }); \/\/ make sure the output folder exists\r\n\u00a0 \u00a0 const images = await PDF.sharpsFromPdf(pdfPath);\r\n\u00a0 \u00a0 const ext = '.png'; \/\/ other extensions can be chosen conditionally\r\n\u00a0 \u00a0 \/\/ Wait for every file to be written before returning\r\n\u00a0 \u00a0 await Promise.all(images.map(({ image, name }) =&gt;\r\n\u00a0 \u00a0 \u00a0 image.toFile(path.join(pathToExtract, `${name}${ext}`))));\r\n\r\n\u00a0 \u00a0 \/\/ Progress events (optional second pass that only logs progress)\r\n\u00a0 \u00a0 await PDF.sharpsFromPdf(pdfPath, {\r\n\u00a0 \u00a0 \u00a0 handler(event, data) {\r\n\u00a0 \u00a0 \u00a0 \u00a0 if (event === 'loading') {\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 console.log('Loading PDF:', (data.loaded \/ data.total) * 100);\r\n\u00a0 \u00a0 \u00a0 \u00a0 } else if (event === 'loaded') {\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 console.log('PDF loaded');\r\n\u00a0 \u00a0 \u00a0 \u00a0 } else if (event === 'image' || event === 'skip' || event === 'error') {\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 console.log('Parsing images:', (data.pageIndex \/ data.pages) * 100);\r\n\u00a0 \u00a0 \u00a0 \u00a0 } else if (event === 'done') {\r\n\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 console.log('Done');\r\n\u00a0 \u00a0 \u00a0 \u00a0 }\r\n\u00a0 \u00a0 \u00a0 },\r\n
\u00a0 \u00a0 });\r\n\u00a0 \u00a0 return pathToExtract;\r\n\u00a0 } catch (error) {\r\n\u00a0 \u00a0 console.error('Error extracting images from PDF:', error);\r\n\u00a0 \u00a0 throw error; \/\/ rethrow so the calling test fails instead of continuing\r\n\u00a0 }\r\n}\r\n\r\n<\/pre>\n<p><strong>In this function:<\/strong><\/p>\n<ul>\n<li>PDF.sharpsFromPdf(pdfPath) extracts the images from the given PDF and resolves to an array of objects with image, name, and channels properties. image is a sharp object that lets you manipulate and save the image; name is the name assigned to the image, typically derived from the page number or another identifier within the PDF; channels is the number of color channels in the image (not used in the function above).<\/li>\n<li>Promise.all with images.map saves each extracted image to the specified directory with a .png extension using image.toFile, and waits until all files have been written.<\/li>\n<li>A second call to PDF.sharpsFromPdf is made, this time with a handler function that tracks progress. The handler logs stages such as loading, parsing images, and completion. This call parses the PDF again, so it can be dropped if progress output is not needed.<\/li>\n<li>If successful, the function returns the path where the images were extracted. Otherwise, the error is logged and rethrown so that the calling test fails.<\/li>\n<\/ul>\n<h2><strong>Step 3: Comparing Extracted Images with Expected Images<\/strong><\/h2>\n<p>Once the images are extracted from the PDF, we can compare them with the expected images to verify their accuracy. 
The sharp library is used to perform a pixel-level comparison.<\/p>\n<p>Here\u2019s an example of a comparison function:<\/p>\n<pre>async function compareImages(expectedImgPath, extractedImgFolderPath) {\r\n\u00a0 let flag = false;\r\n\u00a0 try {\r\n\u00a0 \u00a0 \/\/ Load the expected image and convert it to a 500x500 raw RGBA buffer\r\n\u00a0 \u00a0 const img1 = await sharp(expectedImgPath).resize(500, 500).ensureAlpha().raw().toBuffer({ resolveWithObject: true });\r\n\u00a0 \u00a0 const { data: data1, info: info1 } = img1;\r\n\r\n\u00a0 \u00a0 \/\/ Read the directory and filter image files\r\n\u00a0 \u00a0 const files = fs.readdirSync(extractedImgFolderPath);\r\n\r\n\u00a0 \u00a0 const imageFiles = files.filter(file =&gt; \/^img_p3_\\d+\\.png$\/i.test(file)); \/\/ restrict search to images starting with img_p3\r\n\r\n\u00a0 \u00a0 for (const file of imageFiles) {\r\n\u00a0 \u00a0 \u00a0 const filePath = path.join(extractedImgFolderPath, file);\r\n\r\n\u00a0 \u00a0 \u00a0 \/\/ Load and process the extracted image the same way\r\n\u00a0 \u00a0 \u00a0 const img2 = await sharp(filePath).resize(500, 500).ensureAlpha().raw().toBuffer({ resolveWithObject: true });\r\n\u00a0 \u00a0 \u00a0 const { data: data2, info: info2 } = img2;\r\n\r\n\u00a0 \u00a0 \u00a0 \/\/ Check if dimensions match\r\n\u00a0 \u00a0 \u00a0 if (info1.width !== info2.width || info1.height !== info2.height) {\r\n\u00a0 \u00a0 \u00a0 \u00a0 console.log(`Image dimensions do not match for file: ${file}`);\r\n\u00a0 \u00a0 \u00a0 \u00a0 continue;\r\n\u00a0 \u00a0 \u00a0 }\r\n\r\n\u00a0 \u00a0 \u00a0 \/\/ Calculate Mean Squared Error (MSE) over all channel values\r\n\u00a0 \u00a0 \u00a0 let mse = 0;\r\n\u00a0 \u00a0 \u00a0 for (let i = 0; i &lt; data1.length; i++) {\r\n\u00a0 \u00a0 \u00a0 \u00a0 mse += (data1[i] - data2[i]) ** 2;\r\n\u00a0 \u00a0 \u00a0 }\r\n\u00a0 \u00a0 \u00a0 mse \/= data1.length;\r\n\r\n\u00a0 \u00a0 \u00a0 if (mse &lt; 550) {\r\n\u00a0 \u00a0 \u00a0 \u00a0 flag = true;\r\n\u00a0 \u00a0 \u00a0 \u00a0 console.log(`Image ${file} matches with the given image.`);\r\n\u00a0 \u00a0 \u00a0 \u00a0 break;\r\n\u00a0 \u00a0 \u00a0 } else {\r\n\u00a0 \u00a0 \u00a0 \u00a0 console.log(`Image ${file} does not match.`);\r\n\u00a0 \u00a0 \u00a0 }\r\n\u00a0 \u00a0 }\r\n\u00a0 } catch (error) {\r\n\u00a0 \u00a0 console.error('Error comparing images:', error);\r\n\u00a0 }\r\n\u00a0 return flag;\r\n}<\/pre>\n<p><strong>In this code:<\/strong><\/p>\n<ul>\n<li>The function takes two parameters: expectedImgPath, the path to the reference image, and extractedImgFolderPath, the directory containing the images to compare against the reference image.<\/li>\n<li>The function uses the sharp library to load, resize, and convert the 
images to a raw pixel buffer. The ensureAlpha() function ensures that the images have an alpha channel, even if they don\u2019t originally.<\/li>\n<li>The function reads the directory at extractedImgFolderPath and filters the files using a regular expression (\/^img_p3_\\d+\\.png$\/i). This regex restricts the comparison to images named according to a pattern (e.g., &#8220;img_p3_123.png&#8221;) that indicates they belong to page 3.<\/li>\n<li>For each image in the folder, the function checks whether its dimensions match the reference image (both are resized to 500 x 500, so this check is a safeguard). If they do, it calculates the Mean Squared Error (MSE) between the images&#8217; pixel data.<br \/>\nMSE is the average squared difference between corresponding pixel values, so lower values mean more similar images. If the MSE is below a threshold (550 in this case), the images are considered a match. The threshold can be tuned to make the comparison stricter or more lenient.<\/li>\n<li>If a match is found (MSE &lt; 550), the flag is set to true, and the loop breaks. The function returns the flag, and the test asserts that it is true, meaning at least one image matched.<\/li>\n<\/ul>\n<h2><strong><br \/>\nStep 4: Automating the Process with Playwright<\/strong><\/h2>\n<p>To take it a step further, we can integrate this into a broader testing framework like Playwright. Here\u2019s how you could use Playwright to automate the verification process:<\/p>\n<pre>const { test, expect } = require('@playwright\/test');\r\n\/\/ Hypothetical module path: adjust to wherever the two functions above are defined and exported\r\nconst { extractImageFromPdf, compareImages } = require('.\/pdfImageUtils');\r\n\r\ntest('Verify images in PDF', async () =&gt; {\r\n\u00a0 const pdfPath = 'path\/to\/your\/pdf\/document.pdf';\r\n\u00a0 const expectedImgPath = 'path\/to\/expected\/image.png'; \/\/ a single reference image file\r\n\r\n\u00a0 \/\/ extractImageFromPdf returns the folder the images were written to\r\n\u00a0 const extractedImagesFolder = await extractImageFromPdf(pdfPath);\r\n\u00a0 const imageCompareFlag = await compareImages(expectedImgPath, extractedImagesFolder);\r\n\r\n\u00a0 expect(imageCompareFlag).toBeTruthy();\r\n});<\/pre>\n<h2>\n<strong>Conclusion<\/strong><\/h2>\n<p>One effective way to automate visual content validation is to extract images from a PDF and compare them with expected images. 
A testing pipeline can easily incorporate this procedure, helping ensure that your PDFs include the right images. Playwright and libraries like sharp-pdf and sharp make this automation straightforward, giving you confidence in the quality and consistency of your visual content.<\/p>\n<h2>References:<\/h2>\n<p>https:\/\/www.npmjs.com\/package\/sharp-pdf<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Text and images can be stored in PDFs and have their formatting preserved across platforms and devices. There are use cases in software testing where it&#8217;s critical to make sure the right images are included in PDFs and their quality is maintained. It can take a while and be prone to human mistakes to manually [&hellip;]<\/p>\n","protected":false},"author":947,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":529},"categories":[5880],"tags":[6292,5464,6286,6293],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/64271"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/947"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=64271"}],"version-history":[{"count":9,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/64271\/revisions"}],"predecessor-version":[{"id":65487,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/64271\/revisions\/65487"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=64271"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=64271"},{"taxonomy":"post_tag","
embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=64271"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}