{"id":58740,"date":"2023-09-29T19:37:18","date_gmt":"2023-09-29T14:07:18","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=58740"},"modified":"2023-10-05T19:42:05","modified_gmt":"2023-10-05T14:12:05","slug":"spark-with-pytest-shaping-the-future-of-data-testing","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/spark-with-pytest-shaping-the-future-of-data-testing\/","title":{"rendered":"Spark with Pytest: Shaping the Future of Data Testing"},"content":{"rendered":"<p>PySpark is an open-source, distributed computing framework that provides an interface for programming Apache Spark with the Python programming language, enabling the processing of large-scale data sets across clusters of computers. PySpark is often used to process and learn from voluminous event data. Apache Spark exposes the DataFrame and Dataset APIs, which enable writing very concise code, so concise that it is almost tempting to skip unit tests!<\/p>\n<p>In this post, we\u2019ll dive into writing unit tests using my favorite test framework for Python code: Pytest! Before we begin, let\u2019s take a quick peek at unit testing.<\/p>\n<h2 data-pm-slice=\"1 1 []\" data-en-clipboard=\"true\"><b>Unit Testing<\/b><\/h2>\n<p data-pm-slice=\"1 1 []\" data-en-clipboard=\"true\">Unit testing PySpark code is crucial to ensure the correctness and robustness of your data processing pipelines. Let\u2019s see with an example why unit testing is necessary:<\/p>\n<p>Imagine you are working on a PySpark project that involves processing customer data for an e-commerce platform. Your task is to implement transformation logic that calculates the total revenue generated by each customer. 
This transformation involves several complex operations, including filtering, aggregation, and joining data from multiple sources.<\/p>\n<h2 data-pm-slice=\"1 1 []\" data-en-clipboard=\"true\"><b>Why You Need Unit Testing<\/b><\/h2>\n<div data-pm-slice=\"1 1 []\" data-en-clipboard=\"true\">\n<p data-pm-slice=\"1 1 []\" data-en-clipboard=\"true\"><b>Data Quality Assurance<\/b>: Unit tests can check the quality of the data transformations. For instance, you can write tests to ensure that the total revenue is always a positive number, or that no null values are present in the output.<\/p>\n<p><b>Regression Detection<\/b>: Over time, your codebase may evolve. You or your colleagues may make changes to the transformation logic. Unit tests act as a safety net, catching regressions or unintended side effects when code changes occur.<\/p>\n<p><b>Edge Cases<\/b>: Unit tests can cover edge cases that might not be immediately obvious. For instance, you could have tests to verify the behavior when a customer has no purchase history or when there&#8217;s a sudden increase in data volume.<\/p>\n<p><b>Complex Business Logic<\/b>: In real-world scenarios, transformation logic can become quite complex. Unit tests allow you to break down this complexity into testable components, ensuring that each part of the transformation works as intended.<\/p>\n<p><b>Maintainability<\/b>: Well-structured unit tests can serve as documentation for your code. 
They make it easier for new team members to understand the intended behavior of your transformations and how they fit into the larger data processing pipeline.<\/p>\n<p><b>Cost Savings<\/b>: Identifying and fixing issues early in the development cycle is more cost-effective than discovering them in a production environment, where data quality problems can have significant financial implications.<\/p>\n<\/div>\n<div>\n<h2 data-pm-slice=\"1 1 []\" data-en-clipboard=\"true\"><b>Characteristics of a Unit Test<\/b><\/h2>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-58734 size-full\" src=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/test-1.png\" alt=\"\" width=\"396\" height=\"358\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/test-1.png 396w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/test-1-300x271.png 300w\" sizes=\"(max-width: 396px) 100vw, 396px\" \/><\/div>\n<p>&nbsp;<\/p>\n<ul data-pm-slice=\"0 1 []\" data-en-clipboard=\"true\">\n<li>\n<div><b>Focused<\/b>: Each test should test a single behavior or piece of functionality.<\/div>\n<\/li>\n<li>\n<div><b>Fast<\/b>: Tests must run quickly so you can iterate and get feedback fast.<\/div>\n<\/li>\n<li>\n<div><b>Isolated<\/b>: Each test should be responsible for testing a specific functionality and must not depend on external factors in order to run successfully.<\/div>\n<\/li>\n<li>\n<div><b>Concise<\/b>: Creating a test shouldn&#8217;t require lots of boilerplate code to mock or create complex objects just to make the test run.<\/div>\n<\/li>\n<\/ul>\n<div>\n<h2 data-pm-slice=\"1 1 []\" data-en-clipboard=\"true\"><b>Pytest<\/b><\/h2>\n<\/div>\n<p data-pm-slice=\"1 1 []\" data-en-clipboard=\"true\">Pytest is an open-source testing framework for Python that simplifies and enhances the process of writing and running tests, making it easier to ensure the quality and correctness of Python code. 
When it comes to PySpark pipelines, writing focused, fast, isolated, and concise unit tests can be challenging.<\/p>\n<p data-pm-slice=\"1 1 []\" data-en-clipboard=\"true\">Some of the standout features of Pytest:<\/p>\n<div data-pm-slice=\"1 1 []\" data-en-clipboard=\"true\">\n<ul data-pm-slice=\"0 3 []\" data-en-clipboard=\"true\">\n<li>Writing tests in Pytest is less verbose<\/li>\n<li>Provides great support for fixtures (including reusable fixtures with parameterization)<\/li>\n<li>Has great debugging support with contexts<\/li>\n<li>Makes parallel\/distributed running of tests easy<\/li>\n<li>Has well-thought-out command-line options<\/li>\n<\/ul>\n<p data-pm-slice=\"1 1 []\" data-en-clipboard=\"true\">Spark supports a local mode that runs a cluster on your machine, which makes it easy to unit test. To run Spark in local mode, you typically set up a SparkSession in your test script and configure it to run locally.<\/p>\n<p>Let\u2019s start by writing a unit test for the following simple transformation function!<\/p>\n<\/div>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-58737\" src=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.22.17-PM.png\" alt=\"\" width=\"1346\" height=\"814\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.22.17-PM.png 1346w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.22.17-PM-300x181.png 300w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.22.17-PM-1024x619.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.22.17-PM-768x464.png 768w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.22.17-PM-624x377.png 624w\" sizes=\"(max-width: 1346px) 100vw, 1346px\" \/><\/p>\n<p>To test this function, we need a spark_session fixture. A test fixture is a fixed state of a set of objects that can be used as a consistent baseline for running tests. 
We\u2019ll create a local-mode SparkSession and wrap it in a Pytest fixture:<\/p>\n<div data-pm-slice=\"1 1 []\" data-en-clipboard=\"true\"><\/div>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-58738\" src=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.23.32-PM.png\" alt=\"\" width=\"1000\" height=\"186\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.23.32-PM.png 1000w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.23.32-PM-300x56.png 300w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.23.32-PM-768x143.png 768w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.23.32-PM-624x116.png 624w\" sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><\/p>\n<p>Creating a SparkSession (even in local mode) takes time, so we want to reuse it. The scope=\"session\" argument does exactly that, reusing the same session for all tests in the run. You can also set scope=\"module\" to get a fresh session for the tests in each module.<\/p>\n<p>Now, the SparkSession can be used to write a unit test for the transformation function:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-58739\" src=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.26.36-PM.png\" alt=\"\" width=\"1254\" height=\"618\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.26.36-PM.png 1254w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.26.36-PM-300x148.png 300w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.26.36-PM-1024x505.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.26.36-PM-768x378.png 768w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-26-at-8.26.36-PM-624x308.png 624w\" sizes=\"(max-width: 1254px) 100vw, 1254px\" \/><\/p>\n<h2><b>Code Coverage<\/b><\/h2>\n<p>We can run the test using the following command, and 
it will generate the coverage report in the specified directory:<\/p>\n<p>python3 -m pytest --cov --cov-report=html:coverage_re tests\/com\/code\/quality\/test_simple_transformation.py<\/p>\n<div data-codeblock=\"true\" data-line-wrapping=\"false\">\n<img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-58735\" src=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/test1.png\" alt=\"\" width=\"3456\" height=\"414\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/test1.png 3456w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/test1-300x36.png 300w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/test1-1024x123.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/test1-768x92.png 768w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/test1-1536x184.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/test1-2048x245.png 2048w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/test1-624x75.png 624w\" sizes=\"(max-width: 3456px) 100vw, 3456px\" \/><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-58736\" src=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/test2.png\" alt=\"\" width=\"1572\" height=\"1116\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/test2.png 1572w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/test2-300x213.png 300w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/test2-1024x727.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/test2-768x545.png 768w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/test2-1536x1090.png 1536w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/test2-624x443.png 624w\" sizes=\"(max-width: 1572px) 100vw, 1572px\" \/><\/div>\n<div class=\"ap-custom-wrapper\"><\/div><!--ap-custom-wrapper-->","protected":false},"excerpt":{"rendered":"<p>PySpark is an open-source, distributed computing framework that provides an interface for programming Apache Spark with the Python programming language, enabling the processing of large-scale data sets across clusters of computers. PySpark is often used to process and learn from voluminous event data. 
Apache Spark exposes DataFrames and Datasets API that enables writing very concise [&hellip;]<\/p>\n","protected":false},"author":1653,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":90},"categories":[1395,4831,1816],"tags":[1593,5474,5442,1358,1606,272],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/58740"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/1653"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=58740"}],"version-history":[{"count":3,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/58740\/revisions"}],"predecessor-version":[{"id":59082,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/58740\/revisions\/59082"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=58740"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=58740"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=58740"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}