{"id":58430,"date":"2023-09-11T11:35:18","date_gmt":"2023-09-11T06:05:18","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=58430"},"modified":"2023-09-14T11:42:33","modified_gmt":"2023-09-14T06:12:33","slug":"data-quality-with-pydeequ-a-comprehensive-guide","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/data-quality-with-pydeequ-a-comprehensive-guide\/","title":{"rendered":"Data Quality with PyDeequ: A Comprehensive Guide"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">Inadequate data quality can adversely affect both machine learning models and the decision-making process within a business. Unaddressed data errors can result in lasting repercussions, manifesting as blemishes and jolts. It is imperative in today&#8217;s landscape to implement automated tools for monitoring data quality, enabling the timely identification and resolution of issues. This proactive approach fosters greater confidence in the integrity of data and bolsters efficiency in data handling. Consequently, adopting automated data quality monitoring should be regarded as a strategic imperative for enhancing and sustaining an organization&#8217;s data systems.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In this step-by-step guide, we\u2019ll take a look at the different types of data quality checks available in AWS pyDeequ, so you can get a better understanding of how it works.<\/span><\/p>\n<h2><strong>What is PyDeequ?<\/strong><\/h2>\n<p><span style=\"font-weight: 400;\">PyDeequ is a Python library that provides a set of tools for data quality assessment and validation in large datasets. It allows users to define data quality checks, measure data quality metrics, and identify issues or anomalies within their data. 
PyDeequ is often used in data preprocessing and quality assurance tasks in data analytics and machine learning workflows.<\/span><\/p>\n<h2><strong>Setting up PyDeequ on PySpark<\/strong><\/h2>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-58426 size-large\" src=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.07-AM-1024x96.png\" alt=\"\" width=\"625\" height=\"59\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.07-AM-1024x96.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.07-AM-300x28.png 300w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.07-AM-768x72.png 768w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.07-AM-624x59.png 624w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.07-AM.png 1254w\" sizes=\"(max-width: 625px) 100vw, 625px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Once you have installed these dependencies, you&#8217;ll be ready to use PyDeequ for data quality checks and assessments on your datasets within a PySpark environment.<\/span><\/p>\n<h2><strong>Implementing PyDeequ<\/strong><\/h2>\n<h3><span style=\"font-weight: 400;\">Main components<\/span><\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-58427 size-full\" src=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.23-AM.png\" alt=\"\" width=\"900\" height=\"480\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.23-AM.png 900w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.23-AM-300x160.png 300w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.23-AM-768x410.png 768w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.23-AM-624x333.png 624w\" sizes=\"(max-width: 900px) 100vw, 900px\" \/><\/p>\n<p><b>Profiling<\/b><span style=\"font-weight: 400;\">: 
Profiling is the process of gathering basic statistics and information about the data. In PyDeequ, the profiler provides summary statistics, data type information, and basic data distribution insights for each column in your dataset. It helps you understand the characteristics of your data quickly.<\/span><\/p>\n<p><b>Analyzer<i>:<\/i><\/b><span style=\"font-weight: 400;\"> The analyzer component goes beyond profiling and allows you to compute more advanced statistics and metrics for your data. This includes metrics like uniqueness, completeness, and other custom-defined metrics. Analyzers help you gain a deeper understanding of data quality issues in your dataset.<\/span><\/p>\n<p><b>Constraint Suggestion<i>: <\/i><\/b><span style=\"font-weight: 400;\">Constraint suggestion is a powerful feature of PyDeequ that automatically generates data quality constraints based on the profiling and analysis results. It suggests constraints such as uniqueness, completeness, and data type constraints that you can apply to your data to improve its quality.<\/span><\/p>\n<p><b>Verification<i>:<\/i><\/b><span style=\"font-weight: 400;\"> Verification is the process of running data quality checks on your dataset using the defined constraints. PyDeequ&#8217;s verification component enables you to create and run data quality checks to validate whether your data conforms to the defined constraints. It provides detailed reports on the results of these checks, helping you identify data quality issues.<\/span><\/p>\n<p><b>Data Quality Metrics:<\/b><span style=\"font-weight: 400;\"> PyDeequ includes a set of predefined data quality metrics that you can use to measure and monitor the quality of your data. 
These metrics include measures like data completeness, distinctness, and uniformity, among others.<\/span><\/p>\n<p><b>Anomaly Detection<i>:<\/i><\/b><span style=\"font-weight: 400;\"> PyDeequ also offers anomaly detection capabilities to identify unusual or unexpected patterns in your data. This can be particularly useful for spotting outliers or data points that deviate significantly from the norm.<\/span><\/p>\n<p><b>Check Builder:<\/b><span style=\"font-weight: 400;\"> PyDeequ provides a convenient builder-style Check API that allows you to construct data quality checks programmatically by chaining constraints. You can define custom checks based on your specific data quality requirements and apply them to your data.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">These main components of PyDeequ work together to help you assess, analyze, and improve the quality of your data within a PySpark environment. By leveraging these components, you can gain valuable insights into your data and ensure it meets the necessary quality standards for your analytics and machine learning projects.<\/span><\/p>\n<h3><strong>Available Data Quality Checks<\/strong><\/h3>\n<p><b><i>\u25cf <\/i>Completeness Checks<i>:<\/i><\/b><span style=\"font-weight: 400;\"> These checks make sure that the specified columns in your dataset are filled with non-null values. AWS PyDeequ offers functions to check the completeness of columns and find missing values.<\/span><\/p>\n<p><b><i>\u25cf <\/i>Uniqueness Checks: <\/b><span style=\"font-weight: 400;\">Uniqueness checks make sure there are no duplicates in a particular column or set of columns. You can also determine the uniqueness rate of columns using AWS PyDeequ.<\/span><\/p>\n<p><b><i>\u25cf<\/i> Consistency Checks<i>: <\/i><\/b><span style=\"font-weight: 400;\">The purpose of consistency checks is to ensure consistency across data values. With the help of AWS PyDeequ, you can identify inconsistent values in categorical columns. 
This way, you can detect data entry errors or discrepancies.<\/span><\/p>\n<p><b><i>\u25cf <\/i>Functional Dependency Checks<i>: <\/i><\/b><span style=\"font-weight: 400;\">These checks determine whether the values in one column determine the values in another column.<\/span><\/p>\n<p><b><i>\u25cf <\/i>Pattern Checks<i>:<\/i><\/b><span style=\"font-weight: 400;\"> In a pattern check, data is validated against pre-defined patterns (such as email addresses or phone numbers). With AWS PyDeequ, you can check whether your data matches those patterns.<\/span><\/p>\n<p><b><i>\u25cf <\/i>Value Distribution Checks<i>: <\/i><\/b><span style=\"font-weight: 400;\">Value distribution checks give you an idea of how the values are distributed within a single column. With the help of AWS PyDeequ, you can see how the unique values are distributed, which shows where the data is skewed and where it is balanced.<\/span><\/p>\n<p><b><i>\u25cf<\/i> Custom Checks:<\/b><span style=\"font-weight: 400;\"> With AWS PyDeequ, you can create your own checks based on your business rules. 
With this flexibility, you can address domain-specific data quality issues.<\/span><\/p>\n<h3><b>Definitions of supported checks<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">PyDeequ provides about 40 constraints, covering the check categories above, that you can verify against your dataset.<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><span style=\"font-weight: 400;\">Constraint<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Definition<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasSize<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the data frame size (row count).<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">isComplete<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts that a column is complete (contains no null values).<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasCompleteness<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the ratio non-null values\/total_values of a column.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">areComplete<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts that all listed columns have non-null values.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">areAnyComplete<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts any completion in the combined set of columns.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">isUnique<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on a column&#8217;s uniqueness.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasUniqueness<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on uniqueness in a single or combined set of key columns.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasDistinctness<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on distinctness in a single or combined set of key columns.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 
400;\">hasUniqueValueRatio<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the unique value ratio in a single or combined set of key columns.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasNumberOfDistinctValues<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the number of distinct values in the column.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasHistogramValues<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the column\u2019s value distribution.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasEntropy<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on a column&#8217;s entropy.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasMutualInformation<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the mutual information between two columns.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasApproxQuantile<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on an approximated quantile.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasMinLength<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the minimum length of the column.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasMaxLength<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the maximum length of the column.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasMin<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the minimum value in the column.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasMax<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the maximum value in the column.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasMean<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the mean of column values.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 
400;\">hasSum<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the sum of column values.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasStandardDeviation<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the standard deviation of the column.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasApproxCountDistinct<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the approximate count distinct of the given column.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasCorrelation<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the Pearson correlation between two columns.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">haveCompleteness<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on completed rows in a combined set of columns.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">haveAnyCompleteness<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on any completion in the combined set of columns.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">satisfies<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the given condition on the data frame (WHERE clause).<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasPattern<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Matches the regex pattern.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">containsCreditCardNumber<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Verifies against a credit card pattern.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">containsEmail<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Verifies against an email pattern.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">containsURL<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Verifies against a URL pattern.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span 
style=\"font-weight: 400;\">containsSocialSecurityNumber<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Verifies against the Social Security number pattern for the US.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">hasDataType<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the fraction of rows that conform to the given data type.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">isNonNegative<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts that a column contains no negative values.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">isPositive<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts on the ratio positive_values\/total_values.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">isLessThan<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts that in each row, the value of columnA &lt; the value of columnB.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">isLessThanOrEqualTo<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts that in each row, the value of columnA \u2264 the value of columnB.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">isGreaterThan<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts that in each row, the value of columnA &gt; the value of columnB.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">isGreaterThanOrEqualTo<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts that in each row, the value of columnA \u2265 the value of columnB.<\/span><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">isContainedIn<\/span><\/td>\n<td><span style=\"font-weight: 400;\">Asserts that every non-null value in a column is contained in a set of predefined values.<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3><b>Example:<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">PyDeequ supports simple sequential addition of constraints on any PySpark 
data frame and returns the quality check report in both JSON\/CSV formats.<\/span><\/p>\n<p><strong>Below is a sample code for adding constraints:<\/strong><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-58428 size-large\" src=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.41-AM-1024x156.png\" alt=\"\" width=\"625\" height=\"95\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.41-AM-1024x156.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.41-AM-300x46.png 300w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.41-AM-768x117.png 768w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.41-AM-624x95.png 624w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.41-AM.png 1250w\" sizes=\"(max-width: 625px) 100vw, 625px\" \/><\/p>\n<p><strong>Generated Report:<\/strong><\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-58429 size-large\" src=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.51-AM-1024x240.png\" alt=\"\" width=\"625\" height=\"146\" srcset=\"\/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.51-AM-1024x240.png 1024w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.51-AM-300x70.png 300w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.51-AM-768x180.png 768w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.51-AM-624x146.png 624w, \/blog\/wp-ttn-blog\/uploads\/2023\/09\/Screenshot-2023-09-08-at-10.46.51-AM.png 1204w\" sizes=\"(max-width: 625px) 100vw, 625px\" \/><\/p>\n<h2><strong>\u00a0Conclusion<\/strong><\/h2>\n<p><span style=\"font-weight: 400;\">Data quality is one of the most important aspects of your data-driven business. AWS PyDeequ provides a complete suite of data quality controls that you can easily integrate into your workflow. 
With the help of its functions, you can improve the accuracy, uniformity, and dependability of your data sets, resulting in more informed decisions and insights.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In this step-by-step guide, we\u2019ve covered everything you need to know about AWS PyDeequ&#8217;s various data quality controls, including completeness, uniqueness, custom, and more. With this knowledge in hand, you\u2019re better equipped to make sure your data is as good as it can be, helping your organization reach its full potential.<\/span><\/p>\n<div class=\"ap-custom-wrapper\"><\/div><!--ap-custom-wrapper-->","protected":false},"excerpt":{"rendered":"<p>Inadequate data quality can adversely affect both machine learning models and the decision-making process within a business. Unaddressed data errors can result in lasting repercussions, manifesting as blemishes and jolts. It is imperative in today&#8217;s landscape to implement automated tools for monitoring data quality, enabling the timely identification and resolution of issues. 
This proactive approach [&hellip;]<\/p>\n","protected":false},"author":1583,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":773},"categories":[1395],"tags":[5412],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/58430"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/1583"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=58430"}],"version-history":[{"count":2,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/58430\/revisions"}],"predecessor-version":[{"id":58530,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/58430\/revisions\/58530"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=58430"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=58430"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=58430"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}