{"id":58125,"date":"2023-08-30T15:36:42","date_gmt":"2023-08-30T10:06:42","guid":{"rendered":"https:\/\/www.tothenew.com\/blog\/?p=58125"},"modified":"2023-09-04T17:00:10","modified_gmt":"2023-09-04T11:30:10","slug":"from-pandas-to-pyspark","status":"publish","type":"post","link":"https:\/\/www.tothenew.com\/blog\/from-pandas-to-pyspark\/","title":{"rendered":"From Pandas to Pyspark"},"content":{"rendered":"<p>R<span style=\"font-weight: 400;\">ecently converted a Python script that relied on Pandas DataFrames to utilize PySpark DataFrames instead. The main goal is to transition data manipulation from the localized context of Pandas to the distributed processing capabilities offered by PySpark. This shift to PySpark DataFrames enables us to enhance scalability and efficiency by harnessing the power of distributed computing.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Pandas stands as the indispensable library for data scientists, serving as a crucial tool for individuals seeking to manipulate and analyze data effectively.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Nevertheless, while Pandas proves its usefulness and encompasses a wide range of functionalities, its limitations become evident when working with sizable datasets. To overcome this obstacle, a transition to PySpark becomes imperative as it enables distributed computing across multiple machines\u2014an advantage Pandas lacks. This article aims to simplify the process for newcomers to PySpark by offering code snippets that provide equivalents to various Pandas methods. With these readily available examples, navigating PySpark will be a smoother experience for aspiring practitioners.\u00a0<\/span><\/p>\n<h2><\/h2>\n<h2><span style=\"text-decoration: underline;\"><b>Getting started<\/b><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">To establish a solid foundation for our future discussions, it is imperative to start by importing the necessary libraries. 
<\/span><\/p>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">import pandas as pd<\/span>\r\n<span style=\"font-weight: 400;\">import pyspark.sql.functions as F<\/span><\/pre>\n<\/blockquote>\n<p><span style=\"font-weight: 400;\">The entry point to PySpark functionality is the SparkSession class. Through a SparkSession instance, you can create DataFrames, apply all kinds of transformations, read and write files, and more. To define a SparkSession, you can use the following:<\/span><\/p>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">from pyspark.sql import SparkSession<\/span>\r\n<span style=\"font-weight: 400;\">spark = SparkSession \\<\/span>\r\n<span style=\"font-weight: 400;\">    .builder \\<\/span>\r\n<span style=\"font-weight: 400;\">    .appName('SparkByExamples.com') \\<\/span>\r\n<span style=\"font-weight: 400;\">    .getOrCreate()<\/span><\/pre>\n<\/blockquote>\n<p><span style=\"font-weight: 400;\">With all the preparations in place, we can now delve into the comparison between Pandas and PySpark. 
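<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As a side note, you can move a DataFrame between the two worlds at any point: toPandas() collects a PySpark DataFrame to the driver as a Pandas DataFrame, and createDataFrame() goes the other way. The sketch below assumes the DataFrame is small enough to fit in the driver\u2019s memory:<\/span><\/p>\n<blockquote>\n<pre><span style=\"font-weight: 400;\"># PySpark to Pandas (collects all rows to the driver)<\/span>\r\n<span style=\"font-weight: 400;\">pandas_df = df.toPandas()<\/span>\r\n<span style=\"font-weight: 400;\"># Pandas back to PySpark<\/span>\r\n<span style=\"font-weight: 400;\">spark_df = spark.createDataFrame(pandas_df)<\/span><\/pre>\n<\/blockquote>\n<p><span style=\"font-weight: 400;\">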
Let&#8217;s explore the key differences and the corresponding counterparts between these two powerful tools.<\/span><\/p>\n<h2><span style=\"text-decoration: underline;\"><b>DataFrame creation<\/b><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">First, let\u2019s define a data sample we\u2019ll be using:<\/span><\/p>\n<pre><span style=\"font-weight: 400;\">columns = [\"employee\", \"department\", \"state\", \"salary\", \"age\"]<\/span>\r\n<span style=\"font-weight: 400;\">data = [(\"Alain\", \"Sales\", \"Paris\", 60000, 34),<\/span>\r\n<span style=\"font-weight: 400;\">        (\"Ahmed\", \"Sales\", \"Lyon\", 80000, 45),<\/span>\r\n<span style=\"font-weight: 400;\">        (\"Ines\", \"Sales\", \"Nice\", 55000, 30),<\/span>\r\n<span style=\"font-weight: 400;\">        (\"Fatima\", \"Finance\", \"Paris\", 90000, 28),<\/span>\r\n<span style=\"font-weight: 400;\">        (\"Marie\", \"Finance\", \"Nantes\", 100000, 40)]<\/span><\/pre>\n<p><span style=\"font-weight: 400;\">To create a <\/span><b>Pandas<\/b><span style=\"font-weight: 400;\"> DataFrame, we can use the following:<\/span><\/p>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">df = pd.DataFrame(data=data, columns=columns)<\/span>\r\n<span style=\"font-weight: 400;\"># Show a few lines<\/span>\r\n<span style=\"font-weight: 400;\">df.head(2)<\/span><\/pre>\n<\/blockquote>\n<h3><b>PySpark<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">df = spark.createDataFrame(data).toDF(*columns)<\/span>\r\n<span style=\"font-weight: 400;\"># Show a few lines<\/span>\r\n<span style=\"font-weight: 400;\">df.limit(2).show()<\/span><\/pre>\n<\/blockquote>\n<h2><span style=\"text-decoration: underline;\"><b>Specifying column types<\/b><\/span><\/h2>\n<h3><b>Pandas<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">types_dict = 
{<\/span>\r\n<span style=\"font-weight: 400;\">    \"employee\": pd.Series([r[0] for r in data], dtype='str'),<\/span>\r\n<span style=\"font-weight: 400;\">    \"department\": pd.Series([r[1] for r in data], dtype='str'),<\/span>\r\n<span style=\"font-weight: 400;\">    \"state\": pd.Series([r[2] for r in data], dtype='str'),<\/span>\r\n<span style=\"font-weight: 400;\">    \"salary\": pd.Series([r[3] for r in data], dtype='int'),<\/span>\r\n<span style=\"font-weight: 400;\">    \"age\": pd.Series([r[4] for r in data], dtype='int')<\/span>\r\n<span style=\"font-weight: 400;\">}<\/span>\r\n<span style=\"font-weight: 400;\">df = pd.DataFrame(types_dict)<\/span><\/pre>\n<\/blockquote>\n<p><span style=\"font-weight: 400;\">You can check your types by executing this line:<\/span><\/p>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">df.dtypes<\/span><\/pre>\n<\/blockquote>\n<h3><b>PySpark<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">from pyspark.sql.types import StructType, StructField, StringType, IntegerType<\/span>\r\n<span style=\"font-weight: 400;\">schema = StructType([<\/span>\r\n<span style=\"font-weight: 400;\">    StructField(\"employee\", StringType(), True),<\/span>\r\n<span style=\"font-weight: 400;\">    StructField(\"department\", StringType(), True),<\/span>\r\n<span style=\"font-weight: 400;\">    StructField(\"state\", StringType(), True),<\/span>\r\n<span style=\"font-weight: 400;\">    StructField(\"salary\", IntegerType(), True),<\/span>\r\n<span style=\"font-weight: 400;\">    StructField(\"age\", IntegerType(), True)<\/span>\r\n<span style=\"font-weight: 400;\">])<\/span>\r\n<span style=\"font-weight: 400;\">df = spark.createDataFrame(data=data, schema=schema)<\/span><\/pre>\n<\/blockquote>\n<p><span style=\"font-weight: 400;\">You can check your DataFrame\u2019s schema by executing 
:<\/span><\/p>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">df.dtypes<\/span>\r\n<span style=\"font-weight: 400;\"># OR<\/span>\r\n<span style=\"font-weight: 400;\">df.printSchema()<\/span><\/pre>\n<\/blockquote>\n<h2><span style=\"text-decoration: underline;\"><b>Reading and writing files<\/b><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Reading and writing files is very similar in Pandas and PySpark. The syntax for each is the following:<\/span><\/p>\n<h3><b>Pandas<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\"># header=0 uses the first row as column names<\/span>\r\n<span style=\"font-weight: 400;\">df = pd.read_csv(path, sep=';', header=0)<\/span>\r\n<span style=\"font-weight: 400;\">df.to_csv(path, sep=';', index=False)<\/span><\/pre>\n<\/blockquote>\n<h3><b>PySpark<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">df = spark.read.csv(path, sep=';', header=True)<\/span>\r\n<span style=\"font-weight: 400;\"># coalesce(n) limits the output to n files<\/span>\r\n<span style=\"font-weight: 400;\">df.coalesce(n).write.mode('overwrite').csv(path, sep=';', header=True)<\/span><\/pre>\n<\/blockquote>\n<h2><span style=\"text-decoration: underline;\"><b>Add a column<\/b><\/span><\/h2>\n<h3><b>Pandas<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">score = [1, 2, 5, 7, 4]<\/span>\r\n<span style=\"font-weight: 400;\">df['score'] = score<\/span><\/pre>\n<\/blockquote>\n<h3><b>PySpark<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\"># PySpark DataFrames are immutable and have no positional index,<\/span>\r\n<span style=\"font-weight: 400;\"># so a Python list cannot be assigned directly; F.lit adds a constant:<\/span>\r\n<span style=\"font-weight: 400;\">df = df.withColumn('score', F.lit(0))<\/span>\r\n<span style=\"font-weight: 400;\"># to attach per-row values from a list, join on a row index instead<\/span><\/pre>\n<\/blockquote>\n<h2><span style=\"text-decoration: underline;\"><b>Selecting columns<\/b><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Selecting certain columns in Pandas is done like below:<\/span><\/p>\n<h3><b>Pandas<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">columns_subset = ['employee', 'salary']<\/span>\r\n<span style=\"font-weight: 400;\">df[columns_subset].head()<\/span>\r\n<span style=\"font-weight: 400;\">df.loc[:, columns_subset].head()<\/span><\/pre>\n<\/blockquote>\n<p><span 
style=\"font-weight: 400;\">Whereas in PySpark, we need to use the select method with a list of columns:<\/span><\/p>\n<h3><b>PySpark<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">df.select(['employee', 'salary']).show(5)<\/span><\/pre>\n<\/blockquote>\n<h2><span style=\"text-decoration: underline;\"><b>Concatenate DataFrames<\/b><\/span><\/h2>\n<h3><b>Pandas<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">df = pd.concat([df1, df2], ignore_index=True)<\/span><\/pre>\n<\/blockquote>\n<h3><b>PySpark<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\"># union matches columns by position; use unionByName to match by name<\/span>\r\n<span style=\"font-weight: 400;\">df = df1.union(df2)<\/span><\/pre>\n<\/blockquote>\n<h2><span style=\"text-decoration: underline;\"><b>Aggregations<\/b><\/span><\/h2>\n<p><span style=\"font-weight: 400;\">For simple aggregations, the syntax is almost identical between Pandas and PySpark:<\/span><\/p>\n<h3><b>Pandas<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">df.groupby('department').agg({'employee': 'count', 'salary': 'max', 'age': 'mean'})<\/span><\/pre>\n<\/blockquote>\n<h3><b>PySpark<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">df.groupBy('department').agg({'employee': 'count', 'salary': 'max', 'age': 'mean'})<\/span><\/pre>\n<\/blockquote>\n<p><span style=\"font-weight: 400;\">However, the results need some tweaking to match between Pandas and PySpark.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In Pandas, the column you group by becomes the index. 
<\/span><span style=\"font-weight: 400;\">To get it back as a column, we need to apply the reset_index method:<\/span><\/p>\n<h3><b>Pandas<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">df.groupby('department').agg({'employee': 'count', 'salary': 'max', 'age': 'mean'}).reset_index()<\/span><\/pre>\n<\/blockquote>\n<p><span style=\"font-weight: 400;\">In <\/span><b>PySpark<\/b><span style=\"font-weight: 400;\">, the column names are modified to reflect the performed aggregation in the resulting data frame (for example, max(salary)). If you wish to avoid this, you\u2019ll need to build the aggregations with column expressions and the alias method, e.g. F.max('salary').alias('salary').<\/span><\/p>\n<h2><span style=\"text-decoration: underline;\"><b>Apply a transformation over a column<\/b><\/span><\/h2>\n<h3><b>Pandas<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">df['new_score'] = df['score'].apply(lambda x: x * 5)<\/span><\/pre>\n<\/blockquote>\n<h3><b>PySpark<\/b><\/h3>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">from pyspark.sql.types import FloatType<\/span>\r\n<span style=\"font-weight: 400;\"># the UDF must return the declared type, so cast the result to float<\/span>\r\n<span style=\"font-weight: 400;\">times_five = F.udf(lambda x: float(x * 5), FloatType())<\/span>\r\n<span style=\"font-weight: 400;\">df = df.withColumn('new_score', times_five('score'))<\/span>\r\n<span style=\"font-weight: 400;\"># a native expression, F.col('score') * 5, avoids the UDF overhead<\/span><\/pre>\n<\/blockquote>\n<h2><b>Conclusion<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">In conclusion, the striking similarity in syntax between Pandas and PySpark greatly facilitates the transition from one framework to the other, easing the learning curve and minimizing the challenges of adapting to a new environment.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Utilizing PySpark offers a significant advantage when dealing with large datasets due to its capability for parallel computing. 
However, if the dataset being handled is small, it is often more efficient to switch back to the versatile and widely used Pandas library: at that scale, Pandas allows quicker and more streamlined data processing, making it the preferred choice. Please refer to our blogs for more insightful content and comment if you have any questions regarding the topic.<\/span><\/p>\n<div class=\"ap-custom-wrapper\"><\/div><!--ap-custom-wrapper-->","protected":false},"excerpt":{"rendered":"<p>We recently converted a Python script that relied on Pandas DataFrames to utilize PySpark DataFrames instead. The main goal is to transition data manipulation from the localized context of Pandas to the distributed processing capabilities offered by PySpark. This shift to PySpark DataFrames enables us to enhance scalability and efficiency by harnessing the power of distributed [&hellip;]<\/p>\n","protected":false},"author":1624,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"iawp_total_views":25},"categories":[1395,4308,4682,4831],"tags":[],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/58125"}],"collection":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/users\/1624"}],"replies":[{"embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/comments?post=58125"}],"version-history":[{"count":3,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/58125\/revisions"}],"predecessor-version":[{"id":58296,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/posts\/58125\/revisions\/58296"}],"wp:attachment":[{"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/media?parent=58125"}],"wp:term":[{"taxonomy":"category","embedd
able":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/categories?post=58125"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.tothenew.com\/blog\/wp-json\/wp\/v2\/tags?post=58125"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}