Translating the Same Code Across Python, SQL, and PySpark

David K. Kim
2 min read · May 26, 2021
Photo by Alex Chumak on Unsplash

When I analyze data, there are times when I have to use Python, while on other occasions I have to use SQL or Spark. When I’m dealing with large amounts of data in Apache Hive, I use Spark and SQL (or HQL); when I just need to import and analyze a CSV file, I use Python.

That being said, there are times when I get mixed up: I’m coding in one language and forget how to write a simple line of code that I know perfectly well in another. I think other data analysts have run into this too, so this post lays out the same instructions side by side in each language.

The Python examples use pandas DataFrames. For Spark I normally import the PySpark library, so the Spark examples are written in PySpark.

Show First 5 Rows of Data

Python

df.head()

SQL

SELECT * FROM table LIMIT 5;

Spark

df.show(5)
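To see the pandas and SQL versions side by side on real rows, here is a minimal sketch. The sample data and the table name `demo` are made up for illustration, and I use Python's built-in sqlite3 module as a stand-in for whatever SQL engine you actually query. PySpark is left out because it needs a running Spark session.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data; the table name "demo" is invented for this sketch.
df = pd.DataFrame({"id": range(1, 11), "score": [3, 9, 1, 7, 5, 8, 2, 6, 4, 10]})

# Load the same rows into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
df.to_sql("demo", conn, index=False)

first_pandas = df.head()  # head() returns the first 5 rows by default
first_sql = pd.read_sql_query("SELECT * FROM demo LIMIT 5", conn)

print(first_pandas)
print(first_sql)
```

One thing to keep in mind: without an ORDER BY, a SQL engine is generally free to return any 5 rows, whereas `df.head()` always follows the DataFrame's current row order.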

Sorting Values

  • The examples below sort in descending order

Python

df.sort_values('column_name', ascending=False)

SQL

SELECT * FROM table ORDER BY column_name DESC

Spark

from pyspark.sql.functions import col
df.sort(col('column_name'), ascending=False)
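Here is a small sketch of the descending sort in pandas and SQL on the same rows. The data and the table name `demo` are hypothetical, and sqlite3 stands in for your real SQL engine.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [3, 9, 1, 7]})

conn = sqlite3.connect(":memory:")
df.to_sql("demo", conn, index=False)

# pandas: sort_values returns a new, sorted DataFrame (df itself is unchanged).
sorted_pandas = df.sort_values("score", ascending=False)

# SQL: the same ordering expressed with ORDER BY ... DESC.
sorted_sql = pd.read_sql_query("SELECT * FROM demo ORDER BY score DESC", conn)

print(sorted_pandas["name"].tolist())
print(sorted_sql["name"].tolist())
```

Both print the names ordered from the highest score to the lowest. In PySpark, `col('column_name').desc()` is another way to ask for the same ordering.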

Counting Number of Rows

Python

len(df)

SQL

SELECT COUNT(*) FROM table

Spark

df.count()
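As a quick check that the two counts agree, here is a minimal sketch with made-up data and a hypothetical table named `demo`, again using sqlite3 in place of a real database.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [3, 9, 1, 7]})

conn = sqlite3.connect(":memory:")
df.to_sql("demo", conn, index=False)

n_pandas = len(df)  # number of rows; df.shape[0] gives the same value
n_sql = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]

print(n_pandas, n_sql)
```

Note that pandas also has a `df.count()` method, but it counts non-null values per column instead of rows, so it is not the equivalent of the PySpark `df.count()`.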

Renaming Columns

  • Changing the name of column ‘A’ to ‘B’

Python

df.rename(columns={'A':'B'}, inplace = True)

SQL

SELECT A AS B FROM table

Spark

df.withColumnRenamed('A', 'B')
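One difference is easy to miss here: the pandas rename keeps every other column, while a SQL SELECT returns only the columns it lists. The sketch below, with made-up data and a hypothetical table named `demo`, selects the second column explicitly so that both results match.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"A": [1, 2, 3], "C": ["x", "y", "z"]})

conn = sqlite3.connect(":memory:")
df.to_sql("demo", conn, index=False)

# pandas: without inplace=True, rename returns a new DataFrame.
renamed_pandas = df.rename(columns={"A": "B"})

# SQL: alias A as B and list C so it is not dropped from the result.
renamed_sql = pd.read_sql_query("SELECT A AS B, C FROM demo", conn)

print(list(renamed_pandas.columns))
print(list(renamed_sql.columns))
```

If you leave `C` out of the SELECT, the SQL result has only the single column `B`.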

Group By and Count

Python

df.groupby('column_name').size()

SQL

SELECT column_name, COUNT(*) FROM table GROUP BY column_name

Spark

df.groupby('column_name').count()
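The group-and-count results come back in slightly different shapes, which this sketch tries to show. The data, the `city` column, and the table name `demo` are all hypothetical, and sqlite3 stands in for a real SQL engine.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF", "NY", "LA"]})

conn = sqlite3.connect(":memory:")
df.to_sql("demo", conn, index=False)

# pandas: size() returns a Series indexed by the group key.
counts_pandas = df.groupby("city").size()

# SQL: one row per group, with the count in an aliased column.
counts_sql = pd.read_sql_query(
    "SELECT city, COUNT(*) AS n FROM demo GROUP BY city", conn
)

print(counts_pandas.to_dict())
print(dict(zip(counts_sql["city"], counts_sql["n"])))
```

In PySpark, `groupby(...).count()` returns a DataFrame with the group column plus a column named `count`, so it is closer in shape to the SQL result than to the pandas Series.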

Hope this post helped out some people who need to code in various languages :) Cheers!
