Translating the Same Code into Python, SQL, and PySpark
When I analyze data, I sometimes have to use Python and at other times SQL or Spark. When I’m dealing with large amounts of data in Apache Hive, I use Spark and SQL (or HQL); when I just need to import and analyze a CSV file, I use Python (with pandas, so df in the Python examples is a pandas DataFrame).
That being said, I sometimes get mixed up when I’m coding in one language and forget how to write a simple line of code that I know perfectly well in another. I think other data analysts have run into this too, so this post lines up the same instructions in each language.
For Spark I normally use the PySpark library, so my Spark examples are written in PySpark. The examples that use col assume it has been imported with from pyspark.sql.functions import col.
Show First 5 Rows of Data
Python
df.head()
SQL
SELECT * FROM table LIMIT 5;
Spark
df.show(5)
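To try the SQL version without a Hive table at hand, here is a minimal sketch using Python's built-in sqlite3 module. The table name demo and its columns are made up for illustration, and SQLite's dialect is not HQL, although LIMIT should behave the same way in a query this simple.

```python
import sqlite3

# In-memory database with a hypothetical table named "demo"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(i, f"row_{i}") for i in range(1, 11)],  # ten sample rows
)

# Same idea as df.head() or df.show(5): return at most five rows
first_five = conn.execute("SELECT * FROM demo LIMIT 5").fetchall()
print(first_five)
conn.close()
```

Without an ORDER BY, SQL does not promise which five rows come back, so treat this as a quick peek rather than a stable sample.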
Sorting Values
- These examples sort in descending order
Python
df.sort_values('column_name', ascending = False)
SQL
SELECT * FROM table ORDER BY column_name DESC
Spark
df.sort(col('column_name'), ascending = False)
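As a quick sanity check of the descending order, the SQL clause can be run against a tiny throwaway table with sqlite3. This is only a sketch: the demo table, its name and score columns, and the three rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("a", 3), ("b", 9), ("c", 5)],  # hypothetical sample data
)

# ORDER BY ... DESC puts the largest score first
ordered = conn.execute(
    "SELECT name, score FROM demo ORDER BY score DESC"
).fetchall()
print(ordered)
conn.close()
```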
Counting Number of Rows
Python
len(df)
SQL
SELECT COUNT(*) FROM table
Spark
df.count()
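One difference worth seeing in practice: len(df) and df.count() hand back a plain number, while the SQL query returns a result set with one row and one column. A small sketch with sqlite3, assuming a made-up demo table with seven rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER)")
conn.executemany("INSERT INTO demo VALUES (?)", [(i,) for i in range(7)])

# COUNT(*) comes back as a single row with a single column,
# so take the first (only) value out of the fetched row
row_count = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(row_count)
conn.close()
```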
Renaming Columns
- Changing the name of column ‘A’ to ‘B’
Python
df.rename(columns={'A':'B'}, inplace = True)
SQL
SELECT A AS B FROM table
Spark
df.select(col('A').alias('B'))
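The SQL alias only renames the column in the query result, not in the table itself. A rough sketch with sqlite3 makes that visible; the demo table and its second column other are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (A INTEGER, other TEXT)")
conn.execute("INSERT INTO demo VALUES (1, 'x')")

cursor = conn.execute("SELECT A AS B FROM demo")
# cursor.description holds one 7-item tuple per result column;
# the first item of each tuple is the column name
column_names = [d[0] for d in cursor.description]
print(column_names)
conn.close()
```

Only the aliased column comes back here, and the same is true of the select(...alias...) form in PySpark. The pandas rename keeps every other column, so if you want that behavior in Spark, df.withColumnRenamed('A', 'B') is the closer match.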
Group By and Count
Python
df.groupby('column_name').size()
SQL
SELECT column_name, COUNT(*) FROM table GROUP BY column_name
Spark
df.groupby('column_name').count()
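To see what the grouped count looks like on real rows, here is a small sqlite3 sketch. The demo table, the category column, and the four sample values are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (category TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?)",
    [("x",), ("y",), ("x",), ("x",)],  # hypothetical sample data
)

# One output row per distinct category, paired with its row count
rows = conn.execute(
    "SELECT category, COUNT(*) FROM demo GROUP BY category"
).fetchall()
counts = dict(rows)  # turn (category, count) pairs into a lookup
print(counts)
conn.close()
```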
Hope this post helped out some people who need to code in various languages :) Cheers!