Spark, Presto, it has never been easier with Cloud

Photo by Jez Timms on Unsplash

In the world of internet, “big data” has been a buzz word for many years. Hadoop, Spark… People have been talking about these terms for quite a long time. However, at the beginning, it is very hard for everyone to have a try on these super power tools.

  1. You need a cluster(a certain number of computers), not everyone has multiple computers at hand to play with.
  2. Even if you have 4–5 computers ready. Configuration alone will scare you away already.
  3. At the very beginning, you have to write your map-reduce jobs, and in Java ONLY. If you do not know Java, you probably will give up.

After a few years of cloud technology advancement, the above difficulties no longer exist. In this article, we will go through a small case of using a cloud platform(GCP) to start a Spark cluster to process a big volume dataset. With cloud technology, ETL for big volume dataset with Spark becomes very easy. In addition, Spark supports multiple different languages, we will be using Python version in this case, e.g. PySpark.

Other than Spark, we also will show how to process the same dataset with Presto. For those might not know about it, Presto is an open source distributed SQL query engine. Although Spark supports multiple languages like Python, SQL is the universal language for any data guy!

Prepare Dataset

We will be using a public dataset of New York Taxi Data. It is located in GCP Big Query. We can easily pull the data in CSV format to Cloud Storage. Just use the GCP gcloud command-line tool to type the following command in a terminal.

bq --location=us extract --destination_format=CSV \--field_delimiter=',' --print_header=false \"bigquery-public-data:new_york_taxi_trips.tlc_fhv_trips_2017" \gs://testing-2021/new_york_taxi_trips/csv/shard-*.csv
Dataset created in Cloud Storage in CSV format

Cloud Storage will naturally be the Hadoop Distributed File System ( HDFS ) for any big data tools. Now we can move on to create our cluster.

Create Cluster for Spark and Presto

In the GCP console, you can select DataProc, just one click button, you can easily start a cluster.

Create a cluster in DataProc

Make sure you select the configuration you want. Version must be above 1.4 to use Python3. Since we will be showing the query via Jupyter Notebook, make sure you choose Anaconda and Jupyter Notebook checkboxes.

Choose ideal configuration you want

After clicking create button, you shall have set up your cluster!

Cluster is created

Click the “JupyterLab” link, we can go to the notebook and start our big data wrangling!

Querying Data

Set up file location and Click “PySpark”

Using a notebook, you can easily execute Hive/Spark/Presto queries. You certainly can use console or GCP DataProc command lines to do it, but with a notebook, it is like an interactive IDE.

Create Hive Metadata table

First, we use the notebook feature “%sql” to easily execute SQL in the cell. If you encounter errors, just use “PIP INSTALL [library name]” to fix the problem.

Hive metadata let Spark and Presto understand what tables/schemas they are querying against, which makes it possible to SQL on files in Cloud Storage.

Spark - RDD

Use PySpark RDD to count row number

Spark supports different languages, for data guys, Python always comes first! With PySpark, you can easily process data with Spark. One of the old ways to do Spark is called RDD(Resilient Distributed Datasets). You load data into RDD, and program your operations in “map-reduce” ways.

There are just a few functions needed to be understood. Such as: count(), collect(), map(), reduce(), filter() ……

For details about RDD, please refer to

Spark - DataFrame

Use Spark DataFrame to count row number

RDD and program in “map-reduce” is very old-fashion and not user friendly. Spark has provide another data type called DataFrame. If you have tried the famous Pandas library, you will be very familiar with DataFrame. Since lots of data people know about Python and its library Pandas, Spark also provide this way to wrangling data!

If you are a data guy with Python background, ETL with big data right now becomes very easy! Since you just wrangle the data like you did with Pandas, very similar.

Spark SQL

Use Spark SQL to count row number

What if you are not that familiar with Pandas or DataFrame concept? No worries. Spark now provides SQL to query data directly!

SQL is the default language of anyone with data background. With Hive metadata to describe the table schemas of the data in Cloud Storage. You can use Spark SQL directly query the data, even it is NOT in a database!

For more information about DataFrame and SQL, please refer


Use Presto to count row number

Spark SQL is not the ONLY one can directly use SQL to query data in Cloud Storage. Presto is another famous tool to do so, and it is intentionally designed to do such job. You can see the performance is better than Spark SQL.

Performance - CSV, seriously?

The New York Taxi dataset is 15.62GB with 192,092,698 rows. Presto can count all rows in ~15 seconds. But it can be much better, since CSV is NOT a good format for querying!

Let us use Parquet!

Convert CSV to Parquet

With SQL, you can easily convert the CSV table to be Parquet table. Parquet is to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem. It is NOT the only format which can make performance better, there are ORC, Avro…… Feel free to test them as well

Parquet get better performance

If we repeat the same query, Parquet saves us 70% of query time!

What have we learnt so far?

  1. Big data is way easier nowadays with Cloud Technology. You do not need to manage the cluster by yourself.
  2. Spark has evolved to a status which is very user friendly to data people with different background, even if SQL only.
  3. There are other great tools like Presto, which makes the big data processing eco-system even better!
  4. Try to use better data formats for performance, Parquet, ORC, Avro ……

Hope you find ETL with big data in cloud world is NOT that difficult and give it a try to make some good insight out of this BIG DATA world! Thank you all!




Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Deploying an Angular Front-end with a Django Rest Framework Back-end on Google App engine

Introduction to CI/CD Pipeline

Working with Symbolic Expressions

How to Secure Serverless Architectures

Connect to Different Cassandra Keyspaces in Spring Boot

Completing the Lambda School Program

SRE / DevOps / Kubernetes Weekly Collection#89(Week 41, 2021)

Complexity can be learned but abstractions come at a cost

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store


More from Medium

Big Data and Hadoop

Apache Spark 3.0 Exciting Capabilities

Associate Developer for Apache Spark Certification

Getting Started with Apache Spark on Databricks