Spark and Presto: It Has Never Been Easier in the Cloud
In the internet era, "big data" has been a buzzword for many years. Hadoop, Spark… people have been talking about these tools for a long time. At the beginning, however, it was very hard for anyone to try these super-powerful tools:
- You need a cluster (a certain number of machines), and not everyone has multiple computers at hand to play with.
- Even with 4–5 computers ready, the configuration alone could scare you away.
- In the early days, you had to write your MapReduce jobs in Java ONLY. If you did not know Java, you would probably give up.
After a few years of cloud-technology advancement, these difficulties no longer exist. In this article, we will go through a small case of using a cloud platform (GCP) to start a Spark cluster and process a large dataset. With cloud technology, ETL on a big dataset with Spark becomes very easy. In addition, Spark supports multiple languages; we will use the Python version, i.e. PySpark.
Besides Spark, we will also show how to process the same dataset with Presto. For those who may not know it, Presto is an open-source distributed SQL query engine. Although Spark supports multiple languages like Python, SQL is the universal language for any data person!
We will use a public dataset, the New York Taxi Data, hosted in GCP BigQuery. We can easily export the data in CSV format to Cloud Storage with the bq command-line tool. Just type the following command in a terminal:
bq --location=us extract --destination_format=CSV \
  --field_delimiter=',' --print_header=false \
  "bigquery-public-data:new_york_taxi_trips.tlc_fhv_trips_2017" \
  gs://testing-2021/new_york_taxi_trips/csv/shard-*.csv
Cloud Storage naturally serves as the Hadoop Distributed File System (HDFS) for any big data tool. Now we can move on to creating our cluster.
Create Cluster for Spark and Presto
In the GCP console, select Dataproc; with just one click, you can easily start a cluster.
Make sure you select the configuration you want. The image version must be 1.4 or above to use Python 3. Since we will be running queries from a Jupyter notebook, make sure you check the Anaconda and Jupyter Notebook components.
After clicking the create button, your cluster will be set up!
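If you prefer the command line to the console, the same cluster can be created with gcloud. This is only a sketch under assumptions: the cluster name, region, and worker count below are placeholders you would replace with your own values.

```sh
# Sketch: create a Dataproc cluster with Anaconda, Jupyter and Presto.
# "my-spark-cluster" and "us-central1" are placeholder values.
gcloud dataproc clusters create my-spark-cluster \
  --region=us-central1 \
  --image-version=1.4 \
  --optional-components=ANACONDA,JUPYTER,PRESTO \
  --enable-component-gateway \
  --num-workers=2
```

The component gateway flag is what exposes the JupyterLab link used in the next step.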
Click the “JupyterLab” link to go to the notebook and start our big-data wrangling!
Using a notebook, you can easily execute Hive/Spark/Presto queries. You can certainly use the console or the gcloud Dataproc command lines instead, but a notebook works like an interactive IDE.
First, we use the notebook magic “%sql” to execute SQL directly in a cell. If you hit a missing-library error, just run “pip install [library name]” to fix it.
The Hive metastore lets Spark and Presto understand which tables/schemas they are querying, which makes it possible to run SQL on files sitting in Cloud Storage.
Spark - RDD
Spark supports several languages, and for data people Python always comes first! With PySpark, you can easily process data in Spark. One of the older ways to use Spark is the RDD (Resilient Distributed Dataset): you load data into an RDD and program your operations in a “map-reduce” style.
Only a handful of functions need to be understood, such as count(), collect(), map(), reduce(), filter()……
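If these names are new to you, here is the same map-reduce pattern on an ordinary Python list, no Spark required; the PySpark RDD methods of the same names behave analogously, just spread across distributed partitions. The sample records are made up for illustration.

```python
from functools import reduce

# A tiny stand-in for an RDD: a list of (pickup_borough, trip_miles) records.
trips = [("Bronx", 2.1), ("Queens", 5.4), ("Bronx", 1.3), ("Manhattan", 0.8)]

# map(): transform every record -- here, keep only the distance.
distances = list(map(lambda t: t[1], trips))

# filter(): keep records matching a predicate -- trips longer than 3 miles.
long_trips = list(filter(lambda d: d > 3.0, distances))

# reduce(): combine values pairwise into one result -- total distance.
total = reduce(lambda a, b: a + b, distances)

# count() / collect() in Spark roughly correspond to len() / materializing a list.
print(len(trips), long_trips, round(total, 1))  # 4 [5.4] 9.6
```

With a real RDD the calls chain the same way, e.g. rdd.map(...).filter(...).count().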
For details about RDDs, please refer to https://spark.apache.org/docs/latest/rdd-programming-guide.html
Spark - DataFrame
Programming RDDs in a “map-reduce” style is old-fashioned and not very user friendly, so Spark provides another abstraction called the DataFrame. If you have tried the famous Pandas library, DataFrames will feel very familiar. Since many data people know Python and its Pandas library, Spark offers this way of wrangling data too!
If you are a data person with a Python background, ETL on big data now becomes very easy: you wrangle the data much like you would with Pandas.
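To show how natural the transition feels, here is a toy aggregation in Pandas with the rough PySpark equivalent noted in comments. The data and column names are made up for this sketch; the real taxi dataset has different columns.

```python
import pandas as pd

# A toy slice of taxi data; real columns in the public dataset differ.
df = pd.DataFrame({
    "borough": ["Bronx", "Queens", "Bronx", "Manhattan"],
    "trip_miles": [2.1, 5.4, 1.3, 0.8],
})

# Pandas:            df[df["trip_miles"] > 1.0]
# PySpark (roughly): df.filter(df.trip_miles > 1.0)
long_trips = df[df["trip_miles"] > 1.0]

# Pandas:            groupby("borough")["trip_miles"].sum()
# PySpark (roughly): df.groupBy("borough").sum("trip_miles")
totals = long_trips.groupby("borough")["trip_miles"].sum()

print({k: round(v, 1) for k, v in totals.to_dict().items()})
# {'Bronx': 3.4, 'Queens': 5.4}
```

Filtering, grouping, and aggregating read almost line-for-line the same in both libraries; the difference is that Spark executes them across the cluster.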
What if you are not that familiar with Pandas or the DataFrame concept? No worries. Spark also lets you query data directly with SQL!
SQL is the default language of anyone with a data background. With Hive metadata describing the table schemas of the data in Cloud Storage, you can use Spark SQL to query the data directly, even though it is NOT in a database!
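To make this concrete without a cluster, here is the kind of statement you would hand to spark.sql(...); in Spark the files would first be registered as a table (via the Hive metastore or createOrReplaceTempView). Below, Python's built-in sqlite3 stands in just to show that it is plain SQL; the table and column names are illustrative.

```python
import sqlite3

# Stand-in table; on Dataproc, `taxi_trips` would be a Hive table
# backed by the CSV files in Cloud Storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxi_trips (borough TEXT, trip_miles REAL)")
conn.executemany(
    "INSERT INTO taxi_trips VALUES (?, ?)",
    [("Bronx", 2.1), ("Queens", 5.4), ("Bronx", 1.3)],
)

# The same statement could be passed to spark.sql(...) unchanged.
query = """
    SELECT borough, COUNT(*) AS trips, ROUND(SUM(trip_miles), 1) AS miles
    FROM taxi_trips
    GROUP BY borough
    ORDER BY borough
"""
for row in conn.execute(query):
    print(row)
# ('Bronx', 2, 3.4)
# ('Queens', 1, 5.4)
```

The query text itself is the portable part: Spark SQL and Presto will both run the same SELECT against the same Hive-registered table.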
For more information about DataFrames and SQL, please refer to https://spark.apache.org/docs/latest/sql-programming-guide.html
Spark SQL is not the ONLY tool that can run SQL directly against data in Cloud Storage. Presto is another famous one, and it was designed specifically for this job. As you will see, its performance here is better than Spark SQL's.
Performance - CSV, seriously?
The New York Taxi dataset is 15.62 GB with 192,092,698 rows. Presto can count all rows in ~15 seconds. But it can do much better, because CSV is NOT a good format for querying!
Let us use Parquet!
With SQL, you can easily convert the CSV table into a Parquet table. Parquet brings the advantages of a compressed, efficient columnar data representation to any project in the Hadoop ecosystem. It is NOT the only format that improves performance; there are also ORC, Avro…… feel free to test them as well.
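A minimal sketch of the conversion in Spark SQL, assuming the CSV-backed table is already registered in the Hive metastore; the table names are illustrative, not the real ones from this walkthrough.

```sql
-- Sketch: materialize the CSV-backed table as Parquet (Spark SQL syntax).
-- Table names are placeholders.
CREATE TABLE new_york_taxi_trips_parquet
USING PARQUET
AS SELECT * FROM new_york_taxi_trips_csv;

-- The same query then runs against the columnar Parquet table.
SELECT COUNT(*) FROM new_york_taxi_trips_parquet;
```

Because Parquet stores data column by column with compression, a COUNT or a query touching few columns reads far less data than it would from CSV.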
If we repeat the same query, Parquet saves us 70% of the query time!
What have we learnt so far?
- Big data is much easier nowadays with cloud technology. You do not need to manage the cluster yourself.
- Spark has evolved to a point where it is very friendly to data people from different backgrounds, even those who know SQL only.
- There are other great tools like Presto that make the big-data processing ecosystem even better!
- Try better data formats for performance: Parquet, ORC, Avro……
I hope you find that ETL with big data in the cloud is NOT that difficult, and that you give it a try to draw some good insights out of this BIG DATA world! Thank you all!