A simple example of using Spark in OCI Data Flow with Python (PySpark)

Vishal Savji
4 min read · Sep 9, 2020

Apache Spark: an open-source, distributed, general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

Oracle Data Flow: Oracle Cloud Infrastructure (OCI) Data Flow is a service that lets you run any Apache Spark™ application at any scale with no infrastructure to deploy or manage. Data Flow makes running Spark applications easy, repeatable, secure, and simple to share across the enterprise.

1. Fully managed Spark service with near-zero administrative overhead.

2. Import/run existing Spark apps from EMR, Databricks, or Hadoop.

3. Create Apache Spark applications using SQL, Python, Java, or Scala.

4. Pay-as-you-go, elastic big data service.

Object Store -> Data Flow -> Object Store

5 Simple Steps to Set Up and Run a Data Flow Application on OCI

OCI Data Flow Application

Before you can create, manage, and run applications in Data Flow, the tenant administrator must create specific storage buckets. These setup steps are required in Object Store for Data Flow to function.

In the Oracle Cloud Infrastructure Console, create two storage buckets in Object Store for the following:

  1. DATA FLOW LOGS

Data Flow requires a bucket to store the logs (both standard out and standard err) for every application run. Create a standard storage tier bucket called dataflow-logs in the Object Store service. The location of the bucket must follow the pattern oci://dataflow-logs@<Object_Store_Namespace>/.

2. DATA FLOW WAREHOUSE

Data Flow requires a data warehouse for Spark SQL applications. Create a standard storage tier bucket called dataflow-warehouse in the Object Store service. The location of the warehouse must follow the pattern oci://dataflow-warehouse@<Object_Store_Namespace>/.

Once created, the buckets appear in the Object Storage console.
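If you prefer to script this setup, both buckets can also be created with the OCI Python SDK. The sketch below is a minimal example that assumes a valid OCI config file at the default location; the compartment OCID is a placeholder you must replace.

```python
import oci

# Load credentials from the default OCI config file (~/.oci/config).
config = oci.config.from_file()

# Placeholder: replace with the OCID of the compartment that should own the buckets.
COMPARTMENT_ID = "ocid1.compartment.oc1..<unique_id>"

object_storage = oci.object_storage.ObjectStorageClient(config)
namespace = object_storage.get_namespace().data

for bucket_name in ("dataflow-logs", "dataflow-warehouse"):
    details = oci.object_storage.models.CreateBucketDetails(
        name=bucket_name,
        compartment_id=COMPARTMENT_ID,
        storage_tier="Standard",
    )
    object_storage.create_bucket(namespace, details)
    print(f"Created bucket oci://{bucket_name}@{namespace}/")
```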

Example: Converting a CSV file to Parquet with Python (PySpark)

Step 1: Create a new bucket named EXAMPLE in Object Storage.

Step 2: Upload the CustomerValueAnalysis.csv and PySpark Example.py files to the EXAMPLE Object Storage bucket.
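For reference, a minimal sketch of what Example.py might look like is shown below. The EXAMPLE bucket and CustomerValueAnalysis.csv names come from this walkthrough; the Object Storage namespace and the Parquet output name are placeholders, so the original Example.py may differ.

```python
from pyspark.sql import SparkSession

# Placeholder: replace with your tenancy's Object Storage namespace.
NAMESPACE = "<your_object_store_namespace>"

# Source CSV and destination Parquet paths in the EXAMPLE bucket (output name is hypothetical).
SRC = f"oci://EXAMPLE@{NAMESPACE}/CustomerValueAnalysis.csv"
DST = f"oci://EXAMPLE@{NAMESPACE}/CustomerValueAnalysis.parquet"

if __name__ == "__main__":
    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # Read the CSV with a header row and let Spark infer the column types.
    df = spark.read.option("header", "true").option("inferSchema", "true").csv(SRC)

    # Write the same data out in the columnar, compressed Parquet format.
    df.write.mode("overwrite").parquet(DST)

    spark.stop()
```

Data Flow resolves oci://bucket@namespace/ paths out of the box, so the script needs no extra Hadoop connector configuration.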

Step 3: Create a new PYTHON Data Flow application and give it a name.

Step 3 (cont.): Scroll down to Resource Configuration. You can leave all of these values at their defaults.

Step 3 (cont.): FILE URL is the location of the PySpark file in Object Storage.

Step 3 (cont.): Double-check your application configuration, then confirm and create it.
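The same application can also be defined with the OCI Python SDK rather than the console. The sketch below is illustrative only: the compartment OCID, namespace, Spark version, and VM shapes are placeholders, and the CreateApplicationDetails field names should be checked against the current oci.data_flow SDK documentation.

```python
import oci

config = oci.config.from_file()
data_flow = oci.data_flow.DataFlowClient(config)

# Placeholders: supply your own compartment OCID and Object Storage namespace.
COMPARTMENT_ID = "ocid1.compartment.oc1..<unique_id>"
NAMESPACE = "<your_object_store_namespace>"

details = oci.data_flow.models.CreateApplicationDetails(
    compartment_id=COMPARTMENT_ID,
    display_name="csv-to-parquet",
    language="PYTHON",
    spark_version="2.4.4",                             # placeholder version
    file_uri=f"oci://EXAMPLE@{NAMESPACE}/Example.py",  # the uploaded PySpark script
    driver_shape="VM.Standard2.1",                     # placeholder shapes
    executor_shape="VM.Standard2.1",
    num_executors=1,
    logs_bucket_uri=f"oci://dataflow-logs@{NAMESPACE}/",
    warehouse_bucket_uri=f"oci://dataflow-warehouse@{NAMESPACE}/",
)

application = data_flow.create_application(details).data
print("Created application:", application.id)
```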

Step 4: Run the PySpark application. The 2 MB CSV file is converted into a 365 KB Parquet file.
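Runs can also be submitted programmatically, which is handy for automation or scheduling. A small sketch with the OCI Python SDK follows; the OCIDs are placeholders, and the CreateRunDetails fields should be verified against the SDK documentation.

```python
import oci

config = oci.config.from_file()
data_flow = oci.data_flow.DataFlowClient(config)

# Placeholders: use the OCIDs of your compartment and of the application created above.
COMPARTMENT_ID = "ocid1.compartment.oc1..<unique_id>"
APPLICATION_ID = "ocid1.dataflowapplication.oc1..<unique_id>"

run_details = oci.data_flow.models.CreateRunDetails(
    compartment_id=COMPARTMENT_ID,
    application_id=APPLICATION_ID,
    display_name="csv-to-parquet-run",
)

run = data_flow.create_run(run_details).data
print("Run submitted:", run.id, run.lifecycle_state)
```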

Step 5: Verify that the new Parquet file has been created in the EXAMPLE Object Storage bucket.
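Besides checking the console, a quick way to confirm the output is to list the bucket contents with the OCI Python SDK. The sketch below assumes the EXAMPLE bucket and the hypothetical CustomerValueAnalysis.parquet output prefix used earlier; Spark writes Parquet as a directory of part files, so expect several objects under that prefix.

```python
import oci

config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)
namespace = object_storage.get_namespace().data

# List the Parquet output written by the run.
listing = object_storage.list_objects(
    namespace,
    "EXAMPLE",
    prefix="CustomerValueAnalysis.parquet",
    fields="name,size",
)

for obj in listing.data.objects:
    print(obj.name, obj.size)
```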

Debugging a Data Flow Application

If the Data Flow application fails, debug it directly in the Spark UI environment.

What we have accomplished:

The Parquet format is an optimized binary format supporting efficient reads, making it ideal for reporting and analytics.

In this exercise, we took the source data CustomerValueAnalysis.csv, converted it to Parquet, and wrote the result back to Object Storage, ready for further analysis.

Files used: CustomerValueAnalysis.csv and Example.py (PySpark code plus the source and destination file locations in Object Store).

Thanks for reading!
