# SparkSQL.jl
Submits *Structured Query Language* (SQL), *Data Manipulation Language* (DML), and *Data Definition Language* (DDL) statements to Apache Spark.
Provides functions to move data from Spark into Julia DataFrames and from Julia DataFrames into Spark.

### Use Case
Apache Spark is one of the world's most popular open-source big data processing engines. Spark supports programming in Java, Scala, Python, SQL, and R.
This package enables Julia programs to use Apache Spark for structured data processing with SQL.

The design goal of this package is to enable Julia-centric programming of Apache Spark using just SQL. There are only eight functions, so there is no need to use Java, Scala, Python, or R: you can work with Spark data entirely from within Julia.
The SparkSQL.jl package uses the Dataset API internally, giving Julia users the performance benefits of Spark's Catalyst optimizer and Tungsten execution engine. The earlier Spark RDD API is not supported.

This package is for structured and semi-structured data in Data Lakes and Lakehouses (such as Delta Lake), on premises and in the cloud.

# Available Functions
Use `?` in the Julia REPL to see help for each function. A short workflow sketch follows this list.
- `initJVM`: initializes the Java Virtual Machine (JVM) in Julia.
- `SparkSession`: submits an application to the Apache Spark cluster, with config options.
- `sql`: submits SQL, DDL, and DML statements to Spark.
- `cache`: caches a Spark Dataset in memory.
- `createOrReplaceTempView`: creates a temporary view that lasts for the duration of the session.
- `createGlobalTempView`: creates a temporary view that lasts for the duration of the application.
- `toJuliaDF`: moves Spark data into a Julia DataFrame.
- `toSparkDS`: moves Julia DataFrame data into a Spark Dataset.

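The sketch below strings these functions together, highlighting `cache` and `createGlobalTempView`, which the Quick Start does not cover. The cluster URL, file path, and view name are hypothetical; note that global temporary views are read back through Spark's `global_temp` database.
```
using SparkSQL, DataFrames

initJVM()
session = SparkSession("spark://example.com:7077", "Sketch App")

ds = sql(session, "SELECT * FROM PARQUET.`/pathToFile/fileName.parquet`;")
cache(ds)                          # pin the Dataset in cluster memory for reuse
createGlobalTempView(ds, "sales")  # visible for the lifetime of the application

df = toJuliaDF(sql(session, "SELECT * FROM global_temp.sales;"))
```
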
# Quick Start
### Install and Setup
Download Apache Spark 3.1.1 or later and set the environment variables for the Spark and Java home directories:
```
export SPARK_HOME=/path/to/apache/spark
export JAVA_HOME=/path/to/java
```
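If you prefer to keep configuration in Julia, the same variables can also be set through `ENV`; a sketch, assuming SparkSQL.jl reads them when `initJVM()` runs:
```
# Hypothetical paths; set these before calling initJVM()
ENV["SPARK_HOME"] = "/path/to/apache/spark"
ENV["JAVA_HOME"]  = "/path/to/java"
```
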
### Usage
Start Julia with `JULIA_COPY_STACKS=yes`, which is required for JVM interoperability:
```
JULIA_COPY_STACKS=yes julia
```
In Julia, load the DataFrames package alongside SparkSQL. Also load the Dates and Decimals packages if your Spark data contains dates or decimal numbers.
```
using SparkSQL, DataFrames, Dates, Decimals
```
Initialize the JVM and start the Spark session:
```
initJVM()
sparkSession = SparkSession("spark://example.com:7077", "Julia SparkSQL Example App")
```
Query data from Spark and load it into a Julia DataFrame.
```
stmt = sql(sparkSession, "SELECT _c0 AS columnName1, _c1 AS columnName2 FROM CSV.`/pathToFile/fileName.csv`")
createOrReplaceTempView(stmt, "TempViewName")
sqlQuery = sql(sparkSession, "SELECT columnName1, columnName2 FROM TempViewName;")
juliaDataFrame = toJuliaDF(sqlQuery)
describe(juliaDataFrame)
```
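Once the data is in a Julia DataFrame, ordinary DataFrames.jl operations apply; a brief sketch, reusing the column aliases from the query above:
```
# Drop rows where columnName1 is missing, then preview the first five rows
filteredDF = filter(row -> !ismissing(row.columnName1), juliaDataFrame)
first(filteredDF, 5)
```
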
Move Julia DataFrame data into an Apache Spark Dataset.
```
sparkDataset = toSparkDS(sparkSession, juliaDataFrame, ",")
createOrReplaceTempView(sparkDataset, "tempTable")
```
The resulting Dataset is a single column of delimited strings named `value`. To generate columns, use the Spark SQL `split` function:
```
sqlQuery = sql(sparkSession, "SELECT split(value, ',')[0] AS columnName1, split(value, ',')[1] AS columnName2 FROM tempTable")
```
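Note that `split` produces string columns. To restore numeric types, add a CAST; the target types here are assumptions for illustration:
```
typedQuery = sql(sparkSession, "SELECT CAST(split(value, ',')[0] AS INT) AS columnName1, CAST(split(value, ',')[1] AS DOUBLE) AS columnName2 FROM tempTable")
```
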

# Spark Data Sources
Supported data sources include:
- File formats: CSV, JSON, Arrow, Parquet
- Data Lakes: Hive, ORC, Avro
- Data Lakehouses: Delta Lake, Apache Iceberg
- Cloud object stores: S3, Azure Blob Storage, Swift Object Storage

## Data Source Examples

### CSV file example
Comma-Separated Values (CSV) format:
```
stmt = sql(session, "SELECT * FROM CSV.`/pathToFile/fileName.csv`;")
```
### Parquet file example
Apache Parquet format:
```
stmt = sql(session, "SELECT * FROM PARQUET.`/pathToFile/fileName.parquet`;")
```
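JSON files can be queried with the same pattern; a sketch, assuming a JSON file at a hypothetical path:
```
stmt = sql(session, "SELECT * FROM JSON.`/pathToFile/fileName.json`;")
```
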
### Delta Lake example
Delta Lake is an open-source storage layer for Spark. Delta Lake offers:

- ACID transactions on Spark: serializable isolation levels ensure that readers never see inconsistent data.
- Scalable metadata handling: leverages Spark's distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.

This example shows CREATE TABLE (DDL), INSERT (DML), and SELECT (SQL) statements using Delta Lake and SparkSQL:
```
sql(session, "CREATE DATABASE demo;")
sql(session, "USE demo;")
sql(session, "CREATE TABLE tb(col STRING) USING DELTA;")
sql(session, "INSERT INTO tb VALUES ('example row');")
stmt = sql(session, "SELECT col FROM tb;")
```
The Delta Lake feature requires adding the Delta Lake jar (matching your Spark and Scala versions) to the `$SPARK_HOME/jars` folder.