# SparkSQL.jl
Submits *Structured Query Language* (SQL), *Data Manipulation Language* (DML), and *Data Definition Language* (DDL) statements to Apache Spark.
Has functions to move data from Spark into Julia DataFrames and Julia DataFrame data into Spark.

### Use Case
Apache Spark is one of the world's most popular open-source big data processing engines. Spark supports programming in Java, Scala, Python, SQL, and R.
This package enables Julia programs to use Apache Spark for structured data processing with SQL.

The design goal of this package is to enable Julia-centric programming on Apache Spark using just SQL. There are only 8 functions; there is no need to use Java, Scala, Python, or R. Work with Spark data entirely from within Julia.
The SparkSQL.jl package uses the Dataset API internally, giving Julia users the performance benefits of Spark's Catalyst optimizer and Tungsten execution engine. The earlier Spark RDD API is not supported.

This package is for structured and semi-structured data in Data Lakes and Lakehouses (Delta Lake), on premises and in the cloud.

# Available Functions
Use `?` in the Julia REPL to see help for each function (see the example after this list).

- `initJVM`: initializes the Java Virtual Machine (JVM) in Julia.
- `SparkSession`: submits an application to the Apache Spark cluster with configuration options.
- `sql`: submits SQL, DDL, and DML statements to Spark.
- `cache`: caches a Spark Dataset in memory.
- `createOrReplaceTempView`: creates a temporary view that lasts for the duration of the session.
- `createGlobalTempView`: creates a temporary view that lasts for the duration of the application.
- `toJuliaDF`: moves Spark data into a Julia DataFrame.
- `toSparkDS`: moves Julia DataFrame data into a Spark Dataset.
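
For example, Julia's built-in help mode (press `?` at the REPL prompt) displays the documentation for any of these functions:
```
julia> using SparkSQL

help?> toJuliaDF
```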
# Quick Start
### Install and Setup
Download Apache Spark 3.1.1 or later and set the environment variables for Spark and Java home:
```
export SPARK_HOME=/path/to/apache/spark
export JAVA_HOME=/path/to/java
```
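
If the Spark session fails to start later on, one quick check is to confirm that these variables are visible to the Julia process. This is a minimal sketch using Julia's standard `ENV` dictionary; it is not part of the SparkSQL.jl API:
```
# Print the environment variables the Julia process actually sees
println(get(ENV, "SPARK_HOME", "SPARK_HOME is not set"))
println(get(ENV, "JAVA_HOME", "JAVA_HOME is not set"))
```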
### Usage
Start Julia with `JULIA_COPY_STACKS=yes`, which is required for JVM interop:
```
JULIA_COPY_STACKS=yes julia
```
In Julia, load the DataFrames package. Also load the Dates and Decimals packages if your Spark data contains dates or decimal numbers.
```
using SparkSQL, DataFrames, Dates, Decimals
```
Initialize the JVM and start the Spark session:
```
initJVM()
sparkSession = SparkSession("spark://example.com:7077", "Julia SparkSQL Example App")
```
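
For a quick test without a cluster, a minimal sketch using a Spark local master URL in place of the cluster address; `local[2]` is standard Spark master-URL syntax, assuming `SparkSession` passes it through to Spark unchanged:
```
initJVM()
sparkSession = SparkSession("local[2]", "Julia SparkSQL Local Test")
```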
Query data from Spark and load it into a Julia DataFrame:
```
stmt = sql(sparkSession, "SELECT _c0 AS columnName1, _c1 AS columnName2 FROM CSV.`/pathToFile/fileName.csv`")
createOrReplaceTempView(stmt, "TempViewName")
sqlQuery = sql(sparkSession, "SELECT columnName1, columnName2 FROM TempViewName;")
juliaDataFrame = toJuliaDF(sqlQuery)
describe(juliaDataFrame)
```
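
Once the data is in a Julia DataFrame, the usual DataFrames.jl functions apply. A small illustrative sketch, reusing the column names from the query above:
```
nrow(juliaDataFrame)                   # number of rows returned from Spark
first(juliaDataFrame, 5)               # preview the first five rows
select(juliaDataFrame, :columnName1)   # keep a single column
```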
Move Julia DataFrame data into an Apache Spark Dataset:
```
sparkDataset = toSparkDS(sparkSession, juliaDataFrame, ",")
createOrReplaceTempView(sparkDataset, "tempTable")
```
The Dataset is a delimited string. To split it into columns, use the Spark SQL `split` function:
```
sqlQuery = sql(sparkSession, "SELECT split(value, ',')[0] AS columnName1, split(value, ',')[1] AS columnName2 FROM tempTable")
```
# Spark Data Sources
Supported data sources include:
- File formats: CSV, JSON, Arrow, Parquet
- Data Lakes: Hive, ORC, Avro
- Data Lakehouses: Delta Lake, Apache Iceberg
- Cloud object stores: S3, Azure Blob Storage, Swift object storage

## Data Source Examples
### CSV file example:
Comma Separated Value (CSV) format.
```
stmt = sql(session, "SELECT * FROM CSV.`/pathToFile/fileName.csv`;")
```
### Parquet file example:
Apache Parquet format.
```
stmt = sql(session, "SELECT * FROM PARQUET.`/pathToFile/fileName.parquet`;")
```
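
The other file formats listed above follow the same pattern. For instance, a JSON file could be queried directly in the same way (the path is a placeholder):
```
stmt = sql(session, "SELECT * FROM JSON.`/pathToFile/fileName.json`;")
```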
### Delta Lake Example:
Delta Lake is an open-source storage layer for Spark. Delta Lake offers:

- ACID transactions on Spark: serializable isolation levels ensure that readers never see inconsistent data.
- Scalable metadata handling: leverages Spark's distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.

The example shows create database and create table (DDL) statements using Delta Lake and SparkSQL; insert (DML) and select (SQL) statements follow the same pattern (see the sketch after this example):
```
sql(session, "CREATE DATABASE demo;")
sql(session, "USE demo;")
sql(session, "CREATE TABLE tb(col STRING) USING DELTA;")
```
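
A minimal sketch of the matching insert (DML) and select (SQL) statements, assuming the `demo.tb` table created above:
```
sql(session, "INSERT INTO tb VALUES ('hello Delta Lake');")
sqlQuery = sql(session, "SELECT col FROM tb;")
juliaDataFrame = toJuliaDF(sqlQuery)
```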
The Delta Lake feature requires adding the Delta Lake jar to the Spark jars folder.
