
Commit 1c7b0ff: Add Spark 2.2.0 example
1 parent: 2df44e1

21 files changed, +593, -409 lines

.travis.yml (+1)

@@ -15,6 +15,7 @@ cache:
 
 script:
 - ./sbt test
+- ./sbt it:test
 
 # Trick to avoid unnecessary cache updates
 - find $HOME/.sbt -name "*.lock" | xargs rm
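The new `it:test` build step relies on sbt's IntegrationTest configuration, which lives in project/Testing.scala (only its new license header appears in this commit). A minimal sketch of the kind of wiring involved, not the project's actual code:

```scala
// Hypothetical build.sbt excerpt: wire up sbt's built-in IntegrationTest
// configuration so `./sbt it:test` picks up tests under src/it/scala.
lazy val root = (project in file("."))
  .configs(IntegrationTest)       // register the `it` configuration
  .settings(Defaults.itSettings)  // add its default source layout and tasks
```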

README.md (+57, -3)

@@ -1,10 +1,63 @@
-Scala Boilerplate
-=================
+Spark Scala Template
+====================
 
 [![Build Status](https://travis-ci.org/dbast/spark-scala-template.svg?branch=master)](https://travis-ci.org/dbast/spark-scala-template)
 
 Assortment of default settings, best practices, and general goodies for Scala projects.
 
+Layers
+------
+
+There are three layers:
+* The reusable shared DAG steps, implemented via the (SQL-like) Dataset API
+* The user-defined functions used by the steps
+* A command with a CLI that builds a whole DAG by combining steps
+
+The CLI
+-------
+
+```
+  -d, --debug                Explain the plan during DAG construction
+  -i, --input  <arg>         Path to the raw data to process (local, hdfs, s3)
+  -l, --limit  <arg>         Limit DAG steps to the given number; the read and
+                             write/show steps are always added
+      --lines-to-show <arg>  Number of lines to show on the console (instead
+                             of writing snappy-compressed Parquet files)
+  -n, --nodes  <arg>         Spark nodes (local run) (default = local[*])
+  -o, --output <arg>         Output path (local, hdfs, s3)
+  -s, --stop                 Stop the SparkSession after processing /
+                             exception (for cleanup during debugging)
+      --help                 Show help message
+```
+
+Testing
+-------
+
+There are several ways to test the code:
+* With a local in-memory cluster and the included data, e.g.:
+  * A whole DAG: `./sbt -i data/*.gz -o test`
+  * A reduced DAG: `./sbt -i data/*.gz -o test --limit 1 --lines 2`
+* Sherlock Holmes mode: start the Scala console via `./sbt console`; a
+  preconfigured SparkSession named `spark` is available and all the Steps are imported.
+* Execute the automated tests via `./sbt test` and `./sbt it:test`, or with a
+  coverage report via `./sbt coverage test coverageReport`. The report can be found
+  in `target/scala-2.11/scoverage-report`.
+* Submit a Spark job with any desired CLI arguments on a running cluster, or
+  import the code into Apache Zeppelin and use it there.
+
+Releasing
+---------
+
+Versioning is done via annotated Git tags; see [git-describe](https://git-scm.com/docs/git-describe) and [Git-Basics-Tagging](https://git-scm.com/book/en/v2/Git-Basics-Tagging) for annotated tags.
+
+Tags have to be pushed via `git push --tags`.
+
+This requires a full repo clone on the continuous integration machine (no shallow clone).
+
+The benefit of git-describe-based versioning is that every commit gets a distinct,
+automatic version, which also facilitates continuous integration and delivery.
+
 sbt bash wrapper
 ----------------
 
@@ -82,7 +135,8 @@ Fills apiMappings for common Scala libraries during `doc` task.
 License
 -------
 
-Copyright 2011-2016 Marconi Lanna
+Copyright 2011-2016 Marconi Lanna
+Copyright 2017 Daniel Bast
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
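The CLI help text above matches the Scallop parser the build adds as a dependency below (`org.rogach %% scallop`). The parser class itself is not part of this diff; a minimal sketch of how such options could be declared with Scallop, with the class name and exact signatures being assumptions:

```scala
import org.rogach.scallop._

// Hypothetical CLI declaration mirroring the help text above; the real
// class in the repository is not shown in this commit.
class CliConf(arguments: Seq[String]) extends ScallopConf(arguments) {
  val debug  = opt[Boolean](descr = "Explain the plan during DAG construction")
  val input  = opt[String](descr = "Path to the raw data to process (local, hdfs, s3)")
  val limit  = opt[Int](descr = "Limit DAG steps to the given number")
  // camelCase val names map to hyphenated long options: --lines-to-show
  val linesToShow = opt[Int](noshort = true, descr = "Number of lines to show on the console")
  val nodes  = opt[String](default = Some("local[*]"), descr = "Spark nodes (local run)")
  val output = opt[String](descr = "Output path (local, hdfs, s3)")
  val stop   = opt[Boolean](descr = "Stop the SparkSession after processing / exception")
  verify()  // Scallop requires an explicit verify() after declaring options
}
```

A command would then read parsed values as e.g. `conf.input()`, with `--help` generated automatically.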

build.sbt (+55, -45)

@@ -1,5 +1,6 @@
 /*
  * Copyright 2011-2016 Marconi Lanna
+ * Copyright 2017 Daniel Bast
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.

@@ -20,6 +21,8 @@
 
 name := "PROJECT"
 
+// enable versioning based on tags, see https://git-scm.com/docs/git-describe
+// requires a full repo clone on the continuous integration machine (not a shallow clone)
 enablePlugins(GitVersioning)
 git.useGitDescribe := true
 
@@ -36,7 +39,8 @@ licenses += "Apache-2.0" -> url("http://www.apache.org/licenses/LICENSE-2.0.html
 /*
  * scalac configuration
  */
-
+// Use the same Scala version Spark is built with, see scala.version in
+// https://github.com/apache/spark/blob/master/pom.xml
 scalaVersion in ThisBuild := "2.11.8"
 
 compileOrder := CompileOrder.JavaThenScala

@@ -49,14 +53,16 @@ lazy val root = Project("root", file("."))
   .settings(
     buildInfoKeys := Seq[BuildInfoKey](name, version, scalaVersion, sbtVersion)
   )
+// more memory for Spark in local mode, see https://github.com/holdenk/spark-testing-base
+javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:+CMSClassUnloadingEnabled")
 
 val commonScalacOptions = Seq(
   "-encoding",
   "UTF-8", // Specify character encoding used by source files
   "-target:jvm-1.8", // Target platform for object files
   "-Xexperimental", // Enable experimental extensions
-  "-Xfuture", // Turn on future language features
-  "-Ybackend:GenBCode" // Choice of bytecode emitter
+  "-Xfuture" // Turn on future language features
+  //"-Ybackend:GenBCode" // Choice of bytecode emitter
 )
 
 val compileScalacOptions = Seq(

@@ -100,7 +106,11 @@ scalacOptions in (Compile, console) := commonScalacOptions ++ Seq(
 
 scalacOptions in (Test, console) := (scalacOptions in (Compile, console)).value
 
-addCompilerPlugin("org.scalamacros" % "paradise" % "2.1.0" cross CrossVersion.full)
+// Have fullClasspath during compile, test and run, but don't assemble what is marked provided
+// https://github.com/sbt/sbt-assembly#-provided-configuration
+run in Compile := Defaults
+  .runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))
+  .evaluated
 
 /*
 scalac -language:help

@@ -156,48 +166,27 @@ l:classpath Enable cross-method optimizations across the entire classpat
 /*
  * Managed dependencies
  */
+val sparkVersion = "2.2.0"
+val clusterDependencyScope = "provided"
 
 libraryDependencies ++= Seq(
-  "commons-codec" % "commons-codec" % "1.10",
-  "commons-io" % "commons-io" % "2.5",
-  "commons-validator" % "commons-validator" % "1.5.1",
-  "joda-time" % "joda-time" % "2.9.6",
-  "mysql" % "mysql-connector-java" % "6.0.5",
-  "ch.qos.logback" % "logback-classic" % "1.1.7",
-  "com.github.nscala-time" %% "nscala-time" % "2.14.0"
-  //"com.github.pathikrit" %% "better-files" % "2.16.0" // No 2.12
-  ,
-  "com.github.t3hnar" %% "scala-bcrypt" % "3.0",
-  "com.google.guava" % "guava" % "20.0",
-  "com.ibm.icu" % "icu4j" % "58.1",
-  "com.softwaremill.macwire" %% "macros" % "2.2.5" % Provided,
-  "com.softwaremill.macwire" %% "proxy" % "2.2.5",
-  "com.softwaremill.macwire" %% "util" % "2.2.5",
-  "com.softwaremill.quicklens" %% "quicklens" % "1.4.8",
-  "com.typesafe" % "config" % "1.3.1",
-  "com.typesafe.scala-logging" %% "scala-logging" % "3.5.0",
-  "com.typesafe.slick" %% "slick" % "3.2.0-M2",
-  "com.typesafe.slick" %% "slick-hikaricp" % "3.2.0-M2",
-  "com.univocity" % "univocity-parsers" % "2.2.3",
-  "de.svenkubiak" % "jBCrypt" % "0.4.1",
-  "org.apache.commons" % "commons-compress" % "1.12",
-  "org.apache.commons" % "commons-csv" % "1.4",
-  "org.apache.commons" % "commons-lang3" % "3.5",
-  "org.apache.commons" % "commons-math3" % "3.6.1",
-  "org.apache.httpcomponents" % "httpclient" % "4.5.2",
-  "org.joda" % "joda-money" % "0.12",
-  "org.jsoup" % "jsoup" % "1.10.1",
-  "org.postgresql" % "postgresql" % "9.4.1212",
-  "org.quartz-scheduler" % "quartz" % "2.2.3",
-  "org.quartz-scheduler" % "quartz-jobs" % "2.2.3",
-  "org.scala-lang" % "scala-reflect" % scalaVersion.value,
-  "org.scalactic" %% "scalactic" % "3.0.1"
-)
+  "org.apache.spark" %% "spark-core" % sparkVersion % clusterDependencyScope,
+  "org.apache.spark" %% "spark-sql" % sparkVersion % clusterDependencyScope,
+  // "org.apache.hadoop" % "hadoop-aws" % "2.7.3" % clusterDependencyScope,
+  "org.apache.hadoop" % "hadoop-client" % "2.7.3" % clusterDependencyScope,
+  "org.slf4j" % "slf4j-log4j12" % "1.7.25",
+  "com.typesafe.scala-logging" %% "scala-logging" % "3.7.2",
+  "org.rogach" %% "scallop" % "3.1.0"
+).map(_.exclude("ch.qos.logback", "logback-classic"))
 
 libraryDependencies ++= Seq(
-  "org.mockito" % "mockito-core" % "2.2.29",
-  "org.scalatest" %% "scalatest" % "3.0.1",
-  "org.seleniumhq.selenium" % "selenium-java" % "3.0.1"
+  "org.scalatest" %% "scalatest" % "3.0.4",
+  "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.7.2",
+  "org.apache.spark" %% "spark-hive" % sparkVersion // required by spark-testing-base
+  // "org.scalacheck" %% "scalacheck" % "1.13.5",
+  // "org.scalamock" %% "scalamock-scalatest-support" % "3.6.0",
+  // "com.storm-enroute" %% "scalameter" % "0.8.2",
+  // "es.ucm.fdi" %% "sscheck" % "0.3.2",
 ) map (_ % Test)

@@ -210,6 +199,10 @@ addCommandAlias("pluginUpdates", "; reload plugins; dependencyUpdates; reload re
 
 // Statements executed when starting the Scala REPL (sbt's `console` task)
 initialCommands := """
 import
+  project.Functions._,
+  project.Processing,
+  project.Steps,
+  org.apache.spark.sql.SparkSession,
   scala.annotation.{switch, tailrec},
   scala.beans.{BeanProperty, BooleanBeanProperty},
   scala.collection.JavaConverters._,

@@ -226,8 +219,10 @@ import
   java.net._,
   java.nio.file._,
   java.time.{Duration => jDuration, _},
-  System.{currentTimeMillis => now},
-  System.nanoTime
+  java.lang.System.{currentTimeMillis => now},
+  java.lang.System.nanoTime
+
+val sparkNodes = sys.env.getOrElse("SPARK_NODES", "local[*]")
 
 def desugarImpl[T](c: blackbox.Context)(expr: c.Expr[T]): c.Expr[Unit] = {
   import c.universe._, scala.io.AnsiColor.{BOLD, GREEN, RESET}

@@ -240,6 +235,20 @@ def desugarImpl[T](c: blackbox.Context)(expr: c.Expr[T]): c.Expr[Unit] = {
 }
 
 def desugar[T](expr: T): Unit = macro desugarImpl[T]
+
+var _sparkInitialized = false
+@transient lazy val spark = {
+  _sparkInitialized = true
+  SparkSession.builder
+    .master(sparkNodes)
+    .appName("Console test")
+    .getOrCreate()
+}
+@transient lazy val sc = spark.sparkContext
+"""
+
+cleanupCommands in console := """
+if (_sparkInitialized) {spark.stop()}
 """
 
 // Do not exit sbt when Ctrl-C is used to stop a running app

@@ -299,6 +308,8 @@ testScalastyle := scalastyle.in(Test).toTask("").value
  * sbt-assembly https://github.com/sbt/sbt-assembly
  */
 test in assembly := {}
+// scala-library is provided by the Spark cluster execution environment
+assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
 
 /*
  * WartRemover: http://github.com/wartremover/wartremover

@@ -501,7 +512,6 @@ scalacOptions += "-P:linter:enable-only:" +
 /*
  * scoverage: http://github.com/scoverage/sbt-scoverage
  */
-
 coverageMinimum := 90
 coverageFailOnMinimum := true
 coverageOutputCobertura := false
data/example.json.gz (78 Bytes)

Binary file not shown.

project/Testing.scala (+15)

@@ -1,3 +1,18 @@
+/*
+ * Copyright 2017 Daniel Bast
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
 import sbt._
 import sbt.Keys._
 
project/plugins.sbt (+1)

@@ -1,5 +1,6 @@
 /*
  * Copyright 2011-2016 Marconi Lanna
+ * Copyright 2017 Daniel Bast
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
