Quick start

Installation

To use doric, add the following dependency to your favourite build tool:

Sbt

libraryDependencies += "org.hablapps" % "doric_3-5_2.12" % "0.0.8"

Maven

<dependency>
  <groupId>org.hablapps</groupId>
  <artifactId>doric_3-5_2.12</artifactId>
  <version>0.0.8</version>
</dependency>

Doric is committed to supporting the most modern APIs first.

  • The latest stable version of doric is 0.0.8.
  • The latest experimental version of doric is 0.0.0+1-424172d4-SNAPSHOT.
  • Doric is compatible with the following Spark versions:
Spark               Scala        Tested  doric
------------------  -----------  ------  -------------------------
2.4.x (Deprecated)  2.11                 Maven Central
3.0.0               2.12                 You can use 3.0.2 version
3.0.1               2.12                 You can use 3.0.2 version
3.0.2               2.12                 Maven Central
3.1.0               2.12                 You can use 3.1.2 version
3.1.1               2.12                 You can use 3.1.2 version
3.1.2               2.12                 Maven Central
3.2.0               2.12 / 2.13          You can use 3.2.4 version
3.2.1               2.12 / 2.13          You can use 3.2.4 version
3.2.2               2.12 / 2.13          You can use 3.2.4 version
3.2.3               2.12 / 2.13          You can use 3.2.4 version
3.2.4               2.12                 Maven Central
3.2.4               2.13                 Maven Central
3.3.0               2.12 / 2.13          You can use 3.3.4 version
3.3.1               2.12 / 2.13          You can use 3.3.4 version
3.3.2               2.12 / 2.13          You can use 3.3.4 version
3.3.3               2.12 / 2.13          You can use 3.3.4 version
3.3.4               2.12                 Maven Central
3.3.4               2.13                 Maven Central
3.4.0               2.12 / 2.13          You can use 3.4.4 version
3.4.1               2.12 / 2.13          You can use 3.4.4 version
3.4.2               2.12 / 2.13          You can use 3.4.4 version
3.4.3               2.12 / 2.13          You can use 3.4.4 version
3.4.4               2.12                 Maven Central
3.4.4               2.13                 Maven Central
3.5.0               2.12 / 2.13          You can use 3.5.3 version
3.5.1               2.12 / 2.13          You can use 3.5.3 version
3.5.2               2.12 / 2.13          You can use 3.5.3 version
3.5.3               2.12                 Maven Central
3.5.3               2.13                 Maven Central
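
As a rough illustration of how the compatibility list maps onto a build definition, the following build.sbt sketch pairs Spark 3.5.3 with the doric_3-5 artifact. The Scala patch version, the explicit spark-sql dependency and its Provided scope are assumptions made for the example, not requirements stated by doric; adjust them to your own setup.

```scala
// build.sbt (sketch; versions other than doric 0.0.8 are assumptions)
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Spark itself; Provided is a common choice when the cluster supplies Spark.
  // Remove the scope if you want to run locally from sbt.
  "org.apache.spark" %% "spark-sql" % "3.5.3" % Provided,
  // %% appends the Scala binary version, so this resolves to doric_3-5_2.12
  "org.hablapps" %% "doric_3-5" % "0.0.8"
)
```

Using %% instead of % with a hard-coded suffix lets the same line resolve the 2.13 artifact if scalaVersion is later changed to a 2.13 release.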

Import statements

Doric is very easy to work with. First, you need the following import:

import doric._

Conventional Spark column expressions and doric columns can be combined freely. However, to avoid name clashes, we will refer to Spark's functions through the prefix f:

import org.apache.spark.sql.{functions => f}
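
The snippets on this page also rely on an active SparkSession and its implicits (for toDF and the $"..." interpolator), which are not shown. A minimal preamble might look like the sketch below; the object name, the local master and the application name are hypothetical choices for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.{functions => f}
import doric._

object QuickstartPreamble {
  // Hypothetical local session; in spark-shell or a notebook, `spark` already exists.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("doric-quickstart")
    .getOrCreate()

  // Small sample DataFrame with a single Int column named "value".
  def sample(): DataFrame = {
    import spark.implicits._ // enables toDF and the $"..." interpolator
    List(1, 2, 3).toDF("value")
  }

  def main(args: Array[String]): Unit = {
    // Plain Spark column expression, accessed through the f prefix
    sample().select(f.col("value") + f.lit(1)).show()
    // Equivalent doric expression, with the column type stated explicitly
    sample().select((col[Int]("value") + lit(1)).as("next")).show()
    spark.stop()
  }
}
```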

Type-safe column expressions

The overall purpose of doric is to provide a type-safe API on top of the DataFrame API. This essentially means that we aim to catch errors at compile time. For instance, in Spark we can’t mix apples and oranges, such as multiplying an integer column by a boolean, but this code still compiles:

def df = List(1,2,3).toDF().select($"value" * f.lit(true))

It’s only when we try to construct the DataFrame that an exception is raised at run-time:

df
// org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(value * true)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("INT" and "BOOLEAN").;
// 'Project [unresolvedalias((value#365 * true), Some(org.apache.spark.sql.Column$$Lambda$3790/0x00000008016c6840@115ac121))]
// +- LocalRelation [value#365]
// 
// 	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:310)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:297)
// 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
// 	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:243)
// 	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:243)
// 	at scala.collection.Iterator.foreach(Iterator.scala:943)
// 	at scala.collection.Iterator.foreach$(Iterator.scala:943)
// 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
// 	at scala.collection.IterableLike.foreach(IterableLike.scala:74)

With doric, there is no need to wait that long: errors are reported at compile-time!

List(1,2,3).toDF().select(col[Int]("value") * lit(true))
// error: type mismatch;
//  found   : Boolean(true)
//  required: Int
// List(1,2,3).toDF().select(col[Int]("value") * lit(true))
//                                                   ^^^^

As you can see, changes to column expressions are minimal: just annotate column references with the intended type, i.e. col[Int]("name") instead of col("name"). With this extra bit of type information, we are not only referring to a column named name: we are also stating that the expected Spark data type of that column is IntegerType.


ℹ️ NOTE ℹ️

Of course, this only works if we know the intended type of the column at compile-time. In a purely dynamic setting, doric offers no benefit. Note, however, that you don’t need to know the whole row type in advance, as you do with Datasets. Thus, doric sits between a wholehearted static setting and a purely dynamic one. It offers type-safety for column expressions at a minimum cost, without compromising performance, i.e. sticking to DataFrames.


Finally, once we have constructed a doric column expression, we can use it within the context of a withColumn expression, or, in general, wherever we may use plain Spark columns: joins, filters, etc.:

List(1,2,3).toDF().filter(col[Int]("value") > lit(1))
// res1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
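
The same kind of expression can be passed to withColumn. The sketch below adds a derived column and then filters on it, using only doric expressions; the object and method names, as well as the column name doubled, are made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import doric._

object WithColumnSketch {
  // Adds a typed "doubled" column and keeps the rows where it exceeds 2.
  def doubledAboveTwo(df: DataFrame): DataFrame =
    df.withColumn("doubled", col[Int]("value") * lit(2))
      .filter(col[Int]("doubled") > lit(2))

  def main(args: Array[String]): Unit = {
    // Hypothetical local session for the example
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("doric-withColumn")
      .getOrCreate()
    import spark.implicits._

    doubledAboveTwo(List(1, 2, 3).toDF("value")).show()
    spark.stop()
  }
}
```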

As described in the validations section of the documentation, explicit type annotations also enable further checks when columns are interpreted within the context of withColumn, select, etc.
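
To give an idea of what such a check looks like, the sketch below annotates an integer column as String and captures the resulting failure instead of letting it propagate. It assumes that doric's select validates the annotated type against the DataFrame schema when it is called and reports a mismatch as a doric.sem.DoricMultiError, as in the error output shown elsewhere on this page; the object and method names are hypothetical.

```scala
import scala.util.{Failure, Success, Try}
import org.apache.spark.sql.{DataFrame, SparkSession}
import doric._

object ValidationSketch {
  // Describes what happens when "value" is selected as a String column.
  def selectAsString(df: DataFrame): String =
    Try(df.select(col[String]("value"))) match {
      case Success(_) => "ok"
      case Failure(e: doric.sem.DoricMultiError) =>
        s"doric validation failed: ${e.getMessage}"
      case Failure(other) => s"unexpected failure: $other"
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical local session for the example
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("doric-validation")
      .getOrCreate()
    import spark.implicits._

    // "value" is an Int column here, so the String annotation should be rejected.
    println(selectAsString(List(1, 2, 3).toDF("value")))
    spark.stop()
  }
}
```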

Mixing doric and Spark columns

Since doric is intended as a replacement for the whole DataFrame API, it provides type-safe versions of Spark functions for numbers, dates, strings, etc. For the full list of available transformations, see the DoricColumn API. Occasionally, however, we might need to mix doric and Spark column expressions. There is no problem with that, as this example shows:

val strDf = List("hi", "welcome", "to", "doric").toDF("str")
// strDf: org.apache.spark.sql.package.DataFrame = [str: string]

strDf
  .select(f.concat(f.col("str"), f.lit("!!!")) as "newCol") //pure spark
  .select(concat(lit("???"), colString("newCol")) as "finalCol") //pure and sweet doric
  .show()
// +-------------+
// |     finalCol|
// +-------------+
// |     ???hi!!!|
// |???welcome!!!|
// |     ???to!!!|
// |  ???doric!!!|
// +-------------+
//

Also, we can transform pure Spark columns into doric columns, and be sure that specific doric validations will be applied:

strDf.select(f.col("str").asDoric[String]).show()
// +-------+
// |    str|
// +-------+
// |     hi|
// |welcome|
// |     to|
// |  doric|
// +-------+
//
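
Once converted, the column can be combined with other doric expressions like any typed column. The following sketch wraps a plain Spark column with asDoric[String] and feeds it to doric's concat; the object, method and output column names are invented for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.{functions => f}
import doric._

object AsDoricSketch {
  // Converts the Spark column "str" to a doric column and appends "!" to it.
  def shout(df: DataFrame): DataFrame =
    df.select(concat(f.col("str").asDoric[String], lit("!")).as("shouted"))

  def main(args: Array[String]): Unit = {
    // Hypothetical local session for the example
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("doric-asDoric")
      .getOrCreate()
    import spark.implicits._

    shout(List("hi", "welcome", "to", "doric").toDF("str")).show()
    spark.stop()
  }
}
```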

strDf.select((f.col("str") + f.lit(true)).asDoric[String]).show()
// doric.sem.DoricMultiError: Found 1 error in select
//   [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(str + true)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("DOUBLE" and "BOOLEAN").;
//   'Project [unresolvedalias((cast(str#378 as double) + true), Some(org.apache.spark.sql.Column$$Lambda$3790/0x00000008016c6840@115ac121))]
//   +- Project [value#375 AS str#378]
//      +- LocalRelation [value#375]
//   
//   	located at . (quickstart.md:76)
// 
// 	at doric.sem.package$ErrorThrower.$anonfun$returnOrThrow$1(package.scala:9)
// 	at cats.data.Validated.fold(Validated.scala:50)
// 	at doric.sem.package$ErrorThrower.returnOrThrow(package.scala:9)
// 	at doric.sem.TransformOps$DataframeTransformationSyntax.select(TransformOps.scala:140)
// 	at repl.MdocSession$MdocApp$$anonfun$2.apply$mcV$sp(quickstart.md:76)
// 	at repl.MdocSession$MdocApp$$anonfun$2.apply(quickstart.md:76)
// 	at repl.MdocSession$MdocApp$$anonfun$2.apply(quickstart.md:76)
// Caused by: org.apache.spark.sql.AnalysisException: [DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(str + true)" due to data type mismatch: the left and right operands of the binary operator have incompatible types ("DOUBLE" and "BOOLEAN").;
// 'Project [unresolvedalias((cast(str#378 as double) + true), Some(org.apache.spark.sql.Column$$Lambda$3790/0x00000008016c6840@115ac121))]
// +- Project [value#375 AS str#378]
//    +- LocalRelation [value#375]
// 
// 	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.dataTypeMismatch(package.scala:73)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:310)
// 	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:297)
// 	at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:244)
// 	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:243)
// 	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:243)
// 	at scala.collection.Iterator.foreach(Iterator.scala:943)
// 	at scala.collection.Iterator.foreach$(Iterator.scala:943)
// 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
// 	at scala.collection.IterableLike.foreach(IterableLike.scala:74)