Sweet doric syntactic sugar

Before delving into the specific topics of validations and modularity, let’s discuss some general considerations around syntax.

Alternative syntax for doric columns

Besides the syntax col[T](field), there are two other ways to create typed columns in doric:

Column aliases

If you are not used to generic parameters, aliases colInt, colString, etc., are also available as column selectors. In this way, we can write colInt("name") instead of col[Int]("name"). We can’t avoid generic parameters when selecting arrays or maps, though: e.g., colArray[Int]("name") stands for col[Array[Int]]("name").

This is the whole list of column alias:

Doric column type Column alias
String colString
Null colNull
Int colInt
Long colLong
Double colDouble
Float colFloat
Boolean colBoolean
Instant colInstant
LocalDate colLocalDate
Timestamp colTimestamp
Date colDate
Array[T] colArray[T: ClassTag]
Array[Int] colArrayInt
Array[Byte] colBinary
Array[String] colArrayString
Row colStruct
Map[K, V] colMap[K: SparkType, V: SparkType]
Map[String, V] colMapString[V: SparkType]

You can check the latest API for each type of column here.

Scala Dynamic

We can also write row.name[Int] instead of col[Int]("name"), where row refers to the top-level row of the DataFrame (you can think of it in similar terms to the this keyword). This syntax is particularly appealing when accessing inner fields of struct columns. Thus, we can write row.person[Row].age[Int], instead of colStruct("person").getChild[Int]("age").

Dot syntax

Doric embraces the dot notation of common idiomatic Scala code whenever possible, instead of the functional style of Spark SQL. For instance, given the following DataFrame:

val dfArrays = List(("string", Array(1,2,3))).toDF("str", "arr")
// dfArrays: org.apache.spark.sql.package.DataFrame = [str: string, arr: array<int>]

a common transformation in the SQL/functional style would go as follows:

val complexS: Column = 
    f.aggregate(
        f.transform(f.col("arr"), x => x + 1), 
        f.lit(0), 
        (x, y) => x + y)
// complexS: Column = aggregate(transform(arr, lambdafunction((x_0 + 1), x_0)), 0, lambdafunction((x_1 + y_2), x_1, y_2), lambdafunction(x_3, x_3))

dfArrays.select(complexS as "complexTransformation").show()
// +---------------------+
// |complexTransformation|
// +---------------------+
// |                    9|
// +---------------------+
//

Using doric, we would write it like this:

val complexCol: DoricColumn[Int] = 
    col[Array[Int]]("arr")
      .transform(_ + 1.lit)
      .aggregate(0.lit)(_ + _)
// complexCol: DoricColumn[Int] = TransformationDoricColumn(
//   Kleisli(scala.Function1$$Lambda$3079/0x000000080139c040@3d0e76eb)
// )
  
dfArrays.select(complexCol as "complexTransformation").show()
// +---------------------+
// |complexTransformation|
// +---------------------+
// |                    9|
// +---------------------+
//

Implicit castings

Implicit type conversions in Spark are pervasive. For instance, the following code won’t cause Spark to complain at all:

val df0 = spark.range(1,10).withColumn("x", f.concat(f.col("id"), f.lit("jander")))
// df0: org.apache.spark.sql.package.DataFrame = [id: bigint, x: string]

which means that an implicit conversion from bigint to string will be in effect when we run our DataFrame:

df0.select(f.col("x")).show()
// +-------+
// |      x|
// +-------+
// |1jander|
// |2jander|
// |3jander|
// |4jander|
// |5jander|
// |6jander|
// |7jander|
// |8jander|
// |9jander|
// +-------+
//

Assuming that you are certain that your column holds vales of type bigint, the same code in doric won’t compile (note that the Spark type bigint corresponds to the Scala type Long):

val df1 = spark.range(1,10).toDF().withColumn("x", concat(colLong("id"), "jander".lit))
// error: type mismatch;
//  found   : doric.NamedDoricColumn[Long]
//  required: doric.StringColumn
//     (which expands to)  doric.DoricColumn[String]
// val df1 = spark.range(1,10).toDF().withColumn("x", concat(colLong("id"), "jander".lit))
//                                                                  ^

Still, doric will allow you to perform that operation provided that you explicitly enact the conversion:

val df1 = spark.range(1,10).toDF().withColumn("x", concat(colLong("id").cast[String], "jander".lit))
// df1: org.apache.spark.sql.package.DataFrame = [id: bigint, x: string]
df1.show()
// +---+-------+
// | id|      x|
// +---+-------+
// |  1|1jander|
// |  2|2jander|
// |  3|3jander|
// |  4|4jander|
// |  5|5jander|
// |  6|6jander|
// |  7|7jander|
// |  8|8jander|
// |  9|9jander|
// +---+-------+
//

Let’s also consider the following example:

val dfEq = List((1, "1"), (1, " 1"), (1, " 1 ")).toDF("int", "str")
// dfEq: org.apache.spark.sql.package.DataFrame = [int: int, str: string]
dfEq.withColumn("eq", f.col("int") === f.col("str"))
// res5: org.apache.spark.sql.package.DataFrame = [int: int, str: string ... 1 more field]

What would you expect to be the result? Well, it all depends on the implicit conversion that Spark chooses to apply, if at all:

  1. It may return false for the new column, given that the types of both input columns differ, thus choosing to apply no conversion
  2. It may convert the integer column into a string column
  3. It may convert strings to integers.

Let’s see what happens:

dfEq.withColumn("eq", f.col("int") === f.col("str")).show()
// +---+---+----+
// |int|str|  eq|
// +---+---+----+
// |  1|  1|true|
// |  1|  1|true|
// |  1| 1 |true|
// +---+---+----+
//

Option 3 wins, but you can only learn this by trial and error. With doric, you can depart from all this magic and explicitly cast types, if you desired so:

// Option 1, no castings: compile error
dfEq.withColumn("eq", colInt("int") === colString("str")).show()
// error: type mismatch;
//  found   : doric.NamedDoricColumn[String]
//  required: doric.DoricColumn[Int]
// dfEq.withColumn("eq", colInt("int") === colString("str")).show()
//                                                  ^
// Option 2, casting from int to string
dfEq.withColumn("eq", colInt("int").cast[String] === colString("str")).show()
// +---+---+-----+
// |int|str|   eq|
// +---+---+-----+
// |  1|  1| true|
// |  1|  1|false|
// |  1| 1 |false|
// +---+---+-----+
//
// Option 3, casting from string to int, not safe!
dfEq.withColumn("eq", colInt("int") === colString("str").unsafeCast[Int]).show()
// +---+---+----+
// |int|str|  eq|
// +---+---+----+
// |  1|  1|true|
// |  1|  1|true|
// |  1| 1 |true|
// +---+---+----+
//

Note that we can’t simply cast a string to an integer, since this conversion is partial. If the programmer insists in doing this unsafe casting, doric will force her to explicitly acknowledge this fact using the conversion function unsafeCast.

Last, note that we can also emulate the default Spark behaviour, enabling implicit conversions for safe castings, with an explicit import statement:

import doric.implicitConversions.implicitSafeCast

dfEq.withColumn("eq", colString("str") === colInt("int") ).show()
// +---+---+-----+
// |int|str|   eq|
// +---+---+-----+
// |  1|  1| true|
// |  1|  1|false|
// |  1| 1 |false|
// +---+---+-----+
//

Literal conversions

Sometimes, Spark allows us to insert literal values to simplify our code:

val intDF = List(1,2,3).toDF("int")
// intDF: org.apache.spark.sql.package.DataFrame = [int: int]
val colS = f.col("int") + 1
// colS: Column = (int + 1)

intDF.select(colS).show()
// +---------+
// |(int + 1)|
// +---------+
// |        2|
// |        3|
// |        4|
// +---------+
//

The default doric syntax is a little stricter and forces us to transform these values to literal columns:

val colD = colInt("int") + 1.lit
// colD: DoricColumn[Int] = TransformationDoricColumn(
//   Kleisli(scala.Function1$$Lambda$3079/0x000000080139c040@31891fca)
// )

intDF.select(colD).show()
// +---------+
// |(int + 1)|
// +---------+
// |        2|
// |        3|
// |        4|
// +---------+
//

However, we can also profit from the same literal syntax with the help of implicits. To enable this behaviour, we have to explicitly add the following import statement:

import doric.implicitConversions.literalConversion
val colSugarD = colInt("int") + 1
// colSugarD: DoricColumn[Int] = TransformationDoricColumn(
//   Kleisli(scala.Function1$$Lambda$3079/0x000000080139c040@e5f2c04)
// )
val columConcatLiterals = concat("this", "is","doric") // concat expects DoricColumn[String] values, the conversion puts them as expected
// columConcatLiterals: StringColumn = TransformationDoricColumn(
//   Kleisli(scala.Function1$$Lambda$3079/0x000000080139c040@7751054a)
// )

intDF.select(colSugarD, columConcatLiterals).show()
// +---------+-----------------------+
// |(int + 1)|concat(this, is, doric)|
// +---------+-----------------------+
// |        2|            thisisdoric|
// |        3|            thisisdoric|
// |        4|            thisisdoric|
// +---------+-----------------------+
//

Of course, implicit conversions are only in effect if the type of the literal value is valid:

colInt("int") + 1f //an integer with a float value can't be directly added in doric
// error: type mismatch;
//  found   : Float(1.0)
//  required: doric.DoricColumn[Int]
// colInt("int") + 1f //an integer with a float value can't be directly added in doric
//                 ^^
concat("hi", 5) // expects only strings and an integer is found
// error: type mismatch;
//  found   : Int(5)
//  required: doric.StringColumn
//     (which expands to)  doric.DoricColumn[String]
// concat("hi", 5) // expects only strings and an integer is found
//              ^