Doric Documentation
Sweet doric syntactic sugar
Before delving into the specific topics of validations and modularity, let’s discuss some general considerations around syntax.
Alternative syntax for doric columns
Besides the syntax col[T](field), there are two other ways to create typed columns in doric:
Column aliases
If you are not used to generic parameters, aliases colInt, colString, etc., are also available as column selectors. In this way, we can write colInt("name") instead of col[Int]("name"). We can't avoid generic parameters when selecting arrays or maps, though: e.g., colArray[Int]("name") stands for col[Array[Int]]("name").
This is the whole list of column aliases:
Doric column type | Column alias |
---|---|
String | colString |
Null | colNull |
Int | colInt |
Long | colLong |
Double | colDouble |
Float | colFloat |
Boolean | colBoolean |
Instant | colInstant |
LocalDate | colLocalDate |
Timestamp | colTimestamp |
Date | colDate |
Array[T] | colArray[T: ClassTag] |
Array[Int] | colArrayInt |
Array[Byte] | colBinary |
Array[String] | colArrayString |
Row | colStruct |
Map[K, V] | colMap[K: SparkType, V: SparkType] |
Map[String, V] | colMapString[V: SparkType] |
You can check the latest API for each type of column here.
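For instance, the aliases from the table above can be combined freely with the rest of the doric API. The following is a minimal sketch (the DataFrame dfAlias is hypothetical, created here just for illustration, and assumes the usual doric and Spark implicits are in scope):
val dfAlias = List((1, Array("a", "b"))).toDF("id", "tags")
// colInt("id") is just col[Int]("id"); colArrayString("tags") is col[Array[String]]("tags")
dfAlias.select(colInt("id"), colArrayString("tags")).show()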
Scala Dynamic
We can also write row.name[Int] instead of col[Int]("name"), where row refers to the top-level row of the DataFrame (you can think of it in similar terms to the this keyword). This syntax is particularly appealing when accessing inner fields of struct columns. Thus, we can write row.person[Row].age[Int] instead of colStruct("person").getChild[Int]("age").
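As a quick sketch of this syntax in action, consider a DataFrame with a single struct column person (the DataFrame below is hypothetical and assumes the usual doric and Spark imports are in scope):
val people = List(("Alice", 20)).toDF("name", "age")
  .select(f.struct(f.col("name"), f.col("age")) as "person")
// both selections below refer to the same nested field
people.select(row.person[Row].age[Int] as "age").show()
people.select(colStruct("person").getChild[Int]("age") as "age").show()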
Dot syntax
Doric embraces the dot notation of common idiomatic Scala code whenever possible, instead of the functional style of Spark SQL. For instance, given the following DataFrame:
val dfArrays = List(("string", Array(1,2,3))).toDF("str", "arr")
// dfArrays: org.apache.spark.sql.package.DataFrame = [str: string, arr: array<int>]
a common transformation in the SQL/functional style would go as follows:
val complexS: Column =
  f.aggregate(
    f.transform(f.col("arr"), x => x + 1),
    f.lit(0),
    (x, y) => x + y)
// complexS: Column = aggregate(transform(arr, lambdafunction((x_0 + 1), x_0)), 0, lambdafunction((x_1 + y_2), x_1, y_2), lambdafunction(x_3, x_3))
dfArrays.select(complexS as "complexTransformation").show()
// +---------------------+
// |complexTransformation|
// +---------------------+
// | 9|
// +---------------------+
//
Using doric, we would write it like this:
val complexCol: DoricColumn[Int] =
  col[Array[Int]]("arr")
    .transform(_ + 1.lit)
    .aggregate(0.lit)(_ + _)
// complexCol: DoricColumn[Int] = TransformationDoricColumn(
// Kleisli(scala.Function1$$Lambda$3079/0x000000080139c040@3d0e76eb)
// )
dfArrays.select(complexCol as "complexTransformation").show()
// +---------------------+
// |complexTransformation|
// +---------------------+
// | 9|
// +---------------------+
//
Implicit castings
Implicit type conversions in Spark are pervasive. For instance, the following code won’t cause Spark to complain at all:
val df0 = spark.range(1,10).withColumn("x", f.concat(f.col("id"), f.lit("jander")))
// df0: org.apache.spark.sql.package.DataFrame = [id: bigint, x: string]
which means that an implicit conversion from bigint to string will be in effect when we run our DataFrame:
df0.select(f.col("x")).show()
// +-------+
// | x|
// +-------+
// |1jander|
// |2jander|
// |3jander|
// |4jander|
// |5jander|
// |6jander|
// |7jander|
// |8jander|
// |9jander|
// +-------+
//
Assuming that you are certain that your column holds values of type bigint, the same code in doric won't compile (note that the Spark type bigint corresponds to the Scala type Long):
val df1 = spark.range(1,10).toDF().withColumn("x", concat(colLong("id"), "jander".lit))
// error: type mismatch;
// found : doric.NamedDoricColumn[Long]
// required: doric.StringColumn
// (which expands to) doric.DoricColumn[String]
// val df1 = spark.range(1,10).toDF().withColumn("x", concat(colLong("id"), "jander".lit))
// ^
Still, doric will allow you to perform that operation provided that you explicitly enact the conversion:
val df1 = spark.range(1,10).toDF().withColumn("x", concat(colLong("id").cast[String], "jander".lit))
// df1: org.apache.spark.sql.package.DataFrame = [id: bigint, x: string]
df1.show()
// +---+-------+
// | id| x|
// +---+-------+
// | 1|1jander|
// | 2|2jander|
// | 3|3jander|
// | 4|4jander|
// | 5|5jander|
// | 6|6jander|
// | 7|7jander|
// | 8|8jander|
// | 9|9jander|
// +---+-------+
//
Let’s also consider the following example:
val dfEq = List((1, "1"), (1, " 1"), (1, " 1 ")).toDF("int", "str")
// dfEq: org.apache.spark.sql.package.DataFrame = [int: int, str: string]
dfEq.withColumn("eq", f.col("int") === f.col("str"))
// res5: org.apache.spark.sql.package.DataFrame = [int: int, str: string ... 1 more field]
What would you expect to be the result? Well, it all depends on the implicit conversion that Spark chooses to apply, if at all:
- It may return false for the new column, given that the types of both input columns differ, thus choosing to apply no conversion.
- It may convert the integer column into a string column.
- It may convert the string column into an integer column.
Let’s see what happens:
dfEq.withColumn("eq", f.col("int") === f.col("str")).show()
// +---+---+----+
// |int|str| eq|
// +---+---+----+
// | 1| 1|true|
// | 1| 1|true|
// | 1| 1 |true|
// +---+---+----+
//
Option 3 wins, but you can only learn this by trial and error. With doric, you can depart from all this magic and explicitly cast types, if you so desire:
// Option 1, no castings: compile error
dfEq.withColumn("eq", colInt("int") === colString("str")).show()
// error: type mismatch;
// found : doric.NamedDoricColumn[String]
// required: doric.DoricColumn[Int]
// dfEq.withColumn("eq", colInt("int") === colString("str")).show()
// ^
// Option 2, casting from int to string
dfEq.withColumn("eq", colInt("int").cast[String] === colString("str")).show()
// +---+---+-----+
// |int|str| eq|
// +---+---+-----+
// | 1| 1| true|
// | 1| 1|false|
// | 1| 1 |false|
// +---+---+-----+
//
// Option 3, casting from string to int, not safe!
dfEq.withColumn("eq", colInt("int") === colString("str").unsafeCast[Int]).show()
// +---+---+----+
// |int|str| eq|
// +---+---+----+
// | 1| 1|true|
// | 1| 1|true|
// | 1| 1 |true|
// +---+---+----+
//
Note that we can't simply cast a string to an integer, since this conversion is partial. If the programmer insists on doing this unsafe casting, doric will force her to explicitly acknowledge this fact using the conversion function unsafeCast.
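As a rough illustration of why the conversion is partial, consider a column holding a string that does not represent any integer (the DataFrame below is hypothetical; the exact runtime outcome is an assumption, the point here is only that doric makes the risk explicit):
val dfPartial = List("1", "foo").toDF("str")
// this compiles only because we acknowledged the risk with unsafeCast;
// at runtime "foo" has no integer representation, so (as with Spark's own cast)
// we can expect a null rather than a compile-time failure
dfPartial.withColumn("asInt", colString("str").unsafeCast[Int]).show()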
Last, note that we can also emulate the default Spark behaviour, enabling implicit conversions for safe castings, with an explicit import statement:
import doric.implicitConversions.implicitSafeCast
dfEq.withColumn("eq", colString("str") === colInt("int") ).show()
// +---+---+-----+
// |int|str| eq|
// +---+---+-----+
// | 1| 1| true|
// | 1| 1|false|
// | 1| 1 |false|
// +---+---+-----+
//
Literal conversions
Sometimes, Spark allows us to insert literal values to simplify our code:
val intDF = List(1,2,3).toDF("int")
// intDF: org.apache.spark.sql.package.DataFrame = [int: int]
val colS = f.col("int") + 1
// colS: Column = (int + 1)
intDF.select(colS).show()
// +---------+
// |(int + 1)|
// +---------+
// | 2|
// | 3|
// | 4|
// +---------+
//
The default doric syntax is a little stricter and forces us to transform these values to literal columns:
val colD = colInt("int") + 1.lit
// colD: DoricColumn[Int] = TransformationDoricColumn(
// Kleisli(scala.Function1$$Lambda$3079/0x000000080139c040@31891fca)
// )
intDF.select(colD).show()
// +---------+
// |(int + 1)|
// +---------+
// | 2|
// | 3|
// | 4|
// +---------+
//
However, we can also profit from the same literal syntax with the help of implicits. To enable this behaviour, we have to explicitly add the following import statement:
import doric.implicitConversions.literalConversion
val colSugarD = colInt("int") + 1
// colSugarD: DoricColumn[Int] = TransformationDoricColumn(
// Kleisli(scala.Function1$$Lambda$3079/0x000000080139c040@e5f2c04)
// )
val columConcatLiterals = concat("this", "is", "doric") // concat expects DoricColumn[String] values; the implicit conversion lifts the string literals accordingly
// columConcatLiterals: StringColumn = TransformationDoricColumn(
// Kleisli(scala.Function1$$Lambda$3079/0x000000080139c040@7751054a)
// )
intDF.select(colSugarD, columConcatLiterals).show()
// +---------+-----------------------+
// |(int + 1)|concat(this, is, doric)|
// +---------+-----------------------+
// | 2| thisisdoric|
// | 3| thisisdoric|
// | 4| thisisdoric|
// +---------+-----------------------+
//
Of course, implicit conversions are only in effect if the type of the literal value is valid:
colInt("int") + 1f //an integer with a float value can't be directly added in doric
// error: type mismatch;
// found : Float(1.0)
// required: doric.DoricColumn[Int]
// colInt("int") + 1f //an integer with a float value can't be directly added in doric
// ^^
concat("hi", 5) // expects only strings and an integer is found
// error: type mismatch;
// found : Int(5)
// required: doric.StringColumn
// (which expands to) doric.DoricColumn[String]
// concat("hi", 5) // expects only strings and an integer is found
// ^