What is Broadcast Variable in Apache Spark with example

  • Post category:Spark
  • Reading time:6 mins read
apache_spark_logo

Hello Friends. In this article we will dive into the basic concept of broadcast variables. On a very high level broadcast variable is a kind of shared variable that Spark provides.

What is Broadcast variable

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. The idea is to transfer values used in transformations from a driver to executors in a most effective way so they are copied once and used many times by tasks.

When should we use Broadcast variable

Let’s imagine a scenario that we are executing a Spark application and for each input row , we have to use a large variable like a static lookup table which is partitioned across multiple worker nodes. By default Spark will transfer copy of this variable with every task which involves a lot of network overhead and is definitely not efficient.

We can use Broadcast variable to deal with such scenarios. This is useful when tasks across multiple stages need the same data or when caching the data in de-serialized form is important. This will reduce the size of each serialized task, and the cost of launching a job over a cluster. Spark distributes the broadcast variables using efficient broadcast algorithms to reduce network cost.

How to create a broadcast variable

We can use SparkContext’s broadcast method to create a broadcast variable. We will enable the DEBUG logging level so that we can see some extra logs created by the class org.apache.spark.storage.BlockManager.

  // Create SparkConf
  val conf = new SparkConf
  conf.setMaster("local[4]")
  conf.setAppName("broadcast test")
  // Create SparkContext
  val sc = new SparkContext(conf)
  // Enable DEBUG logging to see some extra logs of class BlockManager
  sc.setLogLevel("DEBUG")
  val broadcast = sc.broadcast(Seq(1, 2, 3, 4, 5))

After running the above program, the logs generated by BlockManager are as below

DEBUG BlockManager: Put block broadcast_0 locally took 89 ms
DEBUG BlockManager: Putting block broadcast_0 without replication took 92 ms
DEBUG BlockManager: Told master about block broadcast_0_piece0
DEBUG BlockManager: Put block broadcast_0_piece0 locally took 10 ms
DEBUG BlockManager: Putting block broadcast_0_piece0 without replication took 10 ms

Now we can simply verify from the logs that our broadcast variable is created successfully.

How to access the value of a broadcast variable

After creating a broadcast variable let’s se how we can access the variable created above. value method is the only way to access the value of a broadcast variable.

val value = broadcast.value
println(value)

Again let’s checkout the below messages in logs.

21/06/13 16:46:34 DEBUG BlockManager: Getting local block broadcast_0
21/06/13 16:46:34 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)

Finally we will use destroy method to destroy the broadcast variable.

broadcast.destroy()

Again we should be able to see below messages in the logs


DEBUG BlockManager: Removing broadcast 0
DEBUG BlockManager: Removing block broadcast_0_piece0
DEBUG BlockManager: Told master about block broadcast_0_piece0
DEBUG BlockManager: Removing block broadcast_0

Methods available in Broadcast class

Modifier and TypeMethod and Description
voiddestroy() Destroy all data and metadata related to this broadcast variable.
longid() 
StringtoString() 
voidunpersist() Asynchronously delete cached copies of this broadcast on the executors.
voidunpersist(boolean blocking) Delete cached copies of this broadcast on the executors.
Tvalue() Get the broadcasted value.

References

https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#shared-variables

Happy Learning 🙂