Understanding foreachRDD, foreachPartition, and foreach in Spark

foreachRDD, foreachPartition, and foreach differ mainly in their scope of operation:

  • foreachRDD operates on the RDD generated for each batch interval of a DStream;
  • foreachPartition operates on each partition of each batch interval's RDD;
  • foreach operates on each element of each batch interval's RDD.

Both foreach and foreachPartition work on the iterator of each partition. The difference is that foreach simply calls foreach on the partition's iterator, applying the passed-in function to one element at a time inside that loop, whereas foreachPartition hands the whole iterator to the passed-in function and lets that function process the iterator itself (which can help avoid out-of-memory problems).
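The difference is easiest to see in a minimal sketch (assuming an existing SparkContext named sc; the println calls are just placeholders for real output logic): foreach hands your function one element at a time, while foreachPartition hands it the whole partition iterator, so per-partition setup and teardown have a natural place to live:

val rdd = sc.parallelize(1 to 100, numSlices = 4)

// foreach: the function receives one element at a time
rdd.foreach { element =>
  println(element)                              // runs on the executors, once per element
}

// foreachPartition: the function receives the whole iterator of a partition
rdd.foreachPartition { partitionIter =>
  // per-partition setup (e.g. opening a connection) would go here
  partitionIter.foreach(element => println(element))
  // per-partition teardown (e.g. closing the connection) would go here
}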

In the official Spark documentation, foreachRDD is listed under Output Operations on DStreams, so the first thing to be clear about is that it is an output operator. With that in mind, here is how the documentation describes it:

Output Operation: foreachRDD(func)

Meaning: The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

In short:

  • It is the most commonly used output operation.
  • It takes a function as its argument, and that function is applied to every RDD in the DStream.
  • The function pushes the data in each RDD to an external system, such as files or a database.
  • The function itself runs on the driver.
  • The function usually needs to contain RDD actions, because foreachRDD by itself does not trigger computation; the actions inside it are what force the streaming RDDs to be computed (see the sketch below).
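To make the last two points concrete, here is a minimal sketch (the DStream name dstream and the output path are assumptions for illustration). The function body runs in the driver process once per batch interval, and it is the count() and saveAsTextFile() actions inside it that actually force that batch's RDD to be computed:

dstream.foreachRDD { rdd =>
  // this block runs on the driver, once per batch interval
  val count = rdd.count()        // an RDD action: forces computation of this batch's RDD
  if (count > 0) {
    println(s"Batch contains $count records")
    // another action: writes the batch out to an external (file) system
    rdd.saveAsTextFile(s"/tmp/stream-output-${System.currentTimeMillis()}")
  }
}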

The documentation also calls out mistakes developers commonly make:

Often writing data to external system requires creating a connection object (e.g. TCP connection to a remote server) and using it to send data to a remote system. For this purpose, a developer may inadvertently try creating a connection object at the Spark driver, and then try to use it in a Spark worker to save records in the RDDs. For example :

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record) // executed at the worker
  }
}

This is incorrect as this requires the connection object to be serialized and sent from the driver to the worker. Such connection objects are rarely transferrable across machines. This error may manifest as serialization errors (connection object not serializable), initialization errors (connection object needs to be initialized at the workers), etc. The correct solution is to create the connection object at the worker.

In other words, when we use foreachRDD to write data to an external system, we usually need to create a connection object. Creating it on the driver, as in the code above, is wrong: the connection would have to be serialized and shipped from the driver to the workers, and such connection objects are rarely transferable, so when foreach runs on the worker nodes there is no usable connection there. This typically shows up as a serialization error or an initialization error.

However, this can lead to another common mistake - creating a new connection for every record. For example:

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}

 Typically, creating a connection object has time and resource overheads. Therefore, creating and destroying a connection object for each record can incur unnecessarily high overheads and can significantly reduce the overall throughput of the system. A better solution is to use rdd.foreachPartition - create a single connection object and send all the records in a RDD partition using that connection.

This version no longer throws errors, but it creates a new connection object for every single element inside foreach, which wastes time and resources. foreach is suited to simple per-element operations; when a connection object is needed, the usual fix is foreachPartition, so that only one connection is created per partition and that one connection is used to send every record in the partition, for example:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

Finally, this can be further optimized by reusing connection objects across multiple RDDs/batches. One can maintain a static pool of connection objects that can be reused as RDDs of multiple batches are pushed to the external system, thus further reducing the overheads.

Note that the connections in the pool should be lazily created on demand and timed out if not used for a while. This achieves the most efficient sending of data to external systems.

Going a step further, connection objects can be reused across the RDDs of multiple batches by using a connection pool. Note that the pool must be static and lazily initialized. The official example:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}

This point is also illustrated in the book SparkLearning:

Code example:

public class RedisUtil {
    private static JedisPool pool = null;
 
    static {
        if (pool == null) {
            JedisPoolConfig config = new JedisPoolConfig();
            config.setMaxIdle(25);
            config.setMaxWaitMillis(1000 * 100);
            config.setTestOnBorrow(true);
            pool = new JedisPool(config, "Redis IP", 6379);
        }
    }
 
    public static Jedis getConnection(){
        return pool.getResource();
    }
 
    public static void closeConnection(Jedis jedis, Boolean exceptionExist){
        if (jedis != null) {
            if(exceptionExist) {
                pool.returnBrokenResource(jedis);
            }else {
                pool.returnResource(jedis);
            }
        }
    }
}

linesFormat.foreachRDD(rdd => {
      rdd.foreachPartition(it => {
        var exceptionExist = false
        val jedis = RedisUtil.getConnection
        it.foreach(record => {
          try{
            if(jedis != null && jedis.exists("abc" + record._2 + record._3)){
              // fetch the data from Redis
              val value = jedis.hmget("abc" + record._2 + record._3, "a1", "a2", "a3", "a4", "a5", "a6")
              // use the data ...
            }
          }catch{
            case e: Exception => {
              println(e)
              exceptionExist = true
            }
          }
        })
        RedisUtil.closeConnection(jedis,exceptionExist)
        exceptionExist = false
      })
    })

References:

https://blog.csdn.net/Scapel/article/details/84030362

https://blog.csdn.net/qq_24084925/article/details/80000778
