Spark Docker opt


Spark Docker

Start Spark in Docker

# Pull the official Spark Python image
docker pull apache/spark-py:latest

# Run an interactive pyspark shell, publishing the Spark master port (7077)
# and the web UI (container port 8080, mapped to host port 18080)
docker run -p 7077:7077 -p 18080:8080 -it apache/spark-py /opt/spark/bin/pyspark
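Once the shell comes up, `spark` (a SparkSession) and `sc` (a SparkContext) are already defined, so a quick smoke test confirms the container works:

# Inside the pyspark shell; `spark` and `sc` are pre-created
spark.range(100).count()          # expect 100
sc.parallelize([1, 2, 3]).sum()   # expect 6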


PySpark install

pip install pyspark
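After the install, a minimal standalone script verifies the local setup; a sketch (the file name and app name are arbitrary):

# check_pyspark.py -- minimal local smoke test
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()          # prints the two rows
print(df.count())  # expect 2
spark.stop()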

References:
• https://pypi.org/project/pyspark/
• https://spark.apache.org/documentation.html
• https://hub.docker.com/r/apache/spark-py/tags

Start Spark with Jupyter in Docker


# Map host port 8088 to the notebook server's port 8888 inside the container
sudo docker run -it --rm -p 8088:8888 jupyter/all-spark-notebook

Test case

An example of using Spark inside the Jupyter notebook.
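In jupyter/all-spark-notebook the Python kernel ships with PySpark, so a first cell typically builds a local SparkSession; a minimal sketch (app name arbitrary):

# Notebook cell: build a local SparkSession and run a tiny job
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("notebook-test").getOrCreate()
spark.range(10).selectExpr("sum(id)").show()   # expect 45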

Read a file with spark-shell


$ spark-shell
scala> val words = sc.textFile("file:///usr/local/spark-3.5.0-bin-hadoop3/NOTICE")
words: org.apache.spark.rdd.RDD[String] = file:///usr/local/spark-3.5.0-bin-hadoop3/NOTICE MapPartitionsRDD[7] at textFile at <console>:23

scala> val rddtest = words.cache
rddtest: words.type = file:///usr/local/spark-3.5.0-bin-hadoop3/NOTICE MapPartitionsRDD[7] at textFile at <console>:23

scala> val fk = rddtest.first
fk: String = Apache Spark
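For comparison, the same read-and-inspect steps in the pyspark shell (same NOTICE path, which assumes Spark is installed under /usr/local/spark-3.5.0-bin-hadoop3):

# pyspark shell equivalent; adjust the path to your Spark install
words = sc.textFile("file:///usr/local/spark-3.5.0-bin-hadoop3/NOTICE")
rddtest = words.cache()
rddtest.first()   # 'Apache Spark'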


Spark Streaming

Stream processing continuously feeds input data into small processing units. Spark Streaming is a Spark component for near-real-time analytics, with latencies of only a few seconds. The input stream is divided into batches, and each batch is processed as a separate Spark RDD.

spark-shell>
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Reuse the shell's SparkContext (sc) to create a StreamingContext with a
// batch interval of 1 second. Outside the shell, the master needs at least
// 2 cores to prevent a starvation scenario.
val ssc = new StreamingContext(sc, Seconds(1))

// Create a DStream that will connect to hostname:port, like localhost:9999
// (feed it from another terminal with: nc -lk 9999)
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words
val words = lines.flatMap(_.split(" "))

// Count each word in each batch
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten counts of each batch to the console
wordCounts.print()
ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
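The same pipeline in PySpark's DStream API (still available in Spark 3.5) looks like this; a sketch, assuming it runs as a standalone script rather than inside a shell:

# Standalone PySpark version of the word-count stream
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")  # 2 threads: receiver + processing
ssc = StreamingContext(sc, 1)                      # 1-second batch interval

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()          # print each batch's counts
ssc.start()
ssc.awaitTermination()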

  • Title: Spark Docker opt
  • Author: Ordiy
  • Created at: 2023-02-18 17:08:18
  • Updated at: 2025-03-26 09:39:38
  • Link: https://ordiy.github.io/posts/2024-02-02-spark-docker-opt/
  • License: This work is licensed under CC BY 4.0.