How do I create tuples from the existing RDD below?

// reading a text file "b.txt" and creating RDD 
val rdd = sc.textFile("/home/training/desktop/b.txt") 

The b.txt dataset:

 Ankita,26,BigData,newbie
 Shikha,30,Management,Expert

If you want an Array[Tuple4], you can do the following:

scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24

scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
arrayTuples: Array[(String, String, String, String)] = Array((" Ankita",26,BigData,newbie), (" Shikha",30,Management,Expert))

You can then access each field of the tuples:

scala> arrayTuples.map(x => println(x._3))
BigData
Management
res4: Array[Unit] = Array((), ())
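
The first field above comes out with a leading space because the sample rows in b.txt start with one. A minimal sketch of trimming each field before building the tuple follows. The object name `TrimSketch` and the helper `toTuple4` are hypothetical (they are not part of the original answer), and a plain `Seq` stands in for the RDD so the snippet runs without Spark.

```scala
object TrimSketch {
  // Hypothetical helper: split a line on commas, trim each field,
  // and pack the first four fields into a Tuple4.
  // Assumes every line has at least four fields.
  def toTuple4(line: String): (String, String, String, String) = {
    val f = line.split(",").map(_.trim)
    (f(0), f(1), f(2), f(3))
  }

  // A plain Seq stands in for the RDD here; with Spark the equivalent
  // call would be rdd.map(line => toTuple4(line)).collect
  val lines = Seq(" Ankita,26,BigData,newbie", " Shikha,30,Management,Expert")
  val arrayTuples: Array[(String, String, String, String)] =
    lines.map(line => toTuple4(line)).toArray
}
```

With the fields trimmed, comparisons such as `x._1 == "Ankita"` behave as expected.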

Update

If your input file has rows of varying size:

Ankita,26,BigData,newbie
Shikha,30,Management,Expert
Anita,26,big

you can write a match/case pattern match:

scala> val arrayTuples = rdd.map(line => line.split(",") match {
     | case Array(a, b, c, d) => (a,b,c,d)
     | case Array(a,b,c) => (a,b,c)
     | }).collect
arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
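
Note that the match above only covers rows with three or four fields, so a row of any other size would throw a `scala.MatchError`, and the element type is inferred as `Product with Serializable`. One possible alternative, sketched here under the assumption that malformed rows can simply be skipped, is to return an `Option` and keep a single tuple type. The names `SafeMatchSketch` and `parse` are hypothetical, and a plain `Seq` stands in for the RDD.

```scala
object SafeMatchSketch {
  // Hypothetical helper: Some(tuple) for rows with exactly four fields,
  // None for every other row, so the match is exhaustive.
  def parse(line: String): Option[(String, String, String, String)] =
    line.split(",").map(_.trim) match {
      case Array(a, b, c, d) => Some((a, b, c, d))
      case _                 => None
    }

  val lines = Seq(
    "Ankita,26,BigData,newbie",
    "Shikha,30,Management,Expert",
    "Anita,26,big"
  )

  // flatMap drops the None results; with Spark the equivalent call
  // would be rdd.flatMap(line => parse(line)).collect
  val parsed: Seq[(String, String, String, String)] =
    lines.flatMap(line => parse(line))
}
```

Here the three-field row is dropped rather than kept as a Tuple3; whether that is acceptable depends on the data.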

Updated again

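As @eliasah pointed out, the approach above is bad practice because it relies on an inferred Product type. Following his suggestion, we should know the maximum number of elements per row in the input data and use the following logic, which assigns a default value to any missing element: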

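import scala.util.Try

// A missing field falls back to its default, so every row becomes a Tuple4.
// The Int default for field 2 widens its type to Any; use "0" to keep a String.
val arrayTuples = rdd.map(line => line.split(",")).map(array => (
  Try(array(0)).getOrElse("Empty"),
  Try(array(1)).getOrElse(0),
  Try(array(2)).getOrElse("Empty"),
  Try(array(3)).getOrElse("Empty")
)).collect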

As @philantrovert pointed out, if we are not using the REPL, we can verify the output as follows:

arrayTuples.foreach(println)

The result is

(Ankita,26,BigData,newbie)
(Shikha,30,Management,Expert)
(Anita,26,big,Empty)
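
As a variation on the default-value idea, the sketch below pads short rows with `padTo` instead of wrapping each index in `Try`. This is an alternative technique, not what the answer above uses. The names `PadSketch` and `toTuple4` and the `default` parameter are hypothetical, and a plain `Seq` stands in for the RDD.

```scala
object PadSketch {
  // Hypothetical helper: split and trim the fields, pad the array to
  // four entries with a default value, then build the Tuple4.
  // Fields beyond the fourth are ignored.
  def toTuple4(line: String, default: String = "Empty"): (String, String, String, String) = {
    val f = line.split(",").map(_.trim).padTo(4, default)
    (f(0), f(1), f(2), f(3))
  }

  val lines = Seq(
    "Ankita,26,BigData,newbie",
    "Shikha,30,Management,Expert",
    "Anita,26,big"
  )

  // With Spark the equivalent call would be
  // rdd.map(line => toTuple4(line)).collect
  val result: Seq[(String, String, String, String)] =
    lines.map(line => toTuple4(line))
}
```

Every element stays a `String`, so the result has the single type `(String, String, String, String)`.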

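What tuples do you expect? Give some samples. :)

Expected tuples -- ((Ankita,26,BigData,newbie),(Shikha,30,Management,Expert))

You want an array of tuples, right?

Yes.. that would work fine too :)

Thanks a lot. It works. Can we create an array dynamically when different rows have different numbers of columns?

Please see my updated answer :) and thanks for the upvote and the accept :)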

Inferring Product is a bad practice, @RameshMaharjan. Personally I'd go with a default value (maybe null or an empty string) instead and keep it as a Tuple4.

You can also use foreach with println. Not saying map is wrong, but it's a nice practice to use foreach to denote that this piece of code has a side effect.