8

testPassengerId = test.select('PassengerId').map(lambda x: x.PassengerId)

I want to select the PassengerId column and make an RDD of it, but .select is not working. It fails with: 'RDD' object has no attribute 'select'

Sean Owen
dsl1990

4 Answers

4

You could try the following:

testPassengerID = test.select('PassengerID').rdd

This selects the PassengerID column and converts the result into an RDD. Note that select is a DataFrame method, so this works only if test is a DataFrame.

Stereo
user25409
3

'RDD' object has no attribute 'select'

This means that test is in fact an RDD and not a DataFrame (which you are assuming it to be). Either convert it to a DataFrame and then apply select, or do a map operation over the RDD.
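As a rough sketch of the two routes described above, the helper below handles both cases. The function name passenger_ids is hypothetical, and the sketch assumes that each element of the RDD is a Row-like object exposing a PassengerId attribute; if your RDD holds tuples or raw strings, the lambda needs to change accordingly.

```python
def passenger_ids(test):
    """Return an RDD of PassengerId values from a DataFrame or an RDD of Row-like objects.

    Hypothetical helper: assumes every row exposes a `PassengerId` attribute.
    """
    if hasattr(test, "select"):
        # DataFrame path: select the column, drop down to the underlying RDD
        # of Row objects, then unwrap each Row into the bare value.
        return test.select("PassengerId").rdd.map(lambda row: row.PassengerId)
    # RDD path: there is no `select`, so map over the elements directly.
    return test.map(lambda row: row.PassengerId)
```

With this, testPassengerId = passenger_ids(test) should work whether test turns out to be a DataFrame or an RDD of Rows.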

Please let me know if you need any help around this.

Shagun Sodhani
3

Assuming you have an RDD in which each row has the form (passenger_ID, passenger_name), you can do rdd.map(lambda x: x[0]). This works for a basic RDD.

If you use Spark's SQLContext, there are functions to select by column name.
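Both options can be sketched as follows. The function names and the column names ("PassengerId", "Name") are assumptions for illustration, and the second function relies on rdd.toDF, which PySpark makes available on RDDs once a SparkSession or SQLContext has been created.

```python
def ids_by_position(rdd):
    # Each element is assumed to be a tuple (passenger_id, passenger_name),
    # so the ID is the first item.
    return rdd.map(lambda x: x[0])


def ids_by_name(rdd, column_names=("PassengerId", "Name")):
    # Give the tuple fields names by converting the RDD to a DataFrame,
    # then select the wanted column by name.
    df = rdd.toDF(list(column_names))
    return df.select("PassengerId")
```

ids_by_position returns an RDD of bare values, while ids_by_name returns a one-column DataFrame; append .rdd to the latter if an RDD is needed.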

wabbit
0

If each element of your RDD is a dictionary, this is how it can be done using PySpark:

Define the fields you want to keep:

field_list = []

Create a function that keeps only those keys from a dict input:

def f(x):
    d = {}
    for k in x:
        if k in field_list:
            d[k] = x[k]
    return d

Then map that function over the RDD, where x is one RDD element:

rdd_subset = rdd.map(f)
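The same filtering can be written more compactly with a dict comprehension. The sketch below uses a hypothetical helper name (keep_fields) and made-up sample records, and runs on a plain Python list as a local stand-in for the RDD; with Spark you would pass the same function to rdd.map.

```python
def keep_fields(record, fields):
    """Return a copy of `record` containing only the keys listed in `fields`."""
    return {k: v for k, v in record.items() if k in fields}


# Hypothetical field names and sample data standing in for RDD elements.
field_list = {"PassengerId", "Name"}
records = [
    {"PassengerId": 1, "Name": "Ann", "Age": 30},
    {"PassengerId": 2, "Name": "Bob"},
]

# Local equivalent of: rdd.map(lambda rec: keep_fields(rec, field_list))
subset = [keep_fields(rec, field_list) for rec in records]
```

Using a set for field_list keeps the membership check cheap when many fields are listed.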

Jan Trienes