How to select a particular column in Spark (PySpark)?

Question

testPassengerId = test.select('PassengerId').map(lambda x: x.PassengerId)

I want to select the PassengerId column and make an RDD of it, but .select is not working. It says: 'RDD' object has no attribute 'select'

You can access columns pandas-style using DataFrame notation. — Emre, Jan 03 '16 at 04:34

score 4 · Answer 1 · edited Oct 20 '16 at 09:24

You could try the following:

testPassengerId = test.select('PassengerId').rdd

This selects the PassengerId column and converts it to an RDD. It assumes test is a DataFrame; each element of the resulting RDD is a Row, not a bare value.
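
Since .rdd yields Row objects rather than bare values, one extra map gets you plain IDs. A minimal sketch, assuming test really is a DataFrame; the helper names, the sample rows, and the local SparkSession in demo() are made up for illustration:

```python
def unwrap_passenger_id(row):
    """Return the bare PassengerId value held in a single Row."""
    # A Row supports lookup by column name, much like a dict.
    return row["PassengerId"]


def passenger_id_rdd(df):
    """Select one column of a DataFrame and return it as an RDD of plain values."""
    # select() gives a one-column DataFrame; .rdd exposes it as an RDD of Rows.
    return df.select("PassengerId").rdd.map(unwrap_passenger_id)


def demo():
    """Run the helpers on made-up rows; needs a local PySpark installation."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("select-column").getOrCreate()
    test = spark.createDataFrame([(892, "Kelly"), (893, "Wilkes")], ["PassengerId", "Name"])
    print(passenger_id_rdd(test).collect())
    spark.stop()
```

Calling demo() runs the whole thing against a local Spark; passenger_id_rdd(test) is the only part you would need in your own code.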

edited Oct 20 '16 at 09:24

Stereo

answered Oct 20 '16 at 02:25

user25409

One issue with this is that you get a Row back, so you might then have to do what @wabbit suggests. – groceryheist Mar 13 '19 at 20:31

score 3 · Answer 2 · answered May 18 '16 at 09:52

'RDD' object has no attribute 'select'

This means that test is in fact an RDD and not a DataFrame, as you are assuming. Either convert it to a DataFrame and then apply select, or do a map operation over the RDD.
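
A minimal sketch of the first option, assuming test is an RDD of (PassengerId, Name) tuples. The column names, the sample rows, and the helper names are assumptions, not something taken from the question:

```python
# Assumed column names for the tuple positions; adjust to your data.
COLUMNS = ["PassengerId", "Name"]


def select_passenger_id(spark, rdd):
    """Turn an RDD of tuples into a DataFrame and select a single column."""
    # Naming the tuple positions is what makes select-by-name possible.
    df = spark.createDataFrame(rdd, COLUMNS)
    # Unlike an RDD, a DataFrame does have .select.
    return df.select("PassengerId")


def demo():
    """Run the helper on made-up rows; needs a local PySpark installation."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("rdd-to-df").getOrCreate()
    test = spark.sparkContext.parallelize([(892, "Kelly"), (893, "Wilkes")])
    select_passenger_id(spark, test).show()
    spark.stop()
```

If you need an RDD again afterwards, append .rdd to the selected DataFrame.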

Please let me know if you need any help around this.

answered May 18 '16 at 09:52

Shagun Sodhani

Is it possible to select multiple columns? – Kent Wong Nov 27 '19 at 03:53
Yes it is :) You could use df.select(*list_of_columns_to_select) – Shagun Sodhani Nov 27 '19 at 12:06
So you must use a DataFrame then? Not possible with just an RDD? – Kent Wong Nov 27 '19 at 16:49
I think there is no way, but my knowledge of RDDs is rusty now :) – Shagun Sodhani Nov 27 '19 at 22:58

score 3 · Answer 3 · answered May 18 '16 at 11:11

Assuming you have an RDD in which each row has the form (passenger_ID, passenger_name), you can do rdd.map(lambda x: x[0]). This is for a basic RDD.

If you use Spark's SQLContext, there are functions to select by column name.
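
The positional map also works for more than one field on a plain RDD: return a tuple of the positions you want. A rough sketch; pick_fields, first_field, and the three-field sample record are hypothetical:

```python
def first_field(record):
    """Same as lambda x: x[0] from the answer, written as a named function."""
    return record[0]


def pick_fields(*positions):
    """Build a function that keeps only the given positions of a tuple record."""
    def pick(record):
        return tuple(record[i] for i in positions)
    return pick


# Plain-Python check on one made-up record (passenger_ID, passenger_name, age):
record = (892, "Kelly", 34.5)
print(first_field(record))
print(pick_fields(0, 2)(record))
```

On an actual RDD this would be rdd.map(first_field) for one field, or rdd.map(pick_fields(0, 2)) for several.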

answered May 18 '16 at 11:11

wabbit

score 0 · Answer 4 · edited Nov 27 '17 at 16:26

If the records in your RDD are dictionaries, this is how it can be done in PySpark.

Define the fields you want to keep here:

field_list = []

Create a function that keeps only those keys from a dict input:

def f(x):
    d = {}
    for k in x:
        if k in field_list:
            d[k] = x[k]
    return d

Then just map it over the RDD, with x being one RDD row:

rdd_subset = rdd.map(f)
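
For illustration, here is the same idea with the list filled in and tried on a single record outside Spark. The field names and the sample record are invented, and keep_fields is just a compact variant of f:

```python
# Hypothetical field names; put your own keys here.
FIELDS = frozenset(["PassengerId", "Name"])


def keep_fields(record):
    """Return a copy of record containing only the keys listed in FIELDS."""
    return {key: value for key, value in record.items() if key in FIELDS}


# Plain-Python check with a made-up record:
sample = {"PassengerId": 892, "Name": "Kelly", "Age": 34.5}
print(keep_fields(sample))
```

With an RDD of dicts this becomes rdd_subset = rdd.map(keep_fields).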
