
I apologize if this question is misplaced -- I'm not sure if this is more of a re (regex) question or a CountVectorizer question. I'm trying to exclude any would-be token that contains one or more digits.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> import pandas as pd
>>> docs = ['this is some text', '0000th', 'aaa more 0stuff0', 'blahblah923']   
>>> vec = CountVectorizer()
>>> X = vec.fit_transform(docs)
>>> pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
   0000th  0stuff0  aaa  blahblah923  is  more  some  text  this
0       0        0    0            0   1     0     1     1     1
1       1        0    0            0   0     0     0     0     0
2       0        1    1            0   0     1     0     0     0
3       0        0    0            1   0     0     0     0     0

What I want instead is this:

   aaa  is  more  some  text  this
0    0   1     0     1     1     1
1    0   0     0     0     0     0
2    1   0     1     0     0     0
3    0   0     0     0     0     0

My thought was to use CountVectorizer's token_pattern argument to supply a regex that matches runs of anything except digits:

>>> vec = CountVectorizer(token_pattern=r'[^0-9]+')

but because the negated class also matches spaces, entire runs of non-digit text come back as tokens:

   aaa more   blahblah  stuff  th  this is some text
0          0         0      0   0                  1
1          0         0      0   1                  0
2          1         0      1   0                  0
3          0         1      0   0                  0
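
Just to see what the pattern is actually matching, here is the bare re equivalent (this snippet is mine, for illustration):

>>> import re
>>> re.findall(r'[^0-9]+', 'aaa more 0stuff0')
['aaa more ', 'stuff']

The negated class happily consumes spaces along with letters, which is why whole phrases come back as single "tokens".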

Also, replacing the default pattern, (?u)\b\w\w+\b, obviously messes with the tokenizer's normal behavior, which I want to preserve.

What I really want is to use the normal token_pattern but apply a secondary screening of those tokens, keeping only the ones that consist strictly of letters. How can this be done? A sketch of what I mean follows.
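
One idea I had (a rough sketch; letters_only_analyzer is just a name I made up) is to wrap the analyzer that CountVectorizer builds and filter its output:

>>> base = CountVectorizer().build_analyzer()
>>> def letters_only_analyzer(doc):
...     # run the normal analyzer, then keep only all-letter tokens
...     return [tok for tok in base(doc) if tok.isalpha()]
...
>>> vec = CountVectorizer(analyzer=letters_only_analyzer)

but I'm hoping there's a way to do this with token_pattern alone.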


1 Answer


Found this SO post, which says to use the following regex (the trailing /g is a JavaScript-style global flag that Python's re doesn't use, so it's dropped below):

\b[^\d\W]+\b/g

yielding the following:

>>> vec = CountVectorizer(token_pattern=r'\b[^\d\W]+\b')
>>> X = vec.fit_transform(docs)
>>> pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
   aaa  is  more  some  text  this
0    0   1     0     1     1     1
1    0   0     0     0     0     0
2    1   0     1     0     0     0
3    0   0     0     0     0     0

What I needed in my regex were the \b word boundary anchors, of which I was not aware. That does make this a misplaced question, as it has nothing to do with data science or that discipline's tools (sklearn).
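
To spell out why the pattern works (my own quick check, not from the linked post): [^\d\W] means "a word character that is not a digit", i.e. a letter or underscore, and the \b anchors require the match to sit on word boundaries, so a token containing any digit fails to match at all:

>>> import re
>>> re.findall(r'\b[^\d\W]+\b', 'this is some text 0000th aaa 0stuff0 blahblah923')
['this', 'is', 'some', 'text', 'aaa']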
