I have a re.findall() call searching for a pattern in Python, but it returns some undesired results and I want to know how to exclude them. The text is below. I want to get the names, but my statement (re.findall(r'([A-Z]{4,} \w. \w*|[A-Z]{4,} \w*)', text)) returns this:

 'ERIN E. SCHNEIDER',
 'MONIQUE C. WINKLER',
 'JASON M. HABERMEYER',
 'MARC D. KATZ',
 'JESSICA W. CHAN',
 'RAHUL KOLHATKAR',
 'TSPU or taken',
 'TSPU or the',
 'TSPU only',
 'TSPU was',
 'TSPU and']

I want to get rid of the "TSPU" pattern items. Does anyone know how to do it?

JINA L. CHOI (NY Bar No. 2699718)

ERIN E. SCHNEIDER (Cal. Bar No. 216114) [email protected]

MONIQUE C. WINKLER (Cal. Bar No. 213031) [email protected]

JASON M. HABERMEYER (Cal. Bar No. 226607) [email protected]

MARC D. KATZ (Cal. Bar No. 189534) [email protected]

JESSICA W. CHAN (Cal. Bar No. 247669) [email protected]

RAHUL KOLHATKAR (Cal. Bar No. 261781) [email protected]

  1. The Investor Solicitation Process Generally Included a Face-to-Face Meeting, a Technology Demonstration, and a Binder of Materials [...]

2 Answers

You can use

\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?

Details:

  • \b - a word boundary (else, the regex may "catch" a part of a word that contains TSPU)
  • (?!TSPU\b) - a negative lookahead that fails the match if the string TSPU, followed by a non-word char or the end of the string, appears immediately to the right of the current location
  • [A-Z]{4,} - four or more uppercase ASCII letters
  • (?:(?:\s+\w\.)?\s+\w+)? - an optional occurrence of:
    • (?:\s+\w\.)? - an optional occurrence of one or more whitespaces, a word char and a literal . char
    • \s+ - one or more whitespaces
    • \w+ - one or more word chars.
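
The effect of the \b inside the lookahead can be checked on two short strings. This is only a sketch: the inputs "TSPU was" and "TSPUX was" are made up for illustration, and the variable names are not from the original answer.

```python
import re

pattern = re.compile(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?')

# Hypothetical inputs, chosen only to exercise the lookahead.
# TSPU as a whole word: the lookahead rejects the match.
whole_word = pattern.findall("TSPU was")
# TSPU is only the prefix of a longer word: TSPU\b does not match,
# so the negative lookahead passes and the word is kept.
longer_word = pattern.findall("TSPUX was")

print(whole_word)
print(longer_word)
```

So the lookahead excludes only the exact word TSPU, not every uppercase word that happens to begin with those letters.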

In Python, you can use

re.findall(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?', text)
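
For a quick end-to-end check, the call can be run on a few lines of the question's text. This is a minimal sketch: the sample is abridged, and the last sentence is invented here to stand in for the elided body that mentions TSPU.

```python
import re

# Abridged sample from the question. The final sentence is a made-up
# stand-in for the elided document body that contains "TSPU".
text = """JINA L. CHOI (NY Bar No. 2699718)
ERIN E. SCHNEIDER (Cal. Bar No. 216114)
RAHUL KOLHATKAR (Cal. Bar No. 261781)
Investors were told the TSPU was ready."""

# No capturing groups are used, so findall() returns the full matches.
names = re.findall(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?', text)
print(names)
```

On this sample the result contains the three names and no TSPU entry.
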
Wiktor Stribiżew

You can do some simple filtering with filter(). If your list is named results:

removed_TSPU_results = list(filter(lambda result: not result.startswith("TSPU"), results))
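
The same post-filtering idea can also be written as a list comprehension. This is a minimal sketch, assuming results holds the list returned by re.findall() in the question (abridged here); the name kept is only illustrative.

```python
# Abridged copy of the output shown in the question.
results = ['MARC D. KATZ', 'RAHUL KOLHATKAR', 'TSPU or taken', 'TSPU was']

# Keep every entry that does not begin with the unwanted prefix.
kept = [result for result in results if not result.startswith("TSPU")]
print(kept)
```

Note that this drops any entry starting with the letters TSPU, including longer words, whereas a regex lookahead with \b excludes only the exact word.
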
sean-7777