Extract integers with specific length between separators

Question

Given a list of strings like:

L = ['1759@1@83@0#[email protected]@[email protected]#1094@[email protected]@14.4', 
     '[email protected]@[email protected]', 
     '[email protected]@[email protected]#1101@2@40@0#1108@2@30@0',
     '1430@[email protected]@2.15#1431@[email protected]@60.29#1074@[email protected]@58.8#1109',
     '1809@[email protected]@292.66#1816@[email protected]@95.44#1076@[email protected]@1110.61']

I need to extract all integers with length 4 between separators # or @, and also extract the first and last integers. No floats.

My solution is a bit overcomplicated: I replace the separators with spaces and then apply this pattern:

import re

pat = r'(?<!\S)\d{4}(?!\S)'
out = [re.findall(pat, re.sub('[#@]', ' ', x)) for x in L]
print(out)
"""
[['1759', '1362', '1094'], 
 ['1356'], 
 ['1354', '1101', '1108'], 
 ['1430', '1431', '1074', '1109'], 
 ['1809', '1816', '1076']]
"""

Is it possible to change the regex so that re.sub is not needed for the replacement? Is there another solution with better performance?

Your current code also matches integers at the edges of the string, even though those integers lack at least one separator. Is that desirable? — CertainPerformance, Mar 11 '19 at 07:27
@CertainPerformance - good question. Yes, I need the first and last 4-digit integers if they exist — jezrael, Mar 11 '19 at 07:29
By the new requirement, do you want `1110` to be matched too? — revo, Mar 11 '19 at 07:30
@revo - expected output is `out`, last `1110.61` is not matched, because float — jezrael, Mar 11 '19 at 07:31
If you had a number with trailing decimal but no decimal digits like `1110.`, would you want it included/excluded/don't-care? — smci, Mar 11 '19 at 07:54

revo · Accepted Answer · 2019-03-11T07:36:15.430

6

To allow first and last occurrences that have no leading or trailing separator, you could use negative lookarounds:

(?<![^#])\d{4}(?![^@])

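(?<![^#]) is a near synonym for (?:^|#): it succeeds when the preceding character is # or when there is no preceding character at all. The same applies to the negative lookahead: (?![^@]) succeeds when the next character is @ or when the string ends.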
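A minimal sketch of how the pattern behaves, assuming rows in the same layout as the question. The sample values and the names sample, pat and pat_any are made up for illustration, and pat_any is an assumed variant that is not part of the answer:

```python
import re

# Hypothetical rows in the same "#"/"@" layout; the values are made up.
sample = [
    "1759@1@83@0#1362@0.2@13.4#1094@1@14.4",
    "1430@2@2.15#1109",
    "1809@1@292.66#1076@3@1110.61",
]

# Pattern from the answer: the previous character is "#" or the start of
# the string, and the next character is "@" or the end of the string.
pat = re.compile(r"(?<![^#])\d{4}(?![^@])")
out = [pat.findall(row) for row in sample]
print(out)
# [['1759', '1362', '1094'], ['1430', '1109'], ['1809', '1076']]

# Assumed variant (not from the answer): accept either separator on both sides.
pat_any = re.compile(r"(?<![^#@])\d{4}(?![^#@])")
print(pat.findall("1101@2@4000#1108"))      # ['1101', '1108']
print(pat_any.findall("1101@2@4000#1108"))  # ['1101', '4000', '1108']
```

The answer's pattern relies on 4-digit integers always following # (or the start) and preceding @ (or the end). If a 4-digit integer can also sit after @ or before #, the symmetric character classes in pat_any may be the safer choice.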


edited Mar 11 '19 at 07:36

answered Mar 11 '19 at 07:27

revo


score 3 · Answer 2 · answered Mar 11 '19 at 08:22

Interesting problem!

This can be tackled with lookahead and lookbehind assertions.

INPUT

pattern = r"(?<!\.)(?<=[#@])\d{4}|(?<!\.)\d{4}(?=[@#])"
out = [re.findall(pattern, x) for x in L]
print(out)

OUTPUT

[['1759', '1362', '1094', '1234'],
 ['1356'],
 ['1354', '1101', '1108'],
 ['1430', '1431', '1074', '1109'],
 ['1809', '1816', '1076', '1110']]

EXPLANATION

The pattern above combines two alternatives joined by | (the OR operator).

pattern_1 = r"(?<!\.)(?<=[#@])\d{4}"
\d{4}     --- Match exactly 4 digits
(?<!\.)   --- The 4 digits must not be preceded by a period (.) NEGATIVE LOOKBEHIND
(?<=[#@]) --- The 4 digits must be preceded by a hash (#) or at sign (@) POSITIVE LOOKBEHIND

pattern_2 = r"(?<!\.)\d{4}(?=[@#])"
\d{4}     --- Match exactly 4 digits
(?<!\.)   --- The 4 digits must not be preceded by a period (.) NEGATIVE LOOKBEHIND
(?=[@#])  --- The 4 digits must be followed by a hash (#) or at sign (@) POSITIVE LOOKAHEAD
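Note that the first alternative only looks at what precedes the digits, so it also picks up the integer part of a float such as 1110.61, which the asker said in the comments should not match. The sketch below shows this on made-up sample rows and tries an assumed stricter variant; the names sample, answer_pat and strict_pat are hypothetical and the variant is not part of the answer:

```python
import re

# Hypothetical rows; the last one ends in a float with a 4-digit integer part.
sample = [
    "1759@1@83@0#1362@0.2@13.4#1094@1@14.4",
    "1430@2@2.15#1109",
    "1809@1@292.66#1076@3@1110.61",
]

# Pattern as posted in the answer.
answer_pat = re.compile(r"(?<!\.)(?<=[#@])\d{4}|(?<!\.)\d{4}(?=[@#])")
answer_out = [answer_pat.findall(row) for row in sample]
print(answer_out)
# [['1759', '1362', '1094'], ['1430', '1109'], ['1809', '1076', '1110']]

# Assumed stricter variant: additionally reject a "." or a digit right
# after (first alternative) or right before (second alternative) the match.
strict_pat = re.compile(r"(?<=[#@])\d{4}(?![.\d])|(?<![.\d])\d{4}(?=[@#])")
strict_out = [strict_pat.findall(row) for row in sample]
print(strict_out)
# [['1759', '1362', '1094'], ['1430', '1109'], ['1809', '1076']]
```

With the extra lookarounds, 1110 from 1110.61 is no longer returned, while the first and last integers of each row still are.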


score 2 · Answer 3 · answered Mar 11 '19 at 07:43

Here is a nested list comprehension that avoids regex, if you also want to count the 4-digit integers that lack a leading # or a trailing @:

[[n for o in p for n in o] for p in [[[m for m in k.split("@") if m.isdigit() and str(int(m))==m and len(m) ==4] for k in j.split("#")] for j in L]]

Output :

[['1759', '1362', '1094'], ['1356'], ['1354', '1101', '1108'], ['1430', '1431', '1074', '1109'], ['1809', '1816', '1076']]
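The same idea may be easier to read as a small helper. This is only a sketch under the assumption that every token between separators is either an integer or a float. The function name four_digit_ints and the sample rows are made up for illustration:

```python
def four_digit_ints(row):
    """Return the 4-character integer tokens of row, split on '#' and '@'."""
    # Normalise both separators to "@" so a single split is enough.
    tokens = row.replace("#", "@").split("@")
    # isdigit() rejects floats such as "13.4"; the str(int(t)) round trip
    # rejects tokens with leading zeros such as "0123".
    return [t for t in tokens
            if len(t) == 4 and t.isdigit() and str(int(t)) == t]


# Hypothetical rows in the same layout as the question.
sample = [
    "1759@1@83@0#1362@0.2@13.4#1094@1@14.4",
    "1430@2@2.15#1109",
    "1809@1@292.66#1076@3@1110.61",
]
out = [four_digit_ints(row) for row in sample]
print(out)
# [['1759', '1362', '1094'], ['1430', '1109'], ['1809', '1076']]
```

Splitting once on a normalised separator avoids the nested comprehension and the flattening step, while the filter conditions stay the same as in the one-liner.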
