Python Forum

Hi,

I have a text that I'm trying to create a pattern for. I got this far:

import re

text = '''
image003.jpgimage/[email protected]:39:50truefalseimage001.jpgimage/[email protected]:39:50truefalseNJS Basson Snr Basson Familie Trust  Gothoma Diggings NJS.xlsxapplication/vnd.openxmlformats-officedocument.spreadsheetml.sheet41BB382125CA54438BDE8428DBDF8B6C@mns.internal.co.za868522019-04-01T12:39:50falsefalseNJS BASSON SNR.PDFapplication/pdfEBA76A0D4619594DB6951BE861F9FF9E@mns.internal.co.za3638912019-04-01T12:39:50falsefalse4798442019-04-01T12:39:31Z2019-04-01T12:39:[email protected] EscalationAgriCCC.Escalation@santam.co.zaSMTPMailboxtruetrueEternity [email protected] 000000000
ZZZZ
'''

pattern = re.compile(r'image[\d]+.+')
matches = pattern.finditer(text)

for match in matches:
    print(match)

Output:

"<re.Match object; span=(681, 1413), match='image003.jpgimage/[email protected]>"

I'm looking for an expression that can pick up anything similar to, close to, or exactly like that. Please help.

Try this pattern: re.findall(r'image\d+\.jpg', text)

Hi,
I tried findall:

for line in textonly2:
    results = re.findall(r'image\d+\.jpg', line)
    print(results)

Output:

[]
[]
[]
[]
[]
[]

Empty list

I have removed the punctuation, but I still need code that will pull any sentence containing image:

image003jpgimagejpegimage003jpg01D4E56EA5A4A0E0813720190401T123950truefalseimage001jpgimagejpegimage001jpg01D4E898B7AA95B0813720190401T123950truefalseNJS Basson Snr Basson Familie Trust Gothoma Diggings NJSxlsxapplicationvndopenxmlformatsofficedocumentspreadsheetmlsheet41BB382125CA54438BDE8428DBDF8B6Cmnsinternalcoza8685220190401T123950falsefalseNJS BASSON SNRPDFapplicationpdfEBA76A0D4619594DB6951BE861F9FF9Emnsinternalcoza36389120190401T123950falsefalse47984420190401T123931Z20190401T123950ZtruefalseLandbouLandbousantamcozaSMTPMailboxAgriculture EscalationAgriCCCEscalationsantamcozaSMTPMailboxtruetrueEternity ClaimseternityclaimskoshcomcozaSMTPOneOfffalse 000000000 ZZZZ

There are white spaces within my text. Whenever the code finds an image sentence, it needs to remove it. This is the text that should be removed from the above:

image003jpgimagejpegimage003jpg01D4E56EA5A4A0E0813720190401T123950truefalseimage001jpgimagejpegimage001jpg01D4E898B7AA95B0813720190401T123950truefalseNJS
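A note on the empty lists above, offered as a guess rather than a confirmed diagnosis: the suggested pattern image\d+\.jpg requires a literal dot before jpg, and the punctuation-stripped text no longer contains one (image003jpg). A minimal sketch, assuming the lines in textonly2 look like the stripped text shown here. The sample string is a shortened copy of it, and the names strict and relaxed are only illustrative:

```python
import re

# Shortened copy of the punctuation-stripped text from the post (assumption:
# the lines being searched look like this).
line = (
    "image003jpgimagejpegimage003jpg01D4E56EA5A4A0E0813720190401T123950truefalse"
    "image001jpgimagejpegimage001jpg01D4E898B7AA95B0813720190401T123950truefalseNJS "
    "Basson Snr Basson Familie Trust"
)

# The suggested pattern needs a literal "." before "jpg".
strict = re.findall(r'image\d+\.jpg', line)

# Same pattern with the dot made optional, so "image003jpg" also matches.
relaxed = re.findall(r'image\d+\.?jpg', line)

print(strict)   # []
print(relaxed)  # ['image003jpg', 'image003jpg', 'image001jpg', 'image001jpg']
```

Making the dot optional with \.? lets one pattern cover both the original text and the stripped text.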

It is somewhat unclear what must be accomplished here. Isn't regex too complicated a solution for achieving the desired result? To get rid of the chunks containing 'jpg' in a row / sentence, one can just:

In [1]: sentence = "image003jpgimagejpegimage003jpg01D4E56EA5A4A0E0813720190401T123950truefalseimage001jpgim
   ...: agejpegimage001jpg01D4E898B7AA95B0813720190401T123950truefalseNJS Basson Snr Basson Familie Trust Go
   ...: thoma Diggings NJSxlsxapplicationvndopenxmlformatsofficedocumentspreadsheetmlsheet41BB382125CA54438B
   ...: DE8428DBDF8B6Cmnsinternalcoza8685220190401T123950falsefalseNJS BASSON SNRPDFapplicationpdfEBA76A0D46
   ...: 19594DB6951BE861F9FF9Emnsinternalcoza36389120190401T123950falsefalse47984420190401T123931Z20190401T1
   ...: 23950ZtruefalseLandbouLandbousantamcozaSMTPMailboxAgriculture EscalationAgriCCCEscalationsantamcozaS
   ...: MTPMailboxtruetrueEternity ClaimseternityclaimskoshcomcozaSMTPOneOfffalse 000000000 ZZZZ"           

In [2]: ' '.join(chunk for chunk in sentence.split() if 'jpg' not in chunk)                                 
Out[2]: 'Basson Snr Basson Familie Trust Gothoma Diggings NJSxlsxapplicationvndopenxmlformatsofficedocumentspreadsheetmlsheet41BB382125CA54438BDE8428DBDF8B6Cmnsinternalcoza8685220190401T123950falsefalseNJS BASSON SNRPDFapplicationpdfEBA76A0D4619594DB6951BE861F9FF9Emnsinternalcoza36389120190401T123950falsefalse47984420190401T123931Z20190401T123950ZtruefalseLandbouLandbousantamcozaSMTPMailboxAgriculture EscalationAgriCCCEscalationsantamcozaSMTPMailboxtruetrueEternity ClaimseternityclaimskoshcomcozaSMTPOneOfffalse 000000000 ZZZZ'
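For comparison, here is a regex version of the same idea, as a sketch only. It assumes that an "image sentence" means a whitespace-delimited chunk that starts with image followed by digits. drop_image_chunks is a made-up helper name, and the sample is shortened from the text above:

```python
import re

def drop_image_chunks(sentence):
    """Remove whitespace-delimited chunks that start with 'image<digits>'."""
    # (?<!\S)  : the chunk must start the string or follow whitespace
    # image\d+ : the literal word "image" followed by one or more digits
    # \S*      : the rest of the chunk, up to the next whitespace
    # \s*      : the whitespace that followed the removed chunk
    cleaned = re.sub(r'(?<!\S)image\d+\S*\s*', '', sentence)
    return cleaned.strip()

# Shortened sample based on the punctuation-stripped text in the post.
sentence = (
    "image003jpgimagejpegimage003jpg01D4E56EA5A4A0E0813720190401T123950truefalse"
    "image001jpgimagejpegimage001jpg01D4E898B7AA95B0813720190401T123950truefalseNJS "
    "Basson Snr Basson Familie Trust Gothoma Diggings 000000000 ZZZZ"
)

print(drop_image_chunks(sentence))
# Basson Snr Basson Familie Trust Gothoma Diggings 000000000 ZZZZ
```

Two differences from the split/join version are worth knowing: split() followed by ' '.join(...) collapses every run of whitespace into a single space, while re.sub leaves the remaining spacing alone; and the 'jpg' in chunk test drops any chunk that contains jpg anywhere, while this pattern only drops chunks that begin with image and digits.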

I modified your code and got this:

Output:
andre@andre-GP70-2PE:~/Schreibtisch$ python matches.py
Match Object: <re.Match object; span=(1, 13), match='image003.jpg'>
Filename: image003.jpg
Start-IDX: 1 Stop-IDX: 13
Match Object: <re.Match object; span=(23, 35), match='image003.jpg'>
Filename: image003.jpg
Start-IDX: 23 Stop-IDX: 35
Match Object: <re.Match object; span=(85, 97), match='image001.jpg'>
Filename: image001.jpg
Start-IDX: 85 Stop-IDX: 97
Match Object: <re.Match object; span=(107, 119), match='image001.jpg'>
Filename: image001.jpg
Start-IDX: 107 Stop-IDX: 119

import re
 
text = '''
image003.jpgimage/[email protected]:39:50truefalseimage001.jpgimage/[email protected]:39:50truefalseNJS Basson Snr Basson Familie Trust  Gothoma Diggings NJS.xlsxapplication/vnd.openxmlformats-officedocument.spreadsheetml.sheet41BB382125CA54438BDE8428DBDF8B6C@mns.internal.co.za868522019-04-01T12:39:50falsefalseNJS BASSON SNR.PDFapplication/pdfEBA76A0D4619594DB6951BE861F9FF9E@mns.internal.co.za3638912019-04-01T12:39:50falsefalse4798442019-04-01T12:39:31Z2019-04-01T12:39:[email protected] EscalationAgriCCC.Escalation@santam.co.zaSMTPMailboxtruetrueEternity [email protected] 000000000
ZZZZ
'''
 
pattern = re.compile(r'image\d+\.jpg')
matches = pattern.finditer(text)
 
for match in matches:
    print('Match Object:', match) # the raw match object
    print('Filename:', match.group()) # match.group() -> result
    start, stop = match.span() # index of current match in text
    print('Start-IDX:', start, 'Stop-IDX:', stop)
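Since each filename shows up twice in the output above, a possible follow-up is to collect every name only once. This is a sketch, not part of the code above: sample is a made-up, shortened stand-in for the real text, and unique_names is an illustrative name.

```python
import re

# Hypothetical shortened stand-in for the real text (not the original data).
sample = (
    "image003.jpg image/jpeg image003.jpg truefalse "
    "image001.jpg image/jpeg image001.jpg truefalse NJS.xlsx"
)

pattern = re.compile(r'image\d+\.jpg')

# findall returns the matched strings directly; dict.fromkeys removes
# duplicates while keeping the order of first appearance (Python 3.7+).
unique_names = list(dict.fromkeys(pattern.findall(sample)))

print(unique_names)  # ['image003.jpg', 'image001.jpg']
```

If the positions matter as well, finditer with match.span() as shown above is still the way to go; findall is enough when only the names are needed.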

stahorse

DeaD_EyE

stahorse

stahorse

perfringo

DeaD_EyE