Looks like you want to match one or more digits followed by one or more letters. I also decided to catch one or more letters followed by one or more digits.
from io import StringIO
import re

test_text = StringIO(
"""25 - not this number1
the cow just over the moon and the sun is in 1the sky
26 - not this number
5one day is soon and soon is near take 5529care, 30over and out
59 - not this number
The covers at near the back of the 59closet, and when found have them place on the each of the beds. However you see the pillow cases use the ones on the 9second shelve."""
)
pattern = re.compile(r"[0-9]+[a-zA-Z]+|[a-zA-Z]+[0-9]+")
for line in test_text:
    matches = re.findall(pattern, line)
    if matches:
        print(f"{line}Matches = {matches}\n")
    else:
        print(f"{line}No matches\n")
Output:
25 - not this number1
Matches = ['number1']
the cow just over the moon and the sun is in 1the sky
Matches = ['1the']
26 - not this number
No matches
5one day is soon and soon is near take 5529care, 30over and out
Matches = ['5one', '5529care', '30over']
59 - not this number
No matches
The covers at near the back of the 59closet, and when found have them place on the each of the beds. However you see the pillow cases use the ones on the 9second shelve.
Matches = ['59closet', '9second']
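To make the two halves of that pattern easier to see, here is a minimal sketch that runs it against a few standalone tokens. The sample tokens and the names samples and results are my own, picked only for illustration.

```python
import re

# Same pattern as above: digits then letters, OR letters then digits.
pattern = re.compile(r"[0-9]+[a-zA-Z]+|[a-zA-Z]+[0-9]+")

# Hypothetical sample tokens, chosen to hit each case once.
samples = ["59closet", "number1", "59", "closet", "59 closet"]
results = {s: pattern.findall(s) for s in samples}

for s, found in results.items():
    print(f"{s!r:12} -> {found}")
```

Only the tokens where digits and letters touch are reported. A bare number, a bare word, or a number separated from a word by a space gives an empty list.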
(Jul-25-2022, 05:27 AM) giddyhead wrote: Yeah looking for a way to get rid of the numbers attached to words using the code from web scrapping.
Okay. That code from your web scrapping, as you call it, is not in a style that I work with, as you may gather from what I've posted.
As for getting rid of the numbers attached to words, that is precisely the objective of my posted script, given that said number is prefixed.
Maybe this is a 'Language barrier' issue. Do you fully understand English?
edit for p.s
I've updated my code: I was unhappy with some of the object names I used (it was developed on the fly), and since I'll be adding this code to my notes, I've cleaned it up.
#!/usr/bin/python3
import re
def f_number(num):
    # Return the text after the last digit, or None if there is no digit.
    number = re.search(r'\d', num)
    if number:
        num_list = re.split(r'\d', num)
        return num_list[-1]

input_string = ""  # put your text in this string object
string_list = input_string.split(' ')
p_string = ''
for word in string_list:
    number = f_number(word)
    if not number:
        # No digit found, or nothing follows the digit: keep the word as is.
        p_string += word + ' '
    else:
        p_string += number + ' '
print(p_string)
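For anyone who wants to try that without editing input_string, here is a rough sketch of the same logic wrapped in a function and run on a sample sentence. The function name strip_leading_digits and the sample text are my own assumptions, not part of the script above.

```python
import re

def strip_leading_digits(text):
    """Hypothetical wrapper: same per-word logic as the script above."""
    words = []
    for word in text.split(' '):
        if re.search(r'\d', word):
            # Text after the last digit; empty if the word ends in a digit.
            tail = re.split(r'\d', word)[-1]
            words.append(tail if tail else word)
        else:
            words.append(word)
    return ' '.join(words)

cleaned = strip_leading_digits("take 5529care, 30over and out number1")
print(cleaned)  # take care, over and out number1
```

Note that number1 is left alone: the digit is a suffix there, so nothing follows it and the word is kept unchanged, which is why this approach only handles prefixed numbers.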
When you know what you want to replace, it is easy to remove the digits. This uses a comprehension to strip the digits from each match:
from io import StringIO
import re
test_text = StringIO(
"""25 - not this number1
the cow just over the moon and the sun is in 1the sky
26 - not this number
5one day is soon and soon is near take 5529care, 30over and out
59 - not this number
The covers at near the back of the 59closet, and when found have them place on the each of the beds. However you see the pillow cases use the ones on the 9second shelve."""
)
pattern = re.compile(r"[0-9]+[a-zA-Z]+|[a-zA-Z]+[0-9]+")
for line in test_text:
    matches = re.findall(pattern, line)
    if matches:
        for match in matches:
            line = line.replace(match, "".join([c for c in match if c not in '0123456789']))
    print(line.rstrip())
Output:
25 - not this number
the cow just over the moon and the sun is in the sky
26 - not this number
one day is soon and soon is near take care, over and out
59 - not this number
The covers at near the back of the closet, and when found have them place on the each of the beds. However you see the pillow cases use the ones on the second shelve.
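And this uses another regex to pull out just the letters. Only the inner loop changes:
        # stripper = re.compile(r"[a-zA-Z]+") is compiled once, before the loop over lines
        for match in matches:
            line = line.replace(match, re.findall(stripper, match)[0])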
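One thing to watch with the findall/replace loop: str.replace() swaps every occurrence of the matched text in the line, not just the one that was found. A small sketch of that, assuming a made-up line where one match is a substring of another (strip_with_replace is a hypothetical wrapper name of mine):

```python
import re

finder = re.compile(r"[0-9]+[a-zA-Z]+|[a-zA-Z]+[0-9]+")
stripper = re.compile(r"[a-zA-Z]+")

def strip_with_replace(line):
    """Hypothetical wrapper around the findall/replace loop."""
    for match in finder.findall(line):
        # replace() rewrites ALL occurrences of the matched text.
        line = line.replace(match, stripper.findall(match)[0])
    return line

normal = strip_with_replace("take 5529care, 30over")
tricky = strip_with_replace("1the 21the")
print(normal)  # take care, over
print(tricky)  # the 2the
```

In the second line, replacing "1the" also rewrites the tail of "21the", so the later match "21the" is no longer in the line and a stray 2 is left behind.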
But it is better to use re.sub(). Write a function that returns a digit-less version of the matching string. This function is the repl argument to the re.sub(pattern, repl, string) call.
from io import StringIO
import re
test_text = StringIO(
"""25 - not this number1
the cow just over the moon and the sun is in 1the sky
26 - not this number
5one day is soon and soon is near take 5529care, 30over and out
59 - not this number
The covers at near the back of the 59closet, and when found have them place on the each of the beds. However you see the pillow cases use the ones on the 9second shelve."""
)
stripper = re.compile(r"[a-zA-Z]+")
finder = re.compile(r"[0-9]+[a-zA-Z]+|[a-zA-Z]+[0-9]+")
def strip_digits(match):
    """This is the repl function used by re.sub()"""
    return re.findall(stripper, match.group())[0]

for line in test_text:
    print(re.sub(finder, strip_digits, line).rstrip())
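No output was posted for that version, so here is a minimal sketch of it on two single lines. The first is taken from the test text; the second, "1the 21the", is a made-up case of mine where one match is contained in another.

```python
import re

stripper = re.compile(r"[a-zA-Z]+")
finder = re.compile(r"[0-9]+[a-zA-Z]+|[a-zA-Z]+[0-9]+")

def strip_digits(match):
    """repl function: return only the letters of the matched text."""
    return stripper.findall(match.group())[0]

cleaned = finder.sub(strip_digits, "5one day is soon and soon is near take 5529care, 30over and out")
tricky = finder.sub(strip_digits, "1the 21the")
print(cleaned)  # one day is soon and soon is near take care, over and out
print(tricky)   # the the
```

Because re.sub() rewrites each match at its own position, overlapping text in one match cannot interfere with another, which is not guaranteed when substituting by substring with str.replace().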
(Jul-25-2022, 05:15 AM) giddyhead wrote: Instead of finding the whole thing can it be modified to only find the numbers only for example 5one, 5529care, 30over, etc? Thanks
If you add a capturing group, (), around the digits in deanhystad's code, then findall() will return only the numbers.
from io import StringIO
import re
test_text = StringIO(
"""25 - not this number1
the cow just over the moon and the sun is in 1the sky
26 - not this number
5one day is soon and soon is near take 5529care, 30over and out
59 - not this number
The covers at near the back of the 59closet, and when found have them place on the each of the beds. However you see the pillow cases use the ones on the 9second shelve."""
)
pattern = re.compile(r"([0-9]+)[a-zA-Z]+|[a-zA-Z]+([0-9]+)")
for line in test_text:
    matches = re.findall(pattern, line)
    if matches:
        print(f"{line}Matches = {matches}\n")
    else:
        print(f"{line}No matches\n")
Output:
25 - not this number1
Matches = [('', '1')]
the cow just over the moon and the sun is in 1the sky
Matches = [('1', '')]
26 - not this number
No matches
5one day is soon and soon is near take 5529care, 30over and out
Matches = [('5', ''), ('5529', ''), ('30', '')]
59 - not this number
No matches
The covers at near the back of the 59closet, and when found have them place on the each of the beds. However you see the pillow cases use the ones on the 9second shelve.Matches = [('59', ''), ('9', '')]
Clean up by keeping the first item of each tuple.
>>> Matches = [('5', ''), ('5529', ''), ('30', '')]
>>> [i[0] for i in Matches]
['5', '5529', '30']
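Taking i[0] works when the number is a prefix, but for a suffix such as number1 the digits land in the second group and i[0] is an empty string. A small sketch that picks whichever group is filled, assuming a made-up sample line (the names first, second and numbers are mine):

```python
import re

pattern = re.compile(r"([0-9]+)[a-zA-Z]+|[a-zA-Z]+([0-9]+)")

line = "5one day number1 and 30over"  # hypothetical sample line
matches = pattern.findall(line)
print(matches)  # [('5', ''), ('', '1'), ('30', '')]

# Exactly one group is non-empty per match, so "or" picks the filled one.
numbers = [first or second for first, second in matches]
print(numbers)  # ['5', '1', '30']
```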
Sorry did not get to post until now. Completed! Thank you all for your help, time and information.