In my HTML file I have these lines:
<div class="color-black mt-lg-0" id="hidden">, in</div>
<a href="https://neculaifantanaru.com/en/leadership-pro.html" title="View all articles from Leadership Pro" class="color-green font-weight-600 mx-1" id="hidden">Leadership Pro</a>
I use this regex: ^\s*<a href="(.*?)" title="View
in order to find this link:
https://neculaifantanaru.com/en/leadership-pro.html
In Notepad++ the regex search works fine.
The problem is in Python.
FIND: (on line 18)
b_content = re.search('^\s*<a href="(.*?)" title="View', new_file_content).group(1)
REPLACE:
old_file_content = re.sub(', in <a href="(.*?)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)
This gives me the following error on line 18:
Traceback (most recent call last):
File "<module2>", line 18, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
I also tried changing that line to:
b_content = re.match(r'^\s*<a href="(.*?)" title="View', new_file_content).group(1)
but I get the same error.
Doing regex on HTML is super annoying. Use an HTML parser instead (like beautifulsoup).
Your regex is anchored at the front of the string. If your new_file_content contains the entire file, then the match will fail. When I try your command, but with only the second line in that variable, it matches.
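A quick way to see the anchoring effect is to run the same pattern against the whole snippet and against the link line alone. This is only a minimal sketch: the names `whole_file` and `link_line` are made up for illustration, and the HTML is the two-line sample from the question.

```python
import re

# The two-line sample from the question, as it would look after file.read().
whole_file = (
    '<div class="color-black mt-lg-0" id="hidden">, in</div>\n'
    '<a href="https://neculaifantanaru.com/en/leadership-pro.html" '
    'title="View all articles from Leadership Pro" '
    'class="color-green font-weight-600 mx-1" id="hidden">Leadership Pro</a>\n'
)
link_line = whole_file.splitlines()[1]

pattern = r'^\s*<a href="(.*?)" title="View'

# Without flags, ^ matches only at the very start of the string,
# and the string starts with <div ...>, so the search returns None.
print(re.search(pattern, whole_file))

# On the link line alone, the start of the string is the start of the link.
print(re.search(pattern, link_line).group(1))
```

The first call returning None is what produces the AttributeError once .group(1) is chained onto it.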
(Jun-28-2023, 07:42 AM) Gribouillis wrote:
(Jun-28-2023, 07:32 AM) bowlofred wrote: Your regex is anchored at the front of the string.
The regex multiline mode (?m) could do the trick.
Hello. Can you update my code so I can understand it better?
The old classic read.
from bs4 import BeautifulSoup
import re
html = '''\
<div class="color-black mt-lg-0" id="hidden">, in</div>
<a href="https://neculaifantanaru.com/en/leadership-pro.html" title="View all articles from Leadership Pro" class="color-green font-weight-600 mx-1" id="hidden">Leadership Pro</a>
'''
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a').get('href')
print(link)
Output:
https://neculaifantanaru.com/en/leadership-pro.html
If you are wondering about a working regex, here is one, but as the link says, you should not use regex with HTML/XML.
It can work on a small fragment like this one, but it can and will blow up with errors on larger HTML.
>>> import re
>>>
>>> b_content = re.search(r"<a href=\"(.*?)\"", html).group(1)
>>> b_content
'https://neculaifantanaru.com/en/leadership-pro.html'
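To illustrate how that loose pattern can go wrong on a bigger page, here is a small sketch. The document `bigger_html` and its extra "Home" link are invented for the example. The pattern simply returns whichever `<a href=` comes first, which is not the link that was wanted.

```python
import re

# Made-up page: one unrelated link placed before the one we care about.
bigger_html = '''\
<a href="https://example.com/home.html" title="Home">Home</a>
<div class="color-black mt-lg-0" id="hidden">, in</div>
<a href="https://neculaifantanaru.com/en/leadership-pro.html" title="View all articles from Leadership Pro">Leadership Pro</a>
'''

# The loose pattern grabs the first link in the document.
first = re.search(r'<a href="(.*?)"', bigger_html).group(1)
print(first)

# Tying the pattern to the title text narrows it down, but it still
# depends on href coming directly before title inside the tag.
wanted = re.search(r'<a href="([^"]*)" title="View', bigger_html).group(1)
print(wanted)
```

Even the narrower pattern breaks as soon as the attributes are written in a different order, which is the kind of fragility a parser avoids.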
import re
# Read the contents of new-file.html
with open('c:/Folder7/new-file.html', 'r') as file:
    first_code = file.read()
# Read the contents of old-file.html
with open('c:/Folder7/old-file.html', 'r') as file:
    second_code = file.read()
# Extract the URL from first_code
match = re.search('<a href="(.*?)" title="View all articles', first_code)
if match is not None:
    url = match.group(1)
    # Replace the URL in second_code
    second_code = re.sub(', in <a href=".*?" title="Vezi toate', f', in <a href="{url}" title="Vezi toate', second_code)
    # Write the modified content back to old-file.html
    with open('c:/Folder7/old-file.html', 'w') as file:
        file.write(second_code)
else:
    print("No match found")
(Jun-28-2023, 07:50 AM) Melcu54 wrote: Can you update my code so I can understand it better?
Add (?m) at the beginning of the regex, as specified in the re.MULTILINE documentation. It is very useful to read the documentation.
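Applied to the line from the question, that suggestion might look like the sketch below. It shows the inline (?m) form next to the equivalent re.MULTILINE argument. Here `new_file_content` is a stand-in string holding the sample HTML, not the real file.

```python
import re

# Stand-in for the file contents (the sample from the question).
new_file_content = (
    '<div class="color-black mt-lg-0" id="hidden">, in</div>\n'
    '<a href="https://neculaifantanaru.com/en/leadership-pro.html" '
    'title="View all articles from Leadership Pro" '
    'class="color-green font-weight-600 mx-1" id="hidden">Leadership Pro</a>\n'
)

# Inline flag: (?m) goes at the very start of the pattern.
m1 = re.search(r'(?m)^\s*<a href="(.*?)" title="View', new_file_content)

# Same behaviour with the flag passed as an argument.
m2 = re.search(r'^\s*<a href="(.*?)" title="View', new_file_content, re.MULTILINE)

# Keep a None check: calling .group(1) on a failed search raises AttributeError.
b_content = m1.group(1) if m1 is not None else None
print(b_content)
```

With the flag set, ^ matches at the start of every line, so the link on the second line is found.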
As advised, no regex 🔨 with HTML/XML.
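from bs4 import BeautifulSoup
# file.html holds the HTML snippet with the link
with open('file.html') as file:
    first_code = file.read()
# old-file.html is assumed here to hold only the replacement URL (see the output)
with open('old-file.html') as file:
    second_code = file.read().strip()
soup = BeautifulSoup(first_code, 'html.parser')
link = soup.find('a')  # first <a> tag in the document
link['href'] = second_code
# Note: this overwrites old-file.html with the modified HTML
with open('old-file.html', 'w') as file:
    file.write(soup.prettify())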
Output:
<div class="color-black mt-lg-0" id="hidden">
, in
</div>
<a class="color-green font-weight-600 mx-1" href="https://python-forum.io" id="hidden" title="View all articles from Leadership Pro">
Leadership Pro
</a>
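On a real page, soup.find('a') returns the first link in the document, which may not be the one wanted. As a sketch of selecting the link by its title text instead, here is a version that uses the standard library's html.parser rather than BeautifulSoup (a swap made only so the example runs without extra installs). The class name `TitledLinkFinder` and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class TitledLinkFinder(HTMLParser):
    """Remember the href of the first <a> whose title starts with a prefix."""

    def __init__(self, title_prefix):
        super().__init__()
        self.title_prefix = title_prefix
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        attr_map = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        title = attr_map.get("title") or ""
        if title.startswith(self.title_prefix):
            self.href = attr_map.get("href")


# Made-up page: an unrelated link first, and the wanted link with
# title written before href to show that attribute order does not matter.
page = '''\
<a href="https://example.com/home.html" title="Home">Home</a>
<div class="color-black mt-lg-0" id="hidden">, in</div>
<a title="View all articles from Leadership Pro" href="https://neculaifantanaru.com/en/leadership-pro.html">Leadership Pro</a>
'''

finder = TitledLinkFinder("View all articles")
finder.feed(page)
print(finder.href)
```

Because the parser hands over attributes by name, the lookup does not care where href sits inside the tag or how the lines are broken.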
SOLUTION 1:
FIND: b_content = re.search('^\s*<a href="(.*?)" title="View', new_file_content).group(1)
REPLACE: old_file_content = re.sub(', in <a href="(.*?)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)
SOLUTION 2:
FIND: b_content = re.match(r'^\s*<a href="(.*?)" title="View', new_file_content).group(1)
REPLACE: old_file_content = re.sub(', in <a href="(.*?)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)
SOLUTION 3:
import re
# re.match only tries the very start of the string
b_content = re.match(r'^\s*<a href="(.*?)" title="View', new_file_content)
if b_content is not None:
    b_content = b_content.group(1)
else:
    b_content = "No match found"
SOLUTION 4:
import re
# Without re.MULTILINE, ^ matches only at the start of the string
match = re.search(r'^\s*<a href="(.*?)" title="View', new_file_content)
if match is not None:
    b_content = match.group(1)
    old_file_content = re.sub(', in <a href="(.*?)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)
else:
    print("No match found")
SOLUTION 5: (use re.MULTILINE)
import re
# re.MULTILINE lets ^ match at the start of every line, not only the string
match = re.search(r'^\s*<a href="(.*?)" title="View', new_file_content, re.MULTILINE)
if match is not None:
    b_content = match.group(1)
    old_file_content = re.sub(', in <a href="([^"]*)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)
else:
    print("No match found")
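For anyone who wants to try the multiline approach on plain strings before touching any files, here is a sketch that wraps it in a function. The function name `copy_link` and both sample strings are invented for illustration, and the old-file sample assumes the `, in <a href=... title="Vezi` layout implied by the replace pattern. The replacement is a function rather than an f-string so that a backslash in the URL is not interpreted by re.sub.

```python
import re

FIND = re.compile(r'^\s*<a href="([^"]*)" title="View', re.MULTILINE)
REPLACE = re.compile(r'(, in <a href=")[^"]*(" title="Vezi)')


def copy_link(new_file_content, old_file_content):
    """Return old_file_content with its 'Vezi' link pointing at the 'View' URL."""
    match = FIND.search(new_file_content)
    if match is None:
        return old_file_content  # nothing found, leave the old content alone
    url = match.group(1)
    # Keep the text around the href and swap only the URL itself.
    return REPLACE.sub(lambda m: m.group(1) + url + m.group(2), old_file_content)


new_sample = (
    '<div class="color-black mt-lg-0" id="hidden">, in</div>\n'
    '<a href="https://neculaifantanaru.com/en/leadership-pro.html" '
    'title="View all articles from Leadership Pro">Leadership Pro</a>\n'
)
# Hypothetical old-file fragment; the real file may look different.
old_sample = '<p>, in <a href="https://example.com/old.html" title="Vezi toate articolele">x</a></p>\n'

print(copy_link(new_sample, old_sample))
```

Reading the two files and writing the result back can then stay a thin layer around this function.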