Python Forum

In my html file I have this line:

	<div class="color-black mt-lg-0" id="hidden">, in</div>
    <a href="https://neculaifantanaru.com/en/leadership-pro.html" title="View all articles from Leadership Pro" class="color-green font-weight-600 mx-1" id="hidden">Leadership Pro</a>

I use this regex:

^\s*<a href="(.*?)" title="View

in order to find this link

https://neculaifantanaru.com/en/leadership-pro.html

In Notepad++ the regex search works fine.

The problem is in Python.

FIND: (on line 18)

b_content = re.search('^\s*<a href="(.*?)" title="View', new_file_content).group(1)

REPLACE:

old_file_content = re.sub(', in <a href="(.*?)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)

Gives me this error on line 18:

    Traceback (most recent call last):
      File "<module2>", line 18, in <module>
    AttributeError: 'NoneType' object has no attribute 'group'

I also tried changing that line to:

b_content = re.match(r'^\s*<a href="(.*?)" title="View', new_file_content).group(1)

but I get the same error.

Doing regex on HTML is super annoying. Use an HTML parser instead (like beautifulsoup).

Your regex is anchored at the front of the string. If your new_file_content contains the entire file, then the match will fail. When I try your command, but with only the second line in that variable, it matches.
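A minimal sketch of the point above, assuming new_file_content holds the two lines from the first post (the sample string is a shortened copy of it, not the real file). Without a multiline flag, ^ can only match at position 0 of the whole string, which is where the <div> line starts:

```python
import re

# Shortened copy of the sample from the first post: the <a> tag is on line 2.
new_file_content = (
    '\t<div class="color-black mt-lg-0" id="hidden">, in</div>\n'
    '    <a href="https://neculaifantanaru.com/en/leadership-pro.html" '
    'title="View all articles from Leadership Pro">Leadership Pro</a>\n'
)
pattern = r'^\s*<a href="(.*?)" title="View'

# Whole content: ^ matches only at the very start, where the <div> line is.
whole = re.search(pattern, new_file_content)

# Only the second line: now the string itself begins with the <a> line.
second_line = new_file_content.splitlines()[1]
single = re.search(pattern, second_line)

print(whole)            # None
print(single.group(1))  # https://neculaifantanaru.com/en/leadership-pro.html
```

Calling .group(1) on the first result is what raises the AttributeError, because re.search returned None.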

(Jun-28-2023, 07:32 AM) bowlofred wrote: Your regex is anchored at the front of the string.

The regex multiline mode (?m) could do the trick.

(Jun-28-2023, 07:42 AM) Gribouillis wrote:
(Jun-28-2023, 07:32 AM) bowlofred wrote: Your regex is anchored at the front of the string.
The regex multiline mode (?m) could do the trick.

Hello. Can you update my code so I can understand it better?

The old classic read.

from bs4 import BeautifulSoup
import re

html = '''\
<div class="color-black mt-lg-0" id="hidden">, in</div>
<a href="https://neculaifantanaru.com/en/leadership-pro.html" title="View all articles from Leadership Pro" class="color-green font-weight-600 mx-1" id="hidden">Leadership Pro</a>
'''

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a').get('href')
print(link)

Output:
https://neculaifantanaru.com/en/leadership-pro.html

In case you want a working regex anyway: as the link says, you should not use regex with HTML/XML.
It can work on a small fragment like the one here, but it can and will blow up with errors on larger HTML.

>>> import re
>>> 
>>> b_content = re.search(r"<a href=\"(.*?)\"", html).group(1)
>>> b_content
'https://neculaifantanaru.com/en/leadership-pro.html'

import re

# Read the contents of new-file.html
with open('c:/Folder7/new-file.html', 'r') as file:
    first_code = file.read()

# Read the contents of old-file.html
with open('c:/Folder7/old-file.html', 'r') as file:
    second_code = file.read()

# Extract the URL from first_code
match = re.search('<a href="(.*?)" title="View all articles', first_code)
if match is not None:
    url = match.group(1)
    # Replace the URL in second_code
    second_code = re.sub(', in <a href=".*?" title="Vezi toate', f', in <a href="{url}" title="Vezi toate', second_code)

    # Write the modified contents back to old-file.html
    with open('c:/Folder7/old-file.html', 'w') as file:
        file.write(second_code)
else:
    print("No match found")

(Jun-28-2023, 07:50 AM) Melcu54 wrote: Can you update my code so I can understand it better?

Add (?m) at the beginning of the regex, as specified in the re.MULTILINE documentation. It is very useful to read the documentation.
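For illustration, a small sketch of both spellings, assuming a two-line sample string like the one in the first post (the variable names here are only for the demo). The inline (?m) prefix and the re.MULTILINE argument do the same thing: ^ then also matches right after each newline.

```python
import re

# Two-line sample modelled on the first post; the <a> tag is on line 2.
text = (
    '<div class="color-black mt-lg-0" id="hidden">, in</div>\n'
    '    <a href="https://neculaifantanaru.com/en/leadership-pro.html" '
    'title="View all articles from Leadership Pro">Leadership Pro</a>\n'
)

# Inline flag at the very start of the pattern.
inline = re.search(r'(?m)^\s*<a href="(.*?)" title="View', text)

# Same pattern, with the flag passed as an argument instead.
flagged = re.search(r'^\s*<a href="(.*?)" title="View', text, re.MULTILINE)

print(inline.group(1))   # https://neculaifantanaru.com/en/leadership-pro.html
print(flagged.group(1))  # https://neculaifantanaru.com/en/leadership-pro.html
```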

As advised, no regex 🔨 with HTML/XML.

from bs4 import BeautifulSoup

with open('file.html') as file:
    first_code = file.read()

with open('old-file.html') as file:
    second_code = file.read()

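soup = BeautifulSoup(first_code, 'html.parser')
link = soup.find('a')
# Sets href to the full text of old-file.html; judging by the output,
# that file presumably held just a URL in this demo.
link['href'] = second_code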

with open('old-file.html', 'w') as file:
    file.write(soup.prettify())

Output:
<div class="color-black mt-lg-0" id="hidden">
 , in
</div>
<a class="color-green font-weight-600 mx-1" href="https://python-forum.io" id="hidden" title="View all articles from Leadership Pro">
 Leadership Pro
</a>
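If installing BeautifulSoup is not an option, the extraction step can also be sketched with the standard library's html.parser. This is a different tool from the one used above, and FirstLinkFinder is a hypothetical helper name made up for this example. It only reads the href of the first <a> tag and does not rewrite any file.

```python
from html.parser import HTMLParser


class FirstLinkFinder(HTMLParser):
    """Records the href of the first <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only the first link.
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")


html = '''\
<div class="color-black mt-lg-0" id="hidden">, in</div>
<a href="https://neculaifantanaru.com/en/leadership-pro.html" title="View all articles from Leadership Pro" class="color-green font-weight-600 mx-1" id="hidden">Leadership Pro</a>
'''

finder = FirstLinkFinder()
finder.feed(html)
print(finder.href)  # https://neculaifantanaru.com/en/leadership-pro.html
```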

Thank you very much.

SOLUTION 1:

FIND:

b_content = re.search(r'^\s*<a href="(.*?)" title="View', new_file_content).group(1)

REPLACE:

old_file_content = re.sub(', in <a href="(.*?)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)

SOLUTION 2:

FIND:

b_content = re.match(r'^\s*<a href="(.*?)" title="View', new_file_content).group(1)

REPLACE:

old_file_content = re.sub(', in <a href="(.*?)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)

SOLUTION 3:

import re

b_content = re.match(r'^\s*<a href="(.*?)" title="View', new_file_content)
if b_content is not None:
    b_content = b_content.group(1)
else:
    b_content = "No match found"

SOLUTION 4:

import re

match = re.search(r'^\s*<a href="(.*?)" title="View', new_file_content)
if match is not None:
    b_content = match.group(1)
    old_file_content = re.sub(', in <a href="(.*?)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)
else:
    print("No match found")

SOLUTION 5: (use re.MULTILINE)

import re

match = re.search(r'^\s*<a href="(.*?)" title="View', new_file_content, re.MULTILINE)
if match is not None:
    b_content = match.group(1)
    old_file_content = re.sub(', in <a href="([^"]*)" title="Vezi', f', in <a href="{b_content}" title="Vezi', old_file_content)
else:
    print("No match found")
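To try Solution 5 without touching any files, here is a self-contained sketch. Both sample strings are assumptions made up for the demo (the real file contents are not shown in this thread), including the old URL https://example.com/old.html. The replacement is passed as a function so the URL is inserted literally, even if it ever contained a backslash.

```python
import re

# Hypothetical sample contents; the real files are not shown in the thread.
new_file_content = (
    '<div class="color-black mt-lg-0" id="hidden">, in</div>\n'
    '    <a href="https://neculaifantanaru.com/en/leadership-pro.html" '
    'title="View all articles from Leadership Pro">Leadership Pro</a>\n'
)
old_file_content = (
    '<p>, in <a href="https://example.com/old.html" '
    'title="Vezi toate">Leadership Pro</a></p>\n'
)

match = re.search(r'^\s*<a href="(.*?)" title="View', new_file_content, re.MULTILINE)
if match is not None:
    b_content = match.group(1)
    # A function replacement is used as-is, so re.sub does not interpret
    # backslashes or group references inside the inserted URL.
    old_file_content = re.sub(
        r', in <a href="[^"]*" title="Vezi',
        lambda m: f', in <a href="{b_content}" title="Vezi',
        old_file_content,
    )
else:
    print("No match found")

print(old_file_content)
```

With these samples, the href in old_file_content ends up pointing at the leadership-pro.html URL while the Romanian title stays unchanged.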

Posters, in thread order: Melcu54, bowlofred, Gribouillis, Melcu54, snippsat, Melcu54, Gribouillis, snippsat, Melcu54, Melcu54