Python Forum

BeautifulSoup - I can't translate HTML tags that contain <a href=...></a> or <em></em>
I can't translate HTML tags that contain other tags (such as <a href=...></a> or <em></em>).

In the example below, the paragraph <p class="JAGAAA">...</p> is the problem: it does not get translated. All the other p classes are translated fine. Only this class fails, because it contains <a href=...></a> and <em></em> tags.

I have tried many things and I don't know why my code is not working. I don't get any error; this class is simply not translated.


    <p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau</p>
This is the relevant part of the code:

    import os
    from bs4 import BeautifulSoup, NavigableString
    import re
    import textwrap
    from googletrans import Translator
    import pprint
    
    ...

    with open(f"{base_path}/{file}", "r", encoding="utf8", errors="ignore") as open_file:
      data = open_file.read()
    if data == "":
      print("{} este gol".format(file))  # "este gol" means "is empty"
      continue
    # Serialized HTML; the translated strings are substituted into this copy.
    lxml1 = str(BeautifulSoup(data, 'lxml'))
    #lxml1 = data
    lxml1 = lxml1.replace("\ufeff", " ")  # replace any byte order mark with a space
    #lxml1 = lxml1.replace("\n" , " ")
    #lxml1 = re.sub(' +', ' ', lxml1)
    if read_tags:
      soup = BeautifulSoup(data, 'lxml')
      title_tag = soup.find("title")
      ist_p_tag = soup.find("p", class_="text_obisnuit2")
      ist3_p_tag = soup.find("p", class_="JAGAAA")
      second_p_tag = soup.find("p", class_="donoo")
      meta_tag = soup.find("meta")
      if title_tag is None:
        print("Title tag not found")
      else:
        translated_title = translator.translate(title_tag.text, dest=input_lang)
        lxml1 = lxml1.replace(title_tag.text, translated_title.text)
      if meta_tag is None:
        print("meta tag not found")
      else:
        translated_meta = translator.translate(meta_tag["content"], dest=input_lang)
        lxml1 = lxml1.replace(meta_tag["content"], translated_meta.text)

      if ist_p_tag is None:
        print("<p class='text_obisnuit2' /> not found")
      else:
        translated_p = translator.translate(ist_p_tag.text, dest=input_lang)
        lxml1 = lxml1.replace(ist_p_tag.text, translated_p.text)

      if ist3_p_tag is None:
        print("<p class='JAGAAA' /> not found")
      else:
        translated_p = translator.translate(ist3_p_tag.text, dest=input_lang)
        lxml1 = lxml1.replace(ist3_p_tag.text, translated_p.text)
Instead of:

    ist3_p_tag = soup.find("p" , class_="JAGAAA")

try this:

    ist3_p_tag = soup.find("p" , {'class_': "JAGAAA"})

Hello, thanks, but it is not working. I now have what you suggested:

    ist3_p_tag = soup.find("p" , {'class_': "JAGAAA"})

    if ist3_p_tag is None:
      print("<p class='JAGAAA' /> not found")
    else:
      translated_p = translator.translate(ist3_p_tag.text, dest=input_lang)
      lxml1 = lxml1.replace(ist3_p_tag.text, translated_p.text)

It doesn't translate. The code skips this tag.
What does ist3_p_tag contain?
It contains text with <em></em> tags and an <a href=...></a> tag. That is why it doesn't work: because of these two inner tags.

<p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau.</p>
What about ist3_p_tag.text?

I see that you try to get the text, but have you checked what it actually contains?
I tried all of the options below, and it still doesn't work:

ist3_p_tag = soup.find("p" , {'class': "JAGAAA"})
ist3_p_tag = soup.find('p', attr={'class_': 'JAGAAA'})
ist3_p_tag = soup.find("p" , attr={'class_': "JAGAAA"})
ist3_p_tag = soup.find_all("p", class_="JAGAAA")
ist3_p_tag = soup.find("p" , {'class_': "JAGAAA"})
ist3_p_tag.text = soup.find("p" , {'class_': "JAGAAA"})
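
For reference, the variants listed above do not all ask BeautifulSoup the same question. The sketch below is a minimal comparison, assuming a shortened copy of the paragraph from this thread and the built-in html.parser backend instead of lxml, so that it runs without an extra dependency; the variable names are illustrative only. class_ is the keyword-argument spelling, whereas inside an attrs dictionary the key is the real attribute name, class.

```python
from bs4 import BeautifulSoup

# Shortened copy of the paragraph from this thread (illustrative sample).
html = (
    '<p class="JAGAAA">Intr-un articol precedent, '
    '<a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> '
    '<em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ has a trailing underscore because "class" is reserved in Python.
by_keyword = soup.find("p", class_="JAGAAA")

# Dictionary form: the key is the actual HTML attribute name, "class".
by_dict = soup.find("p", {"class": "JAGAAA"})

# A dictionary key of "class_" looks for an attribute literally named class_,
# which this paragraph does not have, so nothing matches.
by_wrong_key = soup.find("p", {"class_": "JAGAAA"})

# find_all returns a list-like ResultSet, so index it before asking for text.
all_matches = soup.find_all("p", class_="JAGAAA")

print(by_keyword is not None)
print(by_dict is not None)
print(by_wrong_key is None)
print(len(all_matches))
```

With the dictionary key class_, find returns None, so the code from the first post would take the "not found" branch rather than attempt a translation.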
I am unable to reproduce the problem you describe. It is still not clear what ist3_p_tag.text returns or contains.

Here is mine:
>>> from bs4 import BeautifulSoup
>>> html = """<p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau</p>"""
>>> soup = BeautifulSoup(html, 'lxml')
>>> soup
<html><body><p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau</p></body></html>
>>> p = soup.find('p', class_='JAGAAA')
>>> p
<p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau</p>
>>> p.text
'Intr-un articol precedent,  Dupa toate regulile artei , v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau'
As you can see, the text is in p.text regardless of the inline tags.
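
A related check, offered as a sketch rather than as a diagnosis confirmed in this thread: the code in the first post substitutes the translation with lxml1.replace(ist3_p_tag.text, ...), and str.replace returns the string unchanged, without any error, when the search text does not occur in it. The example below uses only the standard library, with the markup and the p.text value copied from the session above; the placeholder "TRANSLATED" is an assumption standing in for the translator's output.

```python
# Serialized markup of the paragraph, as shown in the session above.
markup = (
    '<p class="JAGAAA">Intr-un articol precedent, '
    '<a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> '
    '<em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, '
    'care voia sa razbune moartea tatalui sau</p>'
)

# Value of p.text from the same session: the tags are gone, only the text remains.
plain_text = (
    'Intr-un articol precedent,  Dupa toate regulile artei , v-am povestit '
    'despre tanarul Hamlet, care voia sa razbune moartea tatalui sau'
)

# The tag-free text is not a contiguous substring of the markup,
# because the <a> and <em> tags sit in the middle of it.
text_occurs_in_markup = plain_text in markup

# str.replace raises no error when nothing matches; it returns the input unchanged.
replaced = markup.replace(plain_text, "TRANSLATED")
markup_unchanged = replaced == markup

print(text_occurs_in_markup)  # False
print(markup_unchanged)       # True
```

For a paragraph without inline tags, the text is usually a contiguous substring of the markup, which is consistent with the other classes being replaced as expected.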
As you can see, I define a new variable for each particular HTML <p class>. ist3_p_tag belongs to the class JAGAAA. All the other classes work fine, because they have no <em></em> or <a></a> in them:

to_p_tag = soup.find_all('p', class_='text_obisnuit')
ist_p_tag = soup.find("p" , class_="text_obisnuit2")
second_p_tag = soup.find("p" , class_="donoo")
ist3_p_tag = soup.find("p" , class_="JAGAAA")
It doesn't matter whether you call it ist3_p_tag or p, as I did. How exactly does it not work?

If soup.find can't find a p tag with class_="JAGAAA", it returns None, and ist3_p_tag will be None.

In your code you are checking whether ist3_p_tag is None. Does it print "<p class='JAGAAA' /> not found"?

If not, then ist3_p_tag is not None, and ist3_p_tag = soup.find("p" , class_="JAGAAA") is working.

Put

print(ist3_p_tag.text)
at the end of your code to see what it contains.

If it contains all of the text in the p tag, then soup.find works fine and you have to look at why the translation isn't working.

Look at my code. It is the same p tag, the class filter is used the same way as in your code, and soup.find does its job. The inline tags do not stop soup.find from matching.

Add the print as I suggested above and see whether you get the text. If you do, the translation step is what is causing this "not working".
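
If the print shows the expected text, one possible way to translate a paragraph that contains inline tags is to replace its text nodes one by one inside the parsed tree, instead of substituting strings in the serialized HTML. This is a minimal sketch under stated assumptions, not the method used in this thread: translate_text_nodes and fake_translate are hypothetical names, fake_translate merely upper-cases the text in place of a real call such as translator.translate(...).text, and html.parser is used instead of lxml to avoid an extra dependency.

```python
from bs4 import BeautifulSoup


def fake_translate(text):
    # Hypothetical stand-in for a real translation call; it only upper-cases.
    return text.upper()


def translate_text_nodes(tag, translate):
    # Hypothetical helper: visit every text node under the tag and swap it
    # for its translation, leaving <a>, <em> and their attributes in place.
    for node in list(tag.find_all(string=True)):
        if node.strip():  # skip whitespace-only nodes between tags
            node.replace_with(translate(str(node)))


html = (
    '<p class="JAGAAA">Intr-un articol precedent, '
    '<a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> '
    '<em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet</p>'
)
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", class_="JAGAAA")

translate_text_nodes(p, fake_translate)
result = str(p)
print(result)
```

Translating each fragment separately keeps the markup intact but gives the translator less sentence context, so the wording may differ from a translation of the whole paragraph.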