Python Forum
BeautifulSoup - I can't translate html tags that contain <a href=..</a> OR <em></em>
#1
I can't translate HTML tags that contain other tags (such as <a href=..</a> or <em></em>).

In the example below, the paragraph <p class="JAGAAA">..</p> is the problem: I cannot translate it. All the other p classes are translated fine, but not this one, because it contains <a href=..</a> and <em></em> tags.

I have tried many things and I don't know why my code is not working. I don't get any error; this class is simply not translated.


    <p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau</p>.
**THIS IS THE PART OF THE CODE**

    import os
    from bs4 import BeautifulSoup, NavigableString
    import re
    import textwrap
    from googletrans import Translator
    import pprint
    
    ...

    with open(f"{base_path}/{file}" , "r" , encoding='utf8', errors='ignore') as open_file:
      data = open_file.read()
    if data == "":
      print("{} este gol".format(file))
      continue
    lxml1 = str(BeautifulSoup(data, 'lxml'))
    #lxml1 = data
    lxml1 = lxml1.replace("\ufeff" , " ")
    #lxml1 = lxml1.replace("\n" , " ")
    #lxml1 = re.sub(' +', ' ', lxml1)
    if(read_tags == True):
      soup = BeautifulSoup(data, 'lxml')
      title_tag = soup.find("title")
      ist_p_tag = soup.find("p" , class_="text_obisnuit2")
      ist3_p_tag = soup.find("p" , class_="JAGAAA")
      second_p_tag = soup.find("p" , class_="donoo")
      meta_tag = soup.find("meta")
      if(title_tag ==  None):
        print("Title tag not found")
      else:
        translated_title = translator.translate(title_tag.text, dest=input_lang)
        lxml1 = lxml1.replace(title_tag.text,translated_title.text)
      if(meta_tag ==  None):
        print("meta tag not found")
      else:
        translated_meta = translator.translate(meta_tag["content"], dest=input_lang)
        lxml1 = lxml1.replace(meta_tag["content"],translated_meta.text)
        
      if(ist_p_tag == None):
        print("<p class='text_obisnuit2' /> not found")
      else:
        translated_p = translator.translate(ist_p_tag.text, dest=input_lang)
        lxml1 = lxml1.replace(ist_p_tag.text,translated_p.text)

      if(ist3_p_tag == None):
        print("<p class='JAGAAA' /> not found")
      else:
        translated_p = translator.translate(ist3_p_tag.text, dest=input_lang)
        lxml1 = lxml1.replace(ist3_p_tag.text,translated_p.text) 
Reply
#2
instead of:
ist3_p_tag = soup.find("p" , class_="JAGAAA")
try:
ist3_p_tag = soup.find("p" , {'class_': "JAGAAA"})
Reply
#3
Hello, thanks, but it is not working. I now have it as you suggested:

      ist3_p_tag = soup.find("p" , {'class_': "JAGAAA"})

      if(ist3_p_tag == None):
        print("<p class='JAGAAA' /> not found")
      else:
        translated_p = translator.translate(ist3_p_tag.text, dest=input_lang)
        lxml1 = lxml1.replace(ist3_p_tag.text,translated_p.text)
It doesn't translate. The code skips this tag.
Reply
#4
What does ist3_p_tag contain?
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#5
It contains text with <em></em> tags and an <a href=..</a> tag. That is why it doesn't work: because of these two inner tags.

<p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau.</p>
Reply
#6
What about ist3_p_tag.text?

I see that you try to get the text, but have you checked what it actually contains?
Reply
#7
I tried all of the options below, and it still doesn't work:

ist3_p_tag = soup.find("p" , {'class': "JAGAAA"})
ist3_p_tag = soup.find('p', attr={'class_': 'JAGAAA'})
ist3_p_tag = soup.find("p" , attr={'class_': "JAGAAA"})
ist3_p_tag = soup.find_all("p", class_="JAGAAA")
ist3_p_tag = soup.find("p" , {'class_': "JAGAAA"})
ist3_p_tag.text = soup.find("p" , {'class_': "JAGAAA"})
Reply
#8
I am unable to reproduce what you are describing. It is still not clear what ist3_p_tag.text returns or contains.

Here is mine:
>>> from bs4 import BeautifulSoup

>>> html = """<p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau</p>"""

>>> soup = BeautifulSoup(html, 'lxml')

>>> soup

<html><body><p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau</p></body></html>

>>> p = soup.find('p', class_='JAGAAA')

>>> p
<p class="JAGAAA">Intr-un articol precedent, <a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> <em>Dupa toate regulile artei</em> </a>, v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau</p>

>>> p.text
'Intr-un articol precedent,  Dupa toate regulile artei , v-am povestit despre tanarul Hamlet, care voia sa razbune moartea tatalui sau'
As you can see the text is in p.text regardless of inline tags.
Reply
#9
As you can see, I define a new variable for each particular <p class>. ist3_p_tag belongs to the class JAGAAA. All the other classes work fine, because they have no <em></em> or <a></a> in them.

to_p_tag = soup.find_all('p', class_='text_obisnuit')
ist_p_tag = soup.find("p" , class_="text_obisnuit2")
second_p_tag = soup.find("p" , class_="donoo")
ist3_p_tag = soup.find("p" , class_="JAGAAA")
Reply
#10
It doesn't matter whether you call it ist3_p_tag or p, as I did. How exactly does it not work?

If soup.find can't find "p" , class_="JAGAAA" it will return None and ist3_p_tag will be None.

In your code you are checking if ist3_p_tag is None. Does it print "<p class='JAGAAA' /> not found" as it should?

If not, then ist3_p_tag is not None and ist3_p_tag = soup.find("p" , class_="JAGAAA") should be working.
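To make the None case concrete, here is a minimal sketch of how the different spellings of the class filter behave. It assumes bs4 is installed; the built-in html.parser and the shortened paragraph are my choices for illustration, not taken from your script. As far as I know, a dict passed as the second argument is matched against literal attribute names, so the key there should be "class", while class_ is only the keyword-argument spelling.

```python
from bs4 import BeautifulSoup

# Shortened copy of the paragraph, for illustration only.
html = ('<p class="JAGAAA">Intr-un articol precedent, <a href="#"> '
        '<em>Dupa toate regulile artei</em> </a>, v-am povestit</p>')
soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ avoids clashing with the Python keyword "class".
by_keyword = soup.find("p", class_="JAGAAA")

# Dict form: the key is the real attribute name, "class".
by_dict = soup.find("p", {"class": "JAGAAA"})

# Dict form with "class_": searches for an attribute literally named "class_".
by_wrong_key = soup.find("p", {"class_": "JAGAAA"})

print(by_keyword is not None)
print(by_dict is not None)
print(by_wrong_key is None)
```

If the "not found" message is printed only with the dict variant, the key name is the likely reason rather than the inline tags.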

Put

print(ist3_p_tag.text)
at the end of your code to see what it contains.

If it contains just all the text in the p tag then it works fine and you have to see why the translation isn't working.
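One place worth checking, as a sketch rather than a diagnosis of your exact script: the step lxml1.replace(ist3_p_tag.text, translated_p.text). Here lxml1 is the serialized HTML, which still has the <a> and <em> tags between the pieces of text, while .text has them stripped out. The example below uses plain strings only; markup is a shortened copy of the paragraph and text is the matching part of the p.text output shown earlier (both shortened by me).

```python
# Serialized markup, as it would appear inside the full HTML string.
markup = ('<p class="JAGAAA">Intr-un articol precedent, '
          '<a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> '
          '<em>Dupa toate regulile artei</em> </a>, v-am povestit</p>')

# What .text gives for that paragraph: text nodes joined, tags dropped.
text = "Intr-un articol precedent,  Dupa toate regulile artei , v-am povestit"

# The tag-free text is not a substring of the markup...
found = text in markup

# ...so str.replace has nothing to replace and returns the input unchanged,
# without raising any error. "TRANSLATED" is a placeholder.
replaced = markup.replace(text, "TRANSLATED")
unchanged = replaced == markup

print(found, unchanged)
```

For a paragraph without inline tags, the text is a substring of the markup and the replace succeeds, which would be consistent with the other classes working.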

Look at my code. It is the same p tag, the class filter is used the same way as in yours, and soup.find is doing fine. The inline tags are not a problem for soup.find.

Add the print I suggested above and see whether you are getting the text. If you are, the problem is in the translation step, not in soup.find.
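If the text does print, one possible way around inline tags is to translate each text node inside the paragraph in place and then serialize the tree, instead of doing a string replace on the whole HTML. This is only a sketch under assumptions: fake_translate is a hypothetical stand-in for translator.translate(text, dest=input_lang).text so the example runs offline, and html.parser plus the shortened paragraph are my choices.

```python
from bs4 import BeautifulSoup, NavigableString


def fake_translate(text):
    # Hypothetical stand-in for the real translation call; it only upper-cases.
    return text.upper()


html = ('<p class="JAGAAA">Intr-un articol precedent, '
        '<a href="https://neculaifantanaru.com/dupa-toate-regulile-artei.html"> '
        '<em>Dupa toate regulile artei</em> </a>, v-am povestit</p>')
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p", class_="JAGAAA")

# Collect the non-blank text nodes first, so the tree is not modified
# while it is being searched.
text_nodes = [node for node in p.find_all(string=True) if node.strip()]

# Swap each text node for its translated version; tags and attributes stay put.
for node in text_nodes:
    node.replace_with(NavigableString(fake_translate(str(node))))

result = str(soup)
print(result)
```

The trade-off is that each fragment is translated without the rest of the sentence, so the wording may come out worse than translating the paragraph as a whole.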
Reply

