Python Forum
How to find particular text from td tag using bs4
#1
Hi,

I have some HTML links, and I want to find particular labels in them along with the text that follows each label. I am using a regex, but I am getting lots of empty lists.

These are the links:

https://www.99acres.com/mailers/mmm_html...7-558.html
https://www.99acres.com/mailers/mmm_html...-2016.html
https://www.99acres.com/mailers/mmm_html...7-553.html

The text I am looking for:
Area Range: and the text after it
Possession: and the text after it (for example, possession 2019)
Price: and the text after it

My code is below:

import requests
from bs4 import BeautifulSoup
import csv
import json
import itertools
import re
file = {}
final_data = []
final = []
textdata = []
def readfile(alldata, filename):
    with open("./"+filename, "w") as csvfile:
        csvfile = csv.writer(csvfile, delimiter=",")
        for i in range(0, len(alldata)):
            csvfile.writerow(alldata[i])
def parsedata(url, values):
    r = requests.get(url, values)
    data = r.text
    return data

def getresults():
    global final_data, file
    with open("Mailers.csv", "r") as f:
        reader = csv.reader(f)
        next(reader)
        for row in reader:
            ids = row[0]
            link = row[1]
            html = parsedata(link, {})
            soup = BeautifulSoup(html, "html.parser")
            titles = soup.title.text
            td = soup.find_all("td")
            for i in td:
                sublist = []
                data = i.text
                pattern = r'(Possession:)(.)(.+)'
                x1 = re.findall(pattern, data)
                sublist.append(x1)
                sublist.append(link)
                final_data.append(sublist)
    print(final_data)
    return final_data
def main():
    getresults()
    readfile(final_data, "Data.csv")
main()
#2
Not all of those pages have the word "Possession" in them, and the pages that do have it, don't have it in every cell. Since you don't check whether there were any matches, your list has empty entries for every td that doesn't have any matches.
#3
How can I match only when the word is present? And how can I remove those empty lists?
#4
Just check if there's anything there, and if there isn't, don't add it to your list:
>>> import re
>>> data = ['<td width="40"></td>', '<td height="50"></td>', '<td width="40"></td>']
>>> # random sample data from the first link
...
>>> pattern = r'(Possession:)(.)(.+)'
>>> for cell in data:
...   x1 = re.findall(pattern, cell)
...   print(x1)
...
[]
[]
[]
>>> for cell in data:
...   x1 = re.findall(pattern, cell)
...   if x1:
...     print(x1)
...
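
Applied to a loop like the one in the first post, the same check might look like the sketch below. The helper name collect_matches, the sample cell strings, and the example.com link are made up for illustration; only the pattern comes from the thread.

```python
import re

# Same pattern as in the thread
PATTERN = re.compile(r'(Possession:)(.)(.+)')

def collect_matches(cells, link):
    """Return one [matches, link] row per cell that actually contains a match."""
    rows = []
    for text in cells:
        found = PATTERN.findall(text)
        if found:  # an empty list is falsy, so cells without a match are skipped
            rows.append([found, link])
    return rows

# Hypothetical sample cells, not taken from the real pages
sample_cells = ['', 'Area Range: 1200 sq.ft.', 'Possession: September 2019']
rows = collect_matches(sample_cells, 'https://example.com/mailer.html')
print(rows)
```

Only the third sample cell produces a row, so nothing empty ends up in the list that is later written to the CSV.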
#5
I found this useful, but the same match is now repeated multiple times. How can I solve that too?
Reply
#6
I don't know what you mean. Can you share what some of the output is now?
#7
Output:
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
[('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
[('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'New Launch</td>')]
#8
If the order doesn't matter, you could use a set instead of a list, so duplicates will just be ignored.
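
A minimal sketch of that idea, using a few tuples copied from the output in post #7 as sample data. Note that re.findall returns a list, and lists can't be stored in a set, so the tuples inside it are added instead. The dict.fromkeys variant is an extra suggestion for the case where the order does matter.

```python
# Sample matches copied from the output above (shortened to three entries)
matches = [
    ('Possession:', ' ', 'September 2019</td>'),
    ('Possession:', ' ', 'September 2019</td>'),
    ('Possession:', ' ', 'New Launch</td>'),
]

# A set keeps one copy of each tuple; order is not guaranteed
unique_unordered = set()
unique_unordered.update(matches)

# dict keys are unique and keep insertion order, so this preserves first-seen order
unique_ordered = list(dict.fromkeys(matches))

print(unique_ordered)
```

In the scraping loop this would mean calling update(x1) on one set per page (or one for the whole run) instead of appending x1 to a list.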