Python Forum

Hi,

I have some HTML links, and I want to find certain pieces of text along with the text that follows each one. I am using regex, but I am getting lots of empty lists.

These are links:

https://www.99acres.com/mailers/mmm_html...7-558.html
https://www.99acres.com/mailers/mmm_html...-2016.html
https://www.99acres.com/mailers/mmm_html...7-553.html

The text I am looking for is "Area Range:" and the text that follows it, "Possession:" and the text that follows it (for example, "Possession: 2019"), and "Price:" and the text that follows it.

Below is my code:

import requests
from bs4 import BeautifulSoup
import csv
import json
import itertools
import re
file = {}
final_data = []
final = []
textdata = []
def readfile(alldata, filename):
    with open("./"+filename, "w") as csvfile:
        csvfile = csv.writer(csvfile, delimiter=",")
        for i in range(0, len(alldata)):
            csvfile.writerow(alldata[i])
def parsedata(url, values):
    r = requests.get(url, values)
    data = r.text
    return data

def getresults():
    global final_data, file
    with open("Mailers.csv", "r") as f:
        reader = csv.reader(f)
        next(reader)
        for row in reader:
            ids = row[0]
            link = row[1]
            html = parsedata(link, {})
            soup = BeautifulSoup(html, "html.parser")
            titles = soup.title.text
            td = soup.find_all("td")
            for i in td:
                sublist = []
                data = i.text
                pattern = r'(Possession:)(.)(.+)'
                x1 = re.findall(pattern, data)
                sublist.append(x1)
                sublist.append(link)
                final_data.append(sublist)
    print(final_data)
    return final_data
def main():
    getresults()
    readfile(final_data, "Data.csv")
main()

Not all of those pages have the word "Possession" in them, and the pages that do have it don't have it in every cell. Since you don't check whether there were any matches, your list gets an empty entry for every td that has no match.

How can I match only when the word is present? And how can I remove those empty lists?

Just check if there's anything there, and if there isn't, don't add it to your list:

>>> import re
>>> data = ['<td width="40"></td>', '<td height="50"></td>', '<td width="40"></td>']
>>> # random sample data from the first link
...
>>> pattern = r'(Possession:)(.)(.+)'
>>> for cell in data:
...   x1 = re.findall(pattern, cell)
...   print(x1)
...
[]
[]
[]
>>> for cell in data:
...   x1 = re.findall(pattern, cell)
...   if x1:
...     print(x1)
...
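
Applied to the loop in the original script, the same check might look roughly like the sketch below. The helper name `collect_matches`, the sample cells, and the example.com link are made up for illustration; they are not taken from the real pages.

```python
import re

PATTERN = r'(Possession:)(.)(.+)'

def collect_matches(cells, link):
    """Return [matches, link] rows, skipping cells that have no match."""
    rows = []
    for cell in cells:
        found = re.findall(PATTERN, cell)
        if found:  # an empty list is falsy, so non-matching cells are skipped
            rows.append([found, link])
    return rows

# Made-up sample strings standing in for the text of each <td>
cells = ['', 'Possession: September 2019', 'Price: on request']
rows = collect_matches(cells, 'http://example.com/mailer.html')
print(rows)
```

Only the cell that actually contains "Possession:" produces a row, so nothing empty reaches the CSV.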

I found this useful, but the same result is now repeated multiple times. How can I solve that too?

I don't know what you mean. Can you share what some of the output is now?

Here is the output:

[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'September 2019</td>')]
[('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
[('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
[('Possession:', ' ', 'December 2017" border="0" hspace="0" src="http://www.99acres.com/mailers/images/lg-greenfield-11nov-2016_04.jpg" style="device-width:300px; max-width:600px; width:inherit; font-family:Calibri, Arial; font-size:15px; color:#242424; font-weight:bold; text-align:center; text-transform:uppercase;" vspace="0"/> </div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'March 2019" border="0" hspace="0" src="http://www.99acres.com/mailers/images/royal-homes-25apr2017-853-1-4.jpg?v=1.05" style="device-width:300px; max-width:700px; width:inherit; font-family:\'Segoe UI\', Arial; font-size:20px; color:#333333; text-align:center;" vspace="0"/></div></td>')]
[('Possession:', ' ', 'New Launch</td>')]
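
The captured text in this output runs on into trailing markup (`</td>`, attribute text), because `(.+)` takes everything up to the end of the line. One possible way to narrow it, assuming the value always ends at the next `<` or `"`, is sketched below. The sample strings are shortened from the output above; the pattern is just one option.

```python
import re

# Capture only up to the next '<' or '"' so trailing markup is left out.
# Assumption: the value itself never contains either of those characters.
pattern = r'Possession:\s*([^<"]+)'

samples = [
    'Possession: September 2019</td>',
    'Possession: December 2017" border="0" hspace="0"/> </div></td>',
    'Possession: New Launch</td>',
]

values = [m.strip() for s in samples for m in re.findall(pattern, s)]
print(values)
```

With a single capture group, `re.findall` returns plain strings instead of tuples, which is also easier to write to a CSV column.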

If the order doesn't matter, you could use a set instead of a list, so duplicates will just be ignored.
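
A minimal sketch of that idea, using a few tuples copied from the output above. Note that `re.findall` returns a list, which cannot be put into a set directly, so the tuples inside it are what get added. If the order does matter, `dict.fromkeys` is one alternative that keeps the first occurrence of each item (on Python 3.7 and later).

```python
matches = [
    ('Possession:', ' ', 'September 2019</td>'),
    ('Possession:', ' ', 'September 2019</td>'),
    ('Possession:', ' ', 'New Launch</td>'),
    ('Possession:', ' ', 'September 2019</td>'),
]

# A set drops duplicates but does not keep the original order
unique_unordered = set()
unique_unordered.update(matches)  # add the tuples, not the list itself

# dict.fromkeys keeps the first occurrence of each item, in order
unique_ordered = list(dict.fromkeys(matches))
print(unique_ordered)
```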

Prince_Bhatia

nilamo

Prince_Bhatia

nilamo

Prince_Bhatia

nilamo

Prince_Bhatia

nilamo