Python Forum

Full Version: Need some help with parsing
Hi everyone,
First of all, thanks for the assistance. I am a noob in Python and need help with parsing a website.
The website requires a login, which I managed to do, and I managed to print the page I needed (one of many).
The issue is that I am trying to parse the page and search for two specific things.

My code is
from requests import Session
from bs4 import BeautifulSoup as bs

 
with Session() as s:
    site = s.get("https://connectedinvestors.com/login")
    bs_content = bs(site.content, "html.parser")
    token = bs_content.find("input", {"name":"infusion_id"})["value"]
    login_data = {"email":"myemailaddress","password":"mypassword", "infusion_id":token}
    s.post("https://connectedinvestors.com/login",login_data)
    home_page = s.get("https://connectedinvestors.com/friends-list")
    print(home_page.content)
When I do the print, I get the entire page data, which looks like this (just a sample).

I need to pull the name of the member and the user ID.
Both can be seen in this line:
<a itemprop="url" href="/member/rudy-acosta">\n <figure class="circle profile bordered bordered-level-2">\n <img itemprop="image" src="/uploads/user/15401/img_544bf971f23de.jpg" alt="Rudy Acosta"/>\n


I hope it is clear.
Thanks
When you call https://connectedinvestors.com/login you get blank fields.

There is no rudy-acosta.
@Axel_Erfurt, that output comes after he logs in with the code shown.

Pass the page source to BeautifulSoup again.
home_page = s.get("https://connectedinvestors.com/friends-list")
#print(home_page.content)
bs_after_login = bs(home_page.content, "html.parser")
Then you can parse it like this.
>>> s = bs_after_login.find('a')
>>> s
<a href="/member/rudy-acosta" itemprop="url">
<figure class="circle profile bordered bordered-level-2">
<img alt="Rudy Acosta" itemprop="image" src="/uploads/user/15401/img_544bf971f23de.jpg"/>
</figure></a>
>>> 
>>> s.get('href')
'/member/rudy-acosta'
>>> 
>>> # The img tag also has the name
>>> img_tag = s.find('img')
>>> img_tag.attrs
{'alt': 'Rudy Acosta',
 'itemprop': 'image',
 'src': '/uploads/user/15401/img_544bf971f23de.jpg'}

>>> img_tag.attrs['alt']
'Rudy Acosta'
Hi,
thanks for the prompt answer, guys.
Basically, on each page like "https://connectedinvestors.com/friends-list" (there are plenty of them, like this one: https://connectedinvestors.com/member/jo.../friends/2)
I have 30 friends, and each one has a name and an ID.

I just need to parse it in a way (if possible, of course) that will eventually give me

name id

for each member.
Once I have this I will ask the next question :)

I appreciate it a lot.
Give it a try, and look at Web-Scraping part-1.
Here are some hints.
h4 class="card-name" has all the info about one card.
>>> tag_h4 = soup.find('h4', class_="card-name")
>>> tag_h4
<h4 class="card-name" itemprop="founder">
<a href="/member/derek-hodge" itemprop="url">
                            DEREK HODGE                        </a>
</h4>
>>> 
>>> a = tag_h4.find('a')
>>> a.get('href')
'/member/derek-hodge'
>>> 
>>> # There is also text in this a tag
>>> a.text.strip()
'DEREK HODGE'
Thanks snippsat.
I managed to get the name but had an issue with pulling the img.
I got this:
AttributeError: 'NoneType' object has no attribute 'attrs'

How do I run it for all of them? Currently it is only pulling one name out of 30.
Thanks
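An error like this usually means that find() found no matching tag and returned None, and the code then tried to read .attrs from that None. The poster's code is not shown, so the following is only a minimal sketch of the guard, using invented sample HTML and a hypothetical helper named image_src:

```python
from bs4 import BeautifulSoup

# Invented sample markup: the second card deliberately has no <img> tag.
SAMPLE_HTML = """
<div class="card"><img alt="Rudy Acosta" src="/uploads/user/15401/img_544bf971f23de.jpg"/></div>
<div class="card"><p>No picture here</p></div>
"""


def image_src(card):
    """Return the src of the first img inside card, or None if there is none."""
    img = card.find("img")
    if img is None:  # find() returns None when nothing matches
        return None
    return img.get("src")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
for card in soup.find_all("div", class_="card"):
    print(image_src(card))
```

Checking for None before touching .attrs (or using .get) lets the loop continue over the remaining cards instead of stopping at the first one without a picture.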
You should show your code and the error, and also what you have tried.
As I started testing in the previous post, here is the code that should do it.
import requests
from bs4 import BeautifulSoup

url = 'https://connectedinvestors.com/member/jonathan-kessous/friends/2'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
for tag in soup.find_all('h4', class_="card-name"):
    a_link = tag.find('a')
    print('-'*25)
    print(a_link.get('href'))
    print(a_link.text.strip())
Output:
-------------------------
/member/jonathan-kessous
Jonathan Kessous
-------------------------
/member/devon-van-nostrand
Devon Vannostrand
-------------------------
/member/angelo-argentieri
Angelo Argentieri
-------------------------
/member/clayton-zelazowski
Clayton Zelazowski
-------------------------
... etc.
Hi,
Running this code gets me the list of friends I have on that specific page.
from requests import Session
from bs4 import BeautifulSoup as bs

 
with Session() as s:
    site = s.get("https://connectedinvestors.com/login")
    bs_content = bs(site.content, "html.parser")
    token = bs_content.find("input", {"name":"infusion_id"})["value"]
    login_data = {"email":"myemailaddress","password":"password", "infusion_id":token}
    s.post("https://connectedinvestors.com/login",login_data)

    # Reuse the logged-in session so the request carries the login cookies.
    url = 'https://connectedinvestors.com/member/jonathan-kessous/friends/2'
    url_get = s.get(url)
    soup = bs(url_get.content, 'html.parser')
for tag in soup.find_all('h4', class_="card-name"):
    a_link = tag.find('a')
    print('-'*25)
    print(a_link.get('href'))
    print(a_link.text.strip())
It gets me this:

Output:
-------------------------
/member/jonathan-kessous
Jonathan Kessous
-------------------------
/member/aldo-alegria-lynch
Siglos
-------------------------
/member/investor-92469
Angini Kumar
-------------------------
/member/noel-felix
Noel Felix
-------------------------
/member/mustapha-koromah
Mustapha Koromah
-------------------------
/member/ed-hardman
Ed Hardman
-------------------------
/member/michael-high
Michael High
-------------------------
/member/sidney-brown
Sidney Brown
-------------------------
/member/leighsa-thomas
Leighsa Thomas
-------------------------
/member/ta-brown-1
Ta Brown
-------------------------
/member/tyler-wilson
Tyler Wilson
-------------------------
/member/ernestine-jordan
Ernestine Jordan
-------------------------
/member/john-davis
John Davis
-------------------------
/member/richard-tomlin-1
Richard Tomlin
-------------------------
/member/evalee-aqui
EVALEE AQUI
-------------------------
/member/terrance-blake
Terrance Blake
-------------------------
/member/alexander-aminov
Alexander Aminov
-------------------------
/member/aj-golden
AJ Golden
-------------------------
/member/sean-hinely
Sean Hinely
-------------------------
/member/arie-bitton
Arie Bitton
-------------------------
/member/shawn-hanstedt
Shawn Hanstedt
-------------------------
/member/kenneth-kingsberry
Kenneth Kingsberry
-------------------------
/member/udo-ginczek
Udo Ginczek
-------------------------
/member/joshua-edmund
Joshua Edmund
-------------------------
/member/ben-alfasi-1
Ben Alfasi
-------------------------
/member/igor-mosyak
Igor Mosyak
-------------------------
/member/vernon-ryan
Vernon Ryan
-------------------------
/member/investor-75091
Adrienne Jameson
-------------------------
/member/jimmy-castillo
JIMMY CASTILLO
-------------------------
/member/max-michalak
Max Michalak
-------------------------
/member/sandra-velaquez
Sandra Velaquez
Running this code gets me the second item I need, which is the ID.
import requests
from bs4 import BeautifulSoup
 
url = 'https://connectedinvestors.com/member/jonathan-kessous/friends/2'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
for tag in soup.find_all('figure', class_="circle profile bordered bordered-level-1"):
    img = tag.find('img')
    print('-'*25)
    # An img tag has no text, so only its src attribute is printed.
    print(img.get('src'))
This gets me the IDs, but not parsed. I need only the lines that have /user/XXX.
What I am getting is this:
Output:
/uploads/user/241439/img_54528805a7014.jpg
-------------------------
/upload/d4cb5dff-ef8d-af84-9db0-1d614a341497_120_120.jpg
-------------------------
/uploads/user/275056/img_55ab1acf22e1c.jpg
-------------------------
/uploads/user/525892/img_59a56c24a1414.jpg
-------------------------
/uploads/user/505850/img_596a6c30244db.jpg
-------------------------
/upload/c340a93e-ff95-7374-fd92-4f47fb56e784_120_120.jpg
-------------------------
/uploads/user/327008/img_571d2a11e0cd5.jpg
-------------------------
/uploads/user/413128/img_58859c9c83ab3.jpg
-------------------------
/uploads/user/13047/img_56d07787c6016.jpg
-------------------------
/upload/f6fe40d2-59c2-c0f4-c93e-a2d847a508a5_120_120.jpg
-------------------------
/uploads/user/416613/img_58c44934b2b75.jpg
-------------------------
/uploads/user/529286/img_59c9130e1e178.jpg
-------------------------
/uploads/user/434922/img_58ea9786f1849.jpg
-------------------------
/uploads/user/537960/img_5b4905c898ad3.jpg
-------------------------
/uploads/user/10423/img_565a598c78756.jpg
-------------------------
/uploads/user/532026/img_59b026dcc0437.jpg
-------------------------
/uploads/user/437192/img_58cdaef37d2cf.jpg
-------------------------
/uploads/user/296598/img_5656ce77f2b4e.jpg
-------------------------
/uploads/user/502259/img_596212934ae77.jpg
-------------------------
/uploads/user/491719/img_5a18e93a78e3f.jpg
-------------------------
/upload/c1ad2725-d19b-4b54-0987-7a749c3fa594_120_120.jpg
-------------------------
/uploads/user/576574/img_5a2eb0a5c5a47.jpg
-------------------------
/uploads/user/469624/img_595384ae94e5b.jpg
How do I get both on the same line, so that it looks something like this:

name - id

Thanks
(Jan-20-2020, 12:48 PM) jkessous wrote: How do I get both on the same line, so that it looks something like this:

name - id
Each card has its own id attribute, e.g. id="989670".
import requests
from bs4 import BeautifulSoup

url = 'https://connectedinvestors.com/member/jonathan-kessous/friends/2'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'html.parser')
for tag in soup.find_all('div', class_="investorcard clearfix"):
    h4 = tag.find('h4')
    print(h4.a.text.strip(), tag.attrs['id'])
Output:
Jonathan Kessous 454517
Diunique Williams 950184
Aundre Price 989670
Larry Jackson 265408
Josie Djach 577911
... etc.
Let's say you want to take the user ID from this:
Output:
/uploads/user/252881/img_552a543595cab.jpg
Here the parser stops being useful, as it's just text output, so you can e.g. use regex.
>>> import re
>>> 
>>> s = '/uploads/user/252881/img_552a543595cab.jpg'
>>> re.search( r"user/(\d+)", s).group(1)
'252881'
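Note that some of the image paths in the earlier output start with /upload/ and contain no /user/ part. For those, re.search() returns None, and calling .group(1) on it raises an AttributeError. A minimal sketch that handles both cases follows; the helper name user_id_from_src is made up for illustration, and the two paths are taken from the output above:

```python
import re

# Matches the digits between "/user/" and the next slash.
USER_ID = re.compile(r"/user/(\d+)/")


def user_id_from_src(src):
    """Return the numeric ID from an image path, or None if the path has none."""
    match = USER_ID.search(src)
    return match.group(1) if match else None


paths = [
    "/uploads/user/241439/img_54528805a7014.jpg",
    "/upload/d4cb5dff-ef8d-af84-9db0-1d614a341497_120_120.jpg",
]
for path in paths:
    print(path, "->", user_id_from_src(path))
```

Paths without a user ID come back as None, so they can be filtered out or reported separately.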
Thanks snippsat,
I really appreciate it.
I have around 400 pages to run this script on:
https://connectedinvestors.com/member/jonathan-kessous/friends/2
https://connectedinvestors.com/member/jonathan-kessous/friends/3
https://connectedinvestors.com/member/jonathan-kessous/friends/4
etc.
How can I run it over all of them and put the output in an Excel file or something similar? I will need this data for the next script.

Thanks buddy
What an awesome forum
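One possible approach is a plain loop over the page numbers (no recursion needed) that writes the rows to a CSV file, which Excel can open. This is only a sketch under assumptions: the function names parse_cards and scrape_pages are invented here, the URL pattern is taken from the links above, and the card markup is assumed to match the "investorcard clearfix" example earlier in the thread.

```python
import csv

from bs4 import BeautifulSoup

# URL pattern taken from the page links in the thread.
BASE = "https://connectedinvestors.com/member/jonathan-kessous/friends/{page}"


def parse_cards(html):
    """Return (name, id) pairs from the HTML of one friends page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="investorcard clearfix"):
        h4 = card.find("h4")
        if h4 is None or h4.a is None:  # skip cards without a name link
            continue
        rows.append((h4.a.text.strip(), card.get("id")))
    return rows


def scrape_pages(session, first, last, out_path):
    """Fetch pages first..last with session and write all rows to a CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "id"])
        for page in range(first, last + 1):
            response = session.get(BASE.format(page=page))
            writer.writerows(parse_cards(response.content))
```

With a logged-in requests Session named s, as in the login code earlier in the thread, a call such as scrape_pages(s, 2, 400, "friends.csv") would cover the pages listed above; the page range and file name are placeholders. A short time.sleep() between requests is a reasonable courtesy to the server.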