Python Forum

Substring extraction
Hello,

I have a list of strings like this:

<a href="https://filmovitica.com/neprijatelj-2011-domaci-film-gledaj-online/" rel="bookmark">Neprijatelj (2011) domaći film gledaj online</a>

I need to extract the following:

Neprijatelj (2011) domaći film gledaj online

I tried this:

print(re.search('">(.*)</a>', link))

But that results in:

<re.Match object; span=(91, 141), match='">Neprijatelj (2011) domaći film gledaj online</a>'>
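
A note on the output above: re.search returns a match object (or None when nothing matches), not the text itself, which is why print shows the object's repr. A minimal sketch of pulling out only the captured group, assuming every string contains one anchor shaped like the sample; the variable names are illustrative, and the non-greedy .*? is a small change from the original pattern:

```python
import re

# Sample string from the post; in practice it would come from the list.
link = ('<a href="https://filmovitica.com/neprijatelj-2011-domaci-film-gledaj-online/" '
        'rel="bookmark">Neprijatelj (2011) domaći film gledaj online</a>')

# Non-greedy group so the capture stops at the first closing tag.
match = re.search(r'">(.*?)</a>', link)

# re.search returns None when nothing matches, so guard before calling group().
title = match.group(1) if match else None
print(title)
```

For strings this regular the regex is probably adequate; it would be fragile against arbitrary HTML.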

It is a list of strings like this:

<a href="https://filmovitica.com/kraj-nedelje-1975-domaci-film-gledaj-online/" rel="bookmark">Kraj nedelje (1975) domaći film gledaj online</a>
<a href="https://filmovitica.com/cvetje-v-jeseni-1973-domaci-film-gledaj-online/" rel="bookmark">Cvetje v jeseni (1973) domaći film gledaj online</a>
<a href="https://filmovitica.com/sve-ce-to-narod-pozlatiti-1995-domaci-film-gledaj-online/" rel="bookmark">Sve će to narod pozlatiti (1995) domaći film gledaj online</a>
<a href="https://filmovitica.com/imam-nesto-vazno-da-vam-kazem-2005-domaci-film-gledaj-online/" rel="bookmark">Imam nesto vazno da vam kazem (2005) domaći film gledaj online</a>
<a href="https://filmovitica.com/kala-1958-domaci-film-gledaj-online/" rel="bookmark">Kala (1958) domaći film gledaj online</a>
<a href="https://filmovitica.com/oglas-1974-domaci-film-gledaj-online/" rel="bookmark">Oglas (1974) domaći film gledaj online</a>
<a href="https://filmovitica.com/mali-vojnici-1967-domaci-film-gledaj-online/" rel="bookmark">Mali vojnici (1967) domaći film gledaj online</a>
<a href="https://filmovitica.com/sinovci-2006-domaci-film-gledaj-online/" rel="bookmark">Sinovci (2006) domaći film gledaj online</a>
<a href="https://filmovitica.com/volca-nok-1955-vucja-noc-1955-domaci-film-gledaj-online/" rel="bookmark">Volca nok (1955) – Vucja noc (1955) domaći film gledaj online</a>
<a href="https://filmovitica.com/grad-1963-domaci-film-gledaj-online/" rel="bookmark">Grad (1963) domaći film gledaj online</a>
<a href="https://filmovitica.com/sta-se-dogodilo-sa-filipom-preradovicem-1977-domaci-film-gledaj-online/" rel="bookmark">Sta se dogodilo sa Filipom Preradovicem (1977) domaći film gledaj online</a>
<a href="https://filmovitica.com/hoja-lero-1952-domaci-film-gledaj-online/" rel="bookmark">Hoja! Lero! (1952) domaći film gledaj online</a>
<a href="https://filmovitica.com/roman-sa-kontrabasom-1972-domaci-film-gledaj-online/" rel="bookmark">Roman sa kontrabasom (1972) domaći film gledaj online</a>
<a href="https://filmovitica.com/zagreb-cappuccino-2014-domaci-film-gledaj-online/" rel="bookmark">Zagreb Cappuccino (2014) domaći film gledaj online</a>
<a href="https://filmovitica.com/prica-o-fabrici-1949-domaci-film-gledaj-online/" rel="bookmark">Prica o fabrici (1949) domaći film gledaj online</a>
<a href="https://filmovitica.com/put-ruzama-posut-2013-domaci-film-gledaj-online/" rel="bookmark">Put Ruzama Posut (2013) domaći film gledaj online</a>
<a href="https://filmovitica.com/pomorandzina-kora-2016-domaci-film-gledaj-online/" rel="bookmark">Pomorandžina kora (2016) domaći film gledaj online</a>
<a href="https://filmovitica.com/plava-ruza-domaci-film-gledaj-online/" rel="bookmark">Plava ruža domaći film gledaj online</a>
<a href="https://filmovitica.com/ubica-na-odsustvu-1965-domaci-film-gledaj-online/" rel="bookmark">Ubica na odsustvu (1965) domaći film gledaj online</a>
<a href="https://filmovitica.com/hudodelci-1987-domaci-film-gledaj-online/" rel="bookmark">Hudodelci (1987) domaći film gledaj online</a>
<a href="https://filmovitica.com/lazar-1984-domaci-film-gledaj-online/" rel="bookmark">Lazar (1984) domaći film gledaj online</a>

The name always starts after a > and ends before a <, so I figured this might be the way, but I did not manage to get it working right.

Thanks in advance
You may find this SO answer interesting: https://stackoverflow.com/a/1732454/4046632

Use proper tools to parse HTML, e.g. BeautifulSoup. Take a look at our web-scraping tutorial, part 1.
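
To illustrate what parsing with a proper tool can look like without installing anything, here is a rough sketch that uses html.parser from the standard library as a stand-in for BeautifulSoup. The class name AnchorCollector and the two-line sample input are assumptions made for the example; it simply records the href and text of every <a> element it sees:

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text)))
            self._href = None


# Two sample lines taken from the list in the question.
html = (
    '<a href="https://filmovitica.com/kala-1958-domaci-film-gledaj-online/" '
    'rel="bookmark">Kala (1958) domaći film gledaj online</a>\n'
    '<a href="https://filmovitica.com/oglas-1974-domaci-film-gledaj-online/" '
    'rel="bookmark">Oglas (1974) domaći film gledaj online</a>\n'
)

parser = AnchorCollector()
parser.feed(html)
parser.close()

for href, text in parser.links:
    print(href)
    print(text)
```

BeautifulSoup does the same job with far less code, which is why it is the usual recommendation.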
I am sorry, but I just cannot get it working that way. Can you please help me with it?
(Apr-23-2019, 05:14 PM) nevendary wrote: I am sorry, but I just cannot get it working that way.
What exactly is the problem? Post your code in python tags and the full traceback in error tags. Also note, my advice is to work with the original HTML source.

Buran, I guess we could communicate in Czech, but let's keep it this way.

See, there is a website that I am trying to pull data from for a Kodi addon.
So I am trying to get all the movie links and their names.

For example:

<a href="https://filmovitica.com/lazar-1984-domaci-film-gledaj-online/" rel="bookmark">Lazar (1984) domaći film gledaj online</a>

url=https://filmovitica.com/lazar-1984-domaci-film-gledaj-online/
name=Lazar (1984)

I have already managed to get the URL into a variable. I am trying to get the name now. I guess once I manage to get: Lazar (1984) domaći film gledaj online
I can then cut off the last few characters to get only Lazar (1984).
(Apr-23-2019, 06:56 PM) nevendary wrote: Buran, I guess we could communicate in Czech, but let's keep it this way.
No, I don't speak Czech. And you are not listening to what I tell you.

import requests
from bs4 import BeautifulSoup

url = 'https://filmovitica.com/film/domaci/'

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')  # if you have lxml, you can use it as the parser instead
div_items = soup.find_all('div', {'class': 'item-text'})
for div in div_items:
    link = div.find('a')
    print(link.get('href'))  # this is the link
    print(link.text)  # this is the movie title

If you want, you can replace "domaći film gledaj online" with an empty string to remove it, or you can use slicing for that purpose.
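
Both suggestions could look roughly like this, assuming every title ends with exactly the phrase ' domaći film gledaj online' (the helper name short_name and the SUFFIX constant are made up for this sketch):

```python
SUFFIX = ' domaći film gledaj online'  # assumed fixed trailing phrase


def short_name(title):
    """Drop the trailing site phrase by slicing; leave other titles untouched."""
    if title.endswith(SUFFIX):
        return title[:-len(SUFFIX)]
    return title


title = 'Lazar (1984) domaći film gledaj online'
print(short_name(title))          # slicing variant
print(title.replace(SUFFIX, ''))  # replace-with-empty-string variant
```

The slicing variant only touches the end of the string, so a title that happened to contain the phrase elsewhere would be left intact. Python 3.9+ also offers str.removesuffix, but the endswith check works on older interpreters too.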

Alternatively, you can replace the last five lines with:

h4_items = soup.find_all('h4', {'class': 'entry_title'})
for h4 in h4_items:
    link = h4.find('a')
    print(link.get('href'))
    print(link.text)

i.e. search for h4 instead of div.
Awesome, thank you very much and sorry for bothering!