Python Forum
Substring extraction
#1
Hello,

I have a list of strings like this:

<a href="https://filmovitica.com/neprijatelj-2011-domaci-film-gledaj-online/" rel="bookmark">Neprijatelj (2011) domaći film gledaj online</a>

I need to extract the following:

Neprijatelj (2011) domaći film gledaj online

I tried this:

print(re.search('">(.*)</a>', link))

But that results in:

<re.Match object; span=(91, 141), match='">Neprijatelj (2011) domaći film gledaj online</a>'>

The full list of strings looks like this:

<a href="https://filmovitica.com/kraj-nedelje-1975-domaci-film-gledaj-online/" rel="bookmark">Kraj nedelje (1975) domaći film gledaj online</a>
<a href="https://filmovitica.com/cvetje-v-jeseni-1973-domaci-film-gledaj-online/" rel="bookmark">Cvetje v jeseni (1973) domaći film gledaj online</a>
<a href="https://filmovitica.com/sve-ce-to-narod-pozlatiti-1995-domaci-film-gledaj-online/" rel="bookmark">Sve će to narod pozlatiti (1995) domaći film gledaj online</a>
<a href="https://filmovitica.com/imam-nesto-vazno-da-vam-kazem-2005-domaci-film-gledaj-online/" rel="bookmark">Imam nesto vazno da vam kazem (2005) domaći film gledaj online</a>
<a href="https://filmovitica.com/kala-1958-domaci-film-gledaj-online/" rel="bookmark">Kala (1958) domaći film gledaj online</a>
<a href="https://filmovitica.com/oglas-1974-domaci-film-gledaj-online/" rel="bookmark">Oglas (1974) domaći film gledaj online</a>
<a href="https://filmovitica.com/mali-vojnici-1967-domaci-film-gledaj-online/" rel="bookmark">Mali vojnici (1967) domaći film gledaj online</a>
<a href="https://filmovitica.com/sinovci-2006-domaci-film-gledaj-online/" rel="bookmark">Sinovci (2006) domaći film gledaj online</a>
<a href="https://filmovitica.com/volca-nok-1955-vucja-noc-1955-domaci-film-gledaj-online/" rel="bookmark">Volca nok (1955) – Vucja noc (1955) domaći film gledaj online</a>
<a href="https://filmovitica.com/grad-1963-domaci-film-gledaj-online/" rel="bookmark">Grad (1963) domaći film gledaj online</a>
<a href="https://filmovitica.com/sta-se-dogodilo-sa-filipom-preradovicem-1977-domaci-film-gledaj-online/" rel="bookmark">Sta se dogodilo sa Filipom Preradovicem (1977) domaći film gledaj online</a>
<a href="https://filmovitica.com/hoja-lero-1952-domaci-film-gledaj-online/" rel="bookmark">Hoja! Lero! (1952) domaći film gledaj online</a>
<a href="https://filmovitica.com/roman-sa-kontrabasom-1972-domaci-film-gledaj-online/" rel="bookmark">Roman sa kontrabasom (1972) domaći film gledaj online</a>
<a href="https://filmovitica.com/zagreb-cappuccino-2014-domaci-film-gledaj-online/" rel="bookmark">Zagreb Cappuccino (2014) domaći film gledaj online</a>
<a href="https://filmovitica.com/prica-o-fabrici-1949-domaci-film-gledaj-online/" rel="bookmark">Prica o fabrici (1949) domaći film gledaj online</a>
<a href="https://filmovitica.com/put-ruzama-posut-2013-domaci-film-gledaj-online/" rel="bookmark">Put Ruzama Posut (2013) domaći film gledaj online</a>
<a href="https://filmovitica.com/pomorandzina-kora-2016-domaci-film-gledaj-online/" rel="bookmark">Pomorandžina kora (2016) domaći film gledaj online</a>
<a href="https://filmovitica.com/plava-ruza-domaci-film-gledaj-online/" rel="bookmark">Plava ruža domaći film gledaj online</a>
<a href="https://filmovitica.com/ubica-na-odsustvu-1965-domaci-film-gledaj-online/" rel="bookmark">Ubica na odsustvu (1965) domaći film gledaj online</a>
<a href="https://filmovitica.com/hudodelci-1987-domaci-film-gledaj-online/" rel="bookmark">Hudodelci (1987) domaći film gledaj online</a>
<a href="https://filmovitica.com/lazar-1984-domaci-film-gledaj-online/" rel="bookmark">Lazar (1984) domaći film gledaj online</a>

The name always starts after a > and ends before a <, so I figured a regex might be the way, but I did not manage to get it working right.

Thanks in advance
Reply
#2
You may find this SO answer interesting: https://stackoverflow.com/a/1732454/4046632

Use proper tools to parse HTML - e.g. BeautifulSoup. Take a look at our web-scraping tutorial - part1
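
As for the regex attempt in the first post: re.search returns a Match object (or None), not the matched text, which is why print showed the whole object. The captured part is available through group(1). Below is a minimal sketch, assuming each string holds exactly one well-formed anchor like the ones posted; the helper name extract_title is made up for illustration. A real HTML parser remains the more robust choice for full pages.

```python
import re

# Hypothetical helper: returns the text between '>' and '</a>' for a single
# anchor string, or None when the pattern is not found.
def extract_title(link):
    match = re.search(r'>([^<]+)</a>', link)
    return match.group(1) if match else None

link = ('<a href="https://filmovitica.com/neprijatelj-2011-domaci-film-gledaj-online/" '
        'rel="bookmark">Neprijatelj (2011) domaći film gledaj online</a>')
print(extract_title(link))
```

The [^<]+ group stops at the first <, so it cannot run past the closing tag the way a greedy .* could if a string contained more than one anchor.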
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

Reply
#3
I am sorry, but I just cannot get it working that way. Can you please help me with it?
Reply
#4
(Apr-23-2019, 05:14 PM)nevendary Wrote: I am sorry, but I just cannot get it working that way.
What exactly is the problem? Post your code in python tags and the full traceback in error tags. Also note, my advice is to work with the original HTML source.

Reply
#5
(Apr-23-2019, 06:23 PM)buran Wrote:
(Apr-23-2019, 05:14 PM)nevendary Wrote: I am sorry, but I just cannot get it working that way.
What exactly is the problem? Post your code in python tags and the full traceback in error tags. Also note, my advice is to work with the original HTML source.

Buran, I guess we could communicate in Czech, but let's keep it this way.

See, there is a website that I am trying to pull data from for a Kodi addon.
So I am trying to get all the movie links and their names.

For example:

<a href="https://filmovitica.com/lazar-1984-domaci-film-gledaj-online/" rel="bookmark">Lazar (1984) domaći film gledaj online</a>

url=https://filmovitica.com/lazar-1984-domaci-film-gledaj-online/
name=Lazar (1984)

I already managed to get the URL into a variable. I am trying to get the name now. I guess once I manage to get "Lazar (1984) domaći film gledaj online",
I can then cut off the last few characters to get only "Lazar (1984)".
Reply
#6
(Apr-23-2019, 06:56 PM)nevendary Wrote: Buran, I guess we could communicate in Czech, but let's keep it this way.
No, I don't speak Czech. And you are not listening to what I am telling you.
import requests
from bs4 import BeautifulSoup

url = 'https://filmovitica.com/film/domaci/'

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')  # if you have lxml you can use it as parser instead
div_items = soup.find_all('div', {'class': 'item-text'})
for div in div_items:
    link = div.find('a')
    print(link.get('href'))  # this is the link
    print(link.text)  # this is the movie title
If you want, you can replace "domaći film gledaj online" with an empty string to remove it, or you can use slicing for that purpose.
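
A minimal sketch of the slicing approach, assuming every title ends with the same fixed phrase seen in the posted list; the names SUFFIX and short_title are made up for illustration. The endswith check keeps titles that do not carry the phrase unchanged.

```python
# Assumed constant tail of every title, including the leading space.
SUFFIX = " domaći film gledaj online"

def short_title(text, suffix=SUFFIX):
    """Strip the trailing phrase from a title, if present."""
    text = text.strip()
    if text.endswith(suffix):
        return text[:-len(suffix)]
    return text

print(short_title("Lazar (1984) domaći film gledaj online"))
```

Calling text.replace(SUFFIX, "") would work as well; slicing only touches the end of the string, which is slightly safer should the phrase ever appear in the middle of a title.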

Also, you can replace the last 5 lines with:
h4_items = soup.find_all('h4', {'class': 'entry_title'})
for h4 in h4_items:
    link = h4.find('a')
    print(link.get('href'))
    print(link.text)
i.e. search for h4 instead of div.

Reply
#7
Awesome, thank you very much and sorry for bothering!
Reply

