Python Forum
Scraping data from non-standardized pages?
#1
Hi guys,

I would like to scrape and organize data from HTML documents.
So far I have been learning scraping and data presentation on sites with a different structure (shopping sites, with offers in elements).

I am curious whether it is feasible to scrape and organize data from thousands of documents that are not standardized. What I mean is that the information is sometimes at the top of the document, sometimes at the bottom, and pretty much always in a different place.
Let's say I would like to get the data from the "SUMMARY COMPENSATION TABLE" (in both of the files below).
For one specific page it is doable (using indexes, find, etc.).

Is there anything that can be done across thousands of files like that? I cannot rely on a specific div or other HTML element, because every table is named the same (only the font differs).

I just don't know how to tell Python to look for "SUMMARY COMPENSATION TABLE" and get all the data from the table below it.

Example of page #1
https://www.sec.gov/Archives/edgar/data/...def14a.htm
Example of page #2
https://www.sec.gov/Archives/edgar/data/...def14a.htm

Do you have any thoughts or ideas on whether this is even doable?
#2
(Nov-20-2019, 02:27 PM)zarize Wrote: [ ... ]
I cannot rely on a specific div or other HTML element, because every table is named the same (only the font differs).

I just don't know how to tell Python to look for "SUMMARY COMPENSATION TABLE" and get all the data from the table below it.

Hi!

I think you could search for "SUMMARY COMPENSATION TABLE" and then use either the size of the biggest table, or something you know the table always ends with, to decide where the copying stops. For example, you could take the 40 lines after "SUMMARY COMPENSATION TABLE" (if the tables are usually that tall or shorter).
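A minimal sketch of that line-window idea, assuming the document has already been reduced to plain text with one table row per line. The function name `lines_after_heading` and the parameters `max_lines` and `end_marker` are made up for illustration; they do not come from any library.

```python
def lines_after_heading(text, heading="SUMMARY COMPENSATION TABLE",
                        max_lines=40, end_marker=None):
    """Return the lines that follow the first line containing `heading`.

    At most `max_lines` lines are returned. If `end_marker` is given and
    appears inside that window, the result stops at (and includes) the
    first line containing it.
    """
    wanted = " ".join(heading.split()).lower()
    lines = text.splitlines()
    for i, line in enumerate(lines):
        # Collapse runs of whitespace so uneven spacing still matches.
        if wanted in " ".join(line.split()).lower():
            window = lines[i + 1 : i + 1 + max_lines]
            if end_marker is not None:
                for j, candidate in enumerate(window):
                    if end_marker in candidate:
                        return window[: j + 1]
            return window
    return []  # heading not found
```

A fixed window is crude: it can cut a long table short or drag in text that follows a short one, so a known end marker (when one exists) is the safer stop condition.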

You may find some ideas here:
https://python-forum.io/Thread-print-a-w...ord-search

All the best,
newbieAuggie2019
#3
Hi, thanks for your input.

Hmm... I was thinking about finding the string and then writing code to find which 'div' it is located in.
Then find the 'table' inside that 'div' and return it.

Is that correct and doable?

The string sits in a structure like this:
<div>
<p style>
<p style>
<font style>
<b> SUMMARY COMPENSATION TABLE </b>

Can I get this div by finding the string? I mean, the div is located higher up in the tree.
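One way this could work, sketched with only the standard library's `html.parser`: build a small tree with parent links, find the element whose own text contains the heading, climb to each enclosing 'div' in turn, and return the first 'table' found inside one of them. `Node`, `TreeBuilder`, `table_near_heading` and `table_rows` are hypothetical names for this sketch. It assumes the heading sits in a single element (as in the `<b>` above) and that table tags are properly closed.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "meta", "link", "col", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []
        self.text = ""  # text placed directly inside this element

    def iter(self):
        """Yield this node and all descendants (pre-order)."""
        yield self
        for child in self.children:
            yield from child.iter()

    def full_text(self):
        return " ".join("".join(n.text for n in self.iter()).split())


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def table_near_heading(html, heading="SUMMARY COMPENSATION TABLE"):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    wanted = " ".join(heading.split()).lower()
    for node in builder.root.iter():
        if wanted in " ".join(node.text.split()).lower():
            ancestor = node
            while ancestor is not None:  # climb towards the root
                if ancestor.tag == "div":
                    for inner in ancestor.iter():
                        if inner.tag == "table":
                            return inner
                ancestor = ancestor.parent
    return None


def table_rows(table):
    """Return the table as a list of rows, each a list of cell texts."""
    return [[c.full_text() for c in row.iter() if c.tag in ("td", "th")]
            for row in table.iter() if row.tag == "tr"]
```

Two caveats: if the same heading text also appears earlier in the file (for example in a table of contents), the first match wins, and a large wrapper 'div' may contain an unrelated table that comes before the heading. Both cases would need extra checks on real filings.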
#4
Pandas can help here, as it can find tables on a website.
Here is a Notebook where I extract a table using a match word that appears in the table.
The site is a little messy, as they use tables for a lot of stuff, but you can use match (a word in the table) as shown to get the right table.
Also look at this thread: pandas library tricks.
#5
Thank you very much for your input! :)

But in this case it probably wouldn't work. I was thinking about finding the header of the table, i.e. "SUMMARY COMPENSATION TABLE", and then getting the table below it.

If I followed the approach you proposed, I would always need someone's name from the table, but the names will always be different (and the tables can have differently named columns). I think the only thing those files have in common is this header.

The goal is to extract this table automatically from any number of files like these, so I think another approach is needed (if it's even possible).
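A rough sketch of the "find the header, then take the table below it" idea, again using only the standard library's `html.parser`. It reads the document in order, waits until the heading text has been seen, then collects the rows of the next table that opens. `TableAfterHeading`, `extract_table` and `extract_from_files` are hypothetical names, and the file paths are assumed to be local copies of the filings.

```python
from html.parser import HTMLParser
from pathlib import Path


class TableAfterHeading(HTMLParser):
    """Collect the rows of the first table that starts after the heading."""

    def __init__(self, heading):
        super().__init__()
        self.target = " ".join(heading.split()).lower()
        self.seen_heading = False
        self.depth = 0      # > 0 while inside the table being captured
        self.done = False
        self.rows = []
        self.cell = None    # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if self.done or not self.seen_heading:
            return
        if tag == "table":
            self.depth += 1
        elif self.depth and tag == "tr":
            self.rows.append([])
        elif self.depth and tag in ("td", "th"):
            if not self.rows:
                self.rows.append([])  # tolerate a cell without a <tr>
            self.cell = []

    def handle_endtag(self, tag):
        if self.done or not self.depth:
            return
        if tag in ("td", "th") and self.cell is not None:
            self.rows[-1].append(" ".join("".join(self.cell).split()))
            self.cell = None
        elif tag == "table":
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)
        elif not self.seen_heading:
            # Case-insensitive, whitespace-insensitive match on the heading.
            if self.target in " ".join(data.split()).lower():
                self.seen_heading = True


def extract_table(html, heading="SUMMARY COMPENSATION TABLE"):
    parser = TableAfterHeading(heading)
    parser.feed(html)
    parser.close()
    return parser.rows or None  # None when no heading or no table was found


def extract_from_files(paths, heading="SUMMARY COMPENSATION TABLE"):
    """Map each file path to its extracted rows (or None)."""
    results = {}
    for path in paths:
        html = Path(path).read_text(encoding="utf-8", errors="replace")
        results[str(path)] = extract_table(html, heading)
    return results
```

Because it depends only on document order, this does not care which 'div' or font wraps the heading. It will still pick the wrong table if the heading text first appears somewhere else, such as a table of contents, so the results from a large batch are worth spot-checking.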