 Scraping data from non-standardized pages?
#1
Hi guys,

I would like to scrape and organize data from HTML documents.
I have been learning scraping and presenting data on a site with a different structure (shopping/offers in elements).

I am curious whether it is doable to scrape and organize data from thousands of documents that are not standardized. What I mean is that the information is sometimes at the top of the document, sometimes at the bottom, and pretty much always in a different place.
Let's say that I would like to get the data from the "SUMMARY COMPENSATION TABLE" (from both of the files below).
For one specific page it is doable (using indexes, find, etc.).

Is there anything that can be done across thousands of files like that? I cannot rely on a specific div or other HTML element because every table is named the same (only the font differs).

I just don't know how to tell Python to look for "SUMMARY COMPENSATION TABLE" and get all the data from the table below it.

Example of page #1
https://www.sec.gov/Archives/edgar/data/...def14a.htm
Example of page #2
https://www.sec.gov/Archives/edgar/data/...def14a.htm

Do you have any thoughts or ideas on whether this is even doable?
#2
(Nov-20-2019, 02:27 PM)zarize Wrote: [ ... ]
I cannot rely on a specific div or other HTML element because every table is named the same (only the font differs).

I just don't know how to tell Python to look for "SUMMARY COMPENSATION TABLE" and get all the data from the table below it.

Hi!

I think you could search for "SUMMARY COMPENSATION TABLE" and then decide where the table ends, either from the size of the biggest table you expect or from something you know the table always ends with, and use that as the end of the copy/extract step. For example, this could be the 40 lines after "SUMMARY COMPENSATION TABLE" (if the tables are usually that tall or shorter).
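A minimal sketch of that idea, using only the standard library. The function name `lines_after_header`, the `max_lines` cut-off of 40 and the list of tags treated as line breaks are my own assumptions for illustration, not something taken from the filings. It flattens the HTML to text lines and returns the lines that follow the header.

```python
import re
from html import unescape

HEADER = "SUMMARY COMPENSATION TABLE"

def lines_after_header(html_text, header=HEADER, max_lines=40):
    """Return up to max_lines non-empty text lines that follow the header."""
    # Tags that usually end a visual line become newlines (assumed list).
    text = re.sub(r"(?i)<\s*(br|/p|/tr|/div|/table)\b[^>]*>", "\n", html_text)
    # Every remaining tag becomes a space so neighbouring cells stay separated.
    text = re.sub(r"<[^>]+>", " ", text)
    # Decode entities such as &nbsp; and collapse runs of whitespace.
    lines = [" ".join(unescape(line).split()) for line in text.splitlines()]
    lines = [line for line in lines if line]
    for i, line in enumerate(lines):
        if header in line.upper():
            return lines[i + 1 : i + 1 + max_lines]
    return []  # header not found
```

Each table row comes back as one text line, so the column boundaries are lost. That is probably fine for a first look at the data, but not for building a clean table.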

You may find some ideas here:
https://python-forum.io/Thread-print-a-w...ord-search

All the best,
newbieAuggie2019
#3
Hi, thanks for your input.

Hmmm... I was thinking about finding the string and then writing code to find which 'div' it is located in.
Then find the 'table' inside that 'div' and return it.

Is that correct and doable?

The string sits in a structure like this:
<div>
<p style>
<p style>
<font style>
<b> SUMMARY COMPENSATION TABLE </b>

Can I get this div by finding the string? I mean, the div is located higher up in the tree.
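One way to sidestep the question of which div encloses the string is to walk the document in order and take the first table that opens after the header text. Below is a rough sketch with the standard-library `html.parser`. The class and function names are hypothetical, and it assumes the header text is not split across several tags and that cells are closed properly. With Beautiful Soup the same idea would be roughly `find(string=...)` followed by `find_next("table")`, or `find_parent("div")` if you really need the enclosing div.

```python
from html.parser import HTMLParser

class TableAfterHeader(HTMLParser):
    """Collect the rows of the first <table> that opens after the header text."""

    def __init__(self, header):
        super().__init__(convert_charrefs=True)
        self.header = " ".join(header.upper().split())
        self.seen_header = False
        self.depth = 0        # table nesting depth while capturing
        self.done = False
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if self.done or not self.seen_header:
            return
        if tag == "table":
            self.depth += 1
        elif self.depth == 1 and tag == "tr":
            self._row = []
        elif self.depth == 1 and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        if tag == "table":
            self.depth -= 1
            self.done = self.depth == 0   # stop after the outermost table closes
        elif self.depth == 1 and tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif self.depth == 1 and tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self.done:
            return
        if not self.seen_header:
            # Case-insensitive, whitespace-normalised comparison (an assumption).
            self.seen_header = self.header in " ".join(data.upper().split())
        elif self._cell is not None:
            self._cell.append(data)

def find_table_after(html_text, header="SUMMARY COMPENSATION TABLE"):
    parser = TableAfterHeader(header)
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

`find_table_after(html_text)` returns a list of rows, each a list of cell strings, or an empty list when the header or a following table is missing. If a filing also mentions the header in a table of contents, the first match may be the wrong one, so the result still needs a sanity check.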
#4
Pandas can help here, as it can find tables on a web page.
Here is a Notebook where I extract a table using a match word that appears in the table.
The site is a little messy, as they use tables for a lot of stuff, but you can use match (a word in the table) as shown to get the right table.
Also look at this thread: pandas library tricks.
#5
Thank you very much for your input! :)

But in this case it probably wouldn't work. I was thinking about finding the header of the table, i.e. "SUMMARY COMPENSATION TABLE", and then getting the table below it.

If I followed the approach you proposed, I would always need someone's name from the table, but the names will always be different (and the tables can have differently named columns). I think the only common point in those files is this header.

The goal is to get this table automatically from any number of files like this, so I think another approach is needed (if it's even possible).
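For the batch part, here is a rough sketch of how the "header first, then the next table" idea could be run over a folder of saved filings. The function names, the `*.htm` pattern and the regular expressions are assumptions on my side. The non-greedy table regex stops at the first closing table tag, so it will cut nested tables short, and a header split across tags will not be matched.

```python
import re
from pathlib import Path

# Header words may be separated by whitespace or a non-breaking-space entity.
_GAP = r"(?:\s|&nbsp;|&#160;)+"
HEADER_RE = re.compile("SUMMARY" + _GAP + "COMPENSATION" + _GAP + "TABLE", re.IGNORECASE)
TABLE_RE = re.compile(r"<table\b.*?</table\s*>", re.IGNORECASE | re.DOTALL)

def table_html_after_header(html_text):
    """Return the raw HTML of the first <table> after the header, or None."""
    header = HEADER_RE.search(html_text)
    if header is None:
        return None
    # Start looking for a table only after the end of the header match.
    table = TABLE_RE.search(html_text, header.end())
    return table.group(0) if table else None

def scan_folder(folder, pattern="*.htm"):
    """Map each file name to its table HTML, or None when nothing was found."""
    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = table_html_after_header(text)
    return results
```

Files that map to `None` are the ones to inspect by hand. The returned snippet holds only the one table, so it could then be passed to `pandas.read_html` (wrapped in `io.StringIO` on recent pandas versions) without needing a match word from inside the table.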