Python Forum
WGET + Data Science + Python Programs
#1
Alright everyone, this isn't a direct Python question. However, it's a prerequisite for large Data Science Python programs: using WGET to mirror a website, or part of a website, offline for use with Data Science and Python.

Here is my problem.

I was able to mirror https://law.justia.com/cases/federal/dis...ts/FSupp2/ using the following lines of WGET with success.

wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --no-check-certificate \
     --output-file=logfile \
     --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20160101 Firefox/66.0" \
     --domains law.justia.com \
     --no-parent \
         https://law.justia.com/cases/federal/district-courts/FSupp2/
I now have a successful offline copy in a .tar file. Awesome! Worked great!

The problem I ran into is this next part. Now I need FSupp3, and that isn't on law.justia.com.

So let's head over to leagle.com by visiting https://www.leagle.com/decisions

This has multiple datasets listed in PHP. Very nice! I see FSupp3 right there. Let's check it out.

https://www.leagle.com/decisions/browse/series/F.Supp.3d


Now, using the same command, the WGET scrape fails. The site redirects to a different URL path, and the crawl stops there. I don't know how to remedy this without running WGET on the entire website (which I have attempted several times over the past couple of years). I can get almost the entire site, but when the crawl reaches the Attorney directory it starts following URLs padded with long runs of % % % % spacing, and it eventually locks up and doesn't continue. It seems to freeze at the same spot every time, regardless of changing IPs, etc. It's not a server ban (I checked over and over again).

Let's navigate to Volume 1 of Federal Supplement 3d: https://www.leagle.com/decisions/browse/...0F.Supp.3d

Then let's click the first legal opinion: https://www.leagle.com/decision/infdco20140305d70

See the difference in the URL paths? The listings live under /decisions/browse/ while the opinions live under /decision/. As far as I understand, that makes it impossible to write a --no-parent rule that works without also blocking the actual decisions within that dataset.

I may end up with the list of cases but no actual cases, because after the WGET run the cases would be at https://www.leagle.com/decision/[decision].html, outside the directory I started from.
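The scoping rule being asked for can be pictured as a small Python sketch. It is only an illustration of the idea, not a tested crawler: the two path prefixes are taken from the URLs above, while the names `ALLOWED_HOST`, `ALLOWED_PREFIXES`, and `in_scope` are made up for this example.

```python
from urllib.parse import urlsplit

# Assumption: listing pages sit under /decisions/browse/ and the opinions
# themselves sit under /decision/, as the example URLs in this post suggest.
ALLOWED_HOST = "www.leagle.com"
ALLOWED_PREFIXES = ("/decisions/browse/", "/decision/")


def in_scope(url: str) -> bool:
    """Return True if the URL belongs to one of the allowed path prefixes."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    if parts.hostname != ALLOWED_HOST:
        return False
    # str.startswith accepts a tuple and matches any of its entries.
    return parts.path.startswith(ALLOWED_PREFIXES)
```

A rule like this keeps both the case lists and the cases, which a single --no-parent starting point cannot do, while still rejecting everything else on the site (for example a hypothetical path such as /attorneys/...).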

Does anyone know how to modify these WGET lines so that I can do a WGET pull on specific datasets without pulling the entire website down? Mirroring the whole site is not a viable option.

Any assistance would be helpful! Thank you for the Python forum. Awesome LANG!


Below are the WGET lines that I need help altering so they work for the dataset URL provided:


wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --no-check-certificate \
     --output-file=logfile \
     --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20160101 Firefox/66.0" \
     --domains www.leagle.com \
     --no-parent \
         https://www.leagle.com/decisions/browse/series/F.Supp.3d
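
One possible direction, sketched in Python so the command can be assembled and inspected before anything is downloaded: keep the flags from the command above, but swap `--no-parent` for wget's `--include-directories` option listing both path prefixes. This is an untested assumption about what would work on this site, not a confirmed fix, and `build_wget_command` is a hypothetical helper name.

```python
import shlex

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) "
    "Gecko/20160101 Firefox/66.0"
)


def build_wget_command(start_url, domain, include_dirs, log_file="logfile"):
    """Build the wget argument list; nothing is executed here."""
    return [
        "wget",
        "--recursive",
        "--no-clobber",
        "--page-requisites",
        "--html-extension",
        "--convert-links",
        "--no-check-certificate",
        "--output-file=" + log_file,
        "--user-agent=" + USER_AGENT,
        "--domains=" + domain,
        # Assumption: limiting by directory list replaces --no-parent,
        # which would exclude /decision/ from a /decisions/browse/ start.
        "--include-directories=" + ",".join(include_dirs),
        start_url,
    ]


cmd = build_wget_command(
    "https://www.leagle.com/decisions/browse/series/F.Supp.3d",
    "www.leagle.com",
    ["/decisions/browse", "/decision"],
)
# Print a shell-safe version of the command for review (Python 3.8+).
print(shlex.join(cmd))
```

Note that including /decisions/browse would still let the crawl reach the other series listed there, so this narrows the pull rather than isolating F.Supp.3d alone. The list can be handed to `subprocess.run(cmd)` once it looks right.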
Best Regards and God bless,

Brandon Kastning
Pre-Law College Student
Newbie Python Coder


Messages In This Thread
WGET + Data Science + Python Programs - by BrandonKastning - Mar-29-2020, 06:43 PM
