Mar-29-2020, 06:43 PM
(This post was last modified: Mar-29-2020, 06:43 PM by BrandonKastning.)
Alright everyone: this isn't a direct Python question, but it is a prerequisite for large data-science Python programs: using wget to mirror a website (or part of one) offline for use with Python and data science.
Here is my problem.
I was able to mirror https://law.justia.com/cases/federal/dis...ts/FSupp2/ successfully using the following wget command.
wget \
  --recursive \
  --no-clobber \
  --page-requisites \
  --html-extension \
  --convert-links \
  --no-check-certificate \
  --output-file=logfile \
  --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20160101 Firefox/66.0" \
  --domains law.justia.com \
  --no-parent \
  https://law.justia.com/cases/federal/district-courts/FSupp2/
I now have a successful offline copy in a .tar file. Awesome! Worked great!
The problem I ran into is this next part. Now I need FSupp3 and that isn't on law.justia.com
So let's head over to leagle.com by visiting https://www.leagle.com/decisions
This has multiple datasets listed in PHP. Very nice! I see FSupp3 right there. Let's check it out.
https://www.leagle.com/decisions/browse/series/F.Supp.3d
Now, using the same command, the wget scrape fails: the request redirects to a different URL path and the mirror breaks. I don't know how to remedy this without running wget against the entire website (which I have attempted several times over the past couple of years). I can get almost the whole site, but when the crawl reaches the attorney directory it starts hitting URLs full of percent-encoded spaces and eventually locks up and doesn't continue. It freezes at the same spot every time regardless of changing IPs, etc. It's not a server ban (I checked over and over again).
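If a full-site crawl is ever retried, one possible workaround for the attorney-directory hang is wget's --reject-regex flag (wget 1.14+), which skips matching URLs before they are fetched. This is only a sketch: the /attorneys/ path and the percent-encoding pattern are guesses about leagle.com's layout, not confirmed paths, so the command is echoed rather than executed here so it can be inspected and adjusted first.

```shell
# Sketch only: skip the directory that hangs and any URL containing
# percent-encoded characters (the "% % % %"-spaced links). Both patterns
# are assumptions about leagle.com and should be tuned after a dry run.
REJECT='--reject-regex=(/attorneys/|%[0-9A-Fa-f]{2})'
echo "wget --recursive --no-parent $REJECT https://www.leagle.com/"
```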
Let's navigate to Volume 1 of Federal Supplement 3d : https://www.leagle.com/decisions/browse/...0F.Supp.3d
Then let's click the first Legal Opinion : https://www.leagle.com/decision/infdco20140305d70
See the difference in the URL paths? As far as I can tell, that makes it impossible to write a --no-parent rule that works without also blocking the actual decisions within that dataset.
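One possible way around the path split (a sketch, not tested against leagle.com): drop --no-parent and instead whitelist both trees with --include-directories, which limits recursion to the listed path prefixes, so the /decisions/browse index pages and the /decision opinion pages both stay in scope. The command is built and echoed rather than run so the flags can be reviewed first.

```shell
# Sketch: whitelist both URL trees instead of relying on --no-parent.
# --include-directories takes comma-separated path prefixes; anything
# outside them is skipped during recursion.
CMD="wget --recursive --no-clobber --page-requisites --html-extension \
--convert-links --domains www.leagle.com \
--include-directories=/decisions/browse,/decision \
https://www.leagle.com/decisions/browse/series/F.Supp.3d"
echo "$CMD"
```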
I may end up with the list of cases but no actual cases after the wget run, since the cases themselves live at https://www.leagle.com/decision/[decision].html.
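If the flags can't be tuned, a two-step fallback may work: mirror only the browse index, then extract the /decision/... links from the saved pages and hand them to wget via --input-file. The sketch below uses an embedded sample page as a stand-in for a mirrored leagle.com index; the real href format is an assumption and may differ.

```shell
# Sketch: harvest decision links from an already-downloaded browse page.
# browse_sample.html is a stand-in for a mirrored leagle.com index page;
# its markup is assumed, not confirmed.
cat > browse_sample.html <<'EOF'
<a href="/decision/infdco20140305d70">Case One</a>
<a href="/decision/infdco20140305d71">Case Two</a>
<a href="/decisions/browse/series/F.Supp.3d">next page</a>
EOF

# Keep only /decision/... hrefs and expand them to absolute URLs.
grep -o 'href="/decision/[^"]*"' browse_sample.html \
  | sed 's|^href="|https://www.leagle.com|; s|"$||' > urls.txt

cat urls.txt
# Second pass would then be:
#   wget --input-file=urls.txt --page-requisites --convert-links ...
```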
Does anyone know how to modify these wget lines so that I can pull specific datasets without mirroring the entire website? Pulling the whole site down isn't a viable option.
Any assistance would be helpful! Thank you for the Python forum. Awesome language!
Below are the wget lines I need help altering so they work for the dataset URL above:
wget \
  --recursive \
  --no-clobber \
  --page-requisites \
  --html-extension \
  --convert-links \
  --no-check-certificate \
  --output-file=logfile \
  --user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20160101 Firefox/66.0" \
  --domains www.leagle.com \
  --no-parent \
  https://www.leagle.com/decisions/browse/series/F.Supp.3d
Best Regards and God bless,
Brandon Kastning
Pre-Law College Student
Newbie Python Coder
“And one of the elders saith unto me, Weep not: behold, the Lion of the tribe of Juda, the Root of David, hath prevailed to open the book,...” - Revelation 5:5 (KJV)
“And oppress not the widow, nor the fatherless, the stranger, nor the poor; and ...” - Zechariah 7:10 (KJV)
#LetHISPeopleGo