Mar-03-2019, 01:43 AM
Mar-03-2019, 01:50 AM
Almost every dictionary that I create is nested. This is a generic dictionary function that I wrote for my own use, and I wanted it to be able to display any dictionary, nested or not.
Mar-03-2019, 01:54 AM
So basically we don't really have a nested dictionary in this case, right?
Mar-03-2019, 01:59 AM
Correct.
I use this routine a lot when I'm scraping new sites because I usually build a dictionary which is actually a partial sitemap of
the site I'm scraping. It makes it easier to find my target areas.
Mar-05-2019, 12:17 AM
I'm trying to do something similar here:
#! python3
# n-gram for an inauguration speech
import requests
from bs4 import BeautifulSoup
import re
import string
import operator


def cleanInput(input):
    input = re.sub('\n', " ", input).lower()
    input = re.sub('\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput


def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        else:
            output[ngramTemp] += 1
    return output


content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
print(sortedNGrams)

but get this:
Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\speech.py", line 35, in <module>
    content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
AttributeError: 'Response' object has no attribute 'read'
When I remove read() I keep getting a different sort of error. If it would help I can put it here. Any idea how to fix this line?
Mar-05-2019, 01:23 AM
this line:

content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')

I would do this in several steps so that you can check the status of the request.
replace with:
response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
if response.status_code == 200:
    content = response.text

Also, I don't know what your ultimate goal is, but this looks like something you might want to analyze with NLTK; at least read a bit about what it's capable of doing.
See: https://www.nltk.org/
You can install it and play with it a bit.
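A possible way to get it set up (just a sketch; the exact commands depend on your environment) is to install it with pip and optionally grab the extra data from a Python shell:

# Install from the command line (outside Python):
#   pip install nltk
# Then, optionally, fetch corpora/tokenizers for later experiments:
import nltk
nltk.download()   # opens NLTK's interactive downloader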
For example, getting n-grams with your current code would look something like:
import nltk, re, string, collections
from nltk.util import ngrams

tokenized = content.split()
esBigrams = ngrams(tokenized, 2)

That will get all of the n-grams.
Now you can go a step further and get the frequency of each bigram in our corpus:
esBigramFreq = collections.Counter(esBigrams)

Get the 10 most common n-grams in the text:
esBigramFreq.most_common(10)

and lots more.
Here's the tutorial for the above code: https://www.kaggle.com/rtatman/tutorial-getting-n-grams
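If it helps to see the pieces in one place, here's a minimal sketch that strings the above together (assuming nltk is installed and using the same URL):

# Sketch only: combines the requests status check with the NLTK bigram counting above.
import collections

import requests
from nltk.util import ngrams

url = "http://pythonscraping.com/files/inaugurationSpeech.txt"
response = requests.get(url)
if response.status_code == 200:
    content = response.text
    tokenized = content.split()               # simple whitespace tokenization
    esBigrams = ngrams(tokenized, 2)          # generator of bigram tuples
    esBigramFreq = collections.Counter(esBigrams)
    print(esBigramFreq.most_common(10))       # ten most frequent bigrams

Note that ngrams() returns a generator, so it can only be consumed once; Counter() uses it up here.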
Mar-06-2019, 01:55 AM
Using just your first 3 lines of code instead of my previous code that raised the error gives n-grams with values of 0. Not sure why that is so.
Kaggle is a data science place. Can we consider web scraping a part of data science, then?
Will definitely take a look at NLTK stuff, although I don't think that I should spend too much time on this for now...
And this code:
import nltk, re, string, collections
from nltk.util import ngrams
import requests

response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
if response.status_code == 200:
    content = response.text

tokenized = content.split()
esBigrams = ngrams(tokenized, 2)

gives
Output:
A**** told C**** that E**** knew B**** was a double agent.
CENSORED gave the secret documents to CENSORED.
Phone number found: 515-555-4444
555
(415)
Batman
Batmobile
mobile
555-4444
Batwowowoman
Batwowowowoman
<_sre.SRE_Match object; span=(0, 6), match='HaHaHa'>
None
>>>
The mystery continues... what would happen if I added the real cleanInput and getNgrams functions? Will work on that tomorrow after I study the documentation on nltk.
Mar-06-2019, 03:50 AM
NLTK is great, but can be a bit tricky at first. I use the O'Reilly book 'Natural Language Processing with Python' as a guide, but I'm sure there are better examples on the web now, as I purchased the book back in 2013 and it was published in 2009. If you wish, I'll take a look and see what's available now, and if someone else is reading this post and knows where to look, that would also help.
found this snippet:
import nltk
from nltk.util import ngrams


def word_grams(words, min=1, max=4):
    s = []
    for n in range(min, max):
        for ngram in ngrams(words, n):
            s.append(' '.join(str(i) for i in ngram))
    return s


print(word_grams('one two three four'.split(' ')))
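If I read it right, with the default min=1 and max=4 that should return the unigrams, bigrams, and trigrams of the input, something like:

['one', 'two', 'three', 'four', 'one two', 'two three', 'three four', 'one two three', 'two three four']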
Mar-06-2019, 01:57 PM
Thank you, no need to look for anything newer as I'm not planning to focus that much on this field... at least not for now.
Will check the snippet.
Mar-07-2019, 12:49 AM
import nltk, re, string, collections
from nltk.util import ngrams  # function for making ngrams

with open("http://pythonscraping.com/files/inaugurationSpeech.txt", "r", encoding='latin-1') as file:
    text = file.read()

print(text[0:1000])
Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\ngramss.py", line 7, in <module>
    with open("http://pythonscraping.com/files/inaugurationSpeech.txt", "r", encoding='latin-1') as file:
OSError: [Errno 22] Invalid argument: 'http://pythonscraping.com/files/inaugurationSpeech.txt'
>>>
Any idea why this argument is invalid?
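If I'm not mistaken, open() only accepts local file paths, which is why passing a URL raises that OSError; fetching the text over HTTP first, as suggested earlier in the thread, should avoid it. A rough sketch:

# open() can't read URLs, so fetch the text over HTTP instead
import requests

response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
if response.status_code == 200:
    text = response.text
    print(text[0:1000])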