getting unique values and counting amounts
Mar-03-2019, 01:43 AM
Thank you. Still, don't have a completely clear picture. Why would it be a nested dictionary?
Mar-03-2019, 01:50 AM
Almost every dictionary that I create is nested. This is a generic dictionary function that I wrote for my own use, and I wanted it to be able to display any dictionary nested or not.
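The poster's actual display function isn't quoted in the thread, but a minimal sketch of a recursive "display any dictionary, nested or not" routine along those lines (function and variable names are mine, not the poster's) might look like:

```python
def display_lines(d, indent=0):
    """Recursively render a dictionary (nested or flat) as indented lines.
    Hypothetical sketch -- the poster's real function is not shown."""
    lines = []
    for key, value in d.items():
        if isinstance(value, dict):
            lines.append(" " * indent + str(key) + ":")
            lines.extend(display_lines(value, indent + 2))
        else:
            lines.append(" " * indent + "%s: %s" % (key, value))
    return lines

print("\n".join(display_lines({"a": 1, "b": {"c": 2}})))
```

Returning lines rather than printing directly makes the recursion easy to test and reuse.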
Mar-03-2019, 01:54 AM
So basically we don't really have a nested dictionary in this case, right?
Mar-03-2019, 01:59 AM
Correct.
I use this routine a lot when I'm scraping new sites, because I usually build a dictionary which is actually a partial sitemap of the site I'm scraping. It makes it easier to find my target areas.
Mar-05-2019, 12:17 AM
I'm trying to do something similar here:
I'm trying to do something similar here:

#! python3
# n-gram for an inauguration speech
import requests
from bs4 import BeautifulSoup
import re
import string
import operator

def cleanInput(input):
    input = re.sub('\n', " ", input).lower()
    input = re.sub('\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        else:
            output[ngramTemp] += 1
    return output

content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
print(sortedNGrams)

but get this:

When I remove read() I keep getting different sorts of errors. If it would help, I can put them here. Any idea how to fix this line?
Mar-05-2019, 01:23 AM
This line:

content = str(requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')

I would do this in several steps so that you can check the status of the request. Replace it with:

response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
if response.status_code == 200:
    content = response.text

Also, I don't know what your ultimate goal is, but this looks like something you might want to analyze with NLTK; at least read a bit about what it's capable of doing. See: https://www.nltk.org/ You can install it and play with it a bit. For example, getting n-grams from your content would look something like:

import nltk, re, string, collections
from nltk.util import ngrams

tokenized = content.split()
esBigrams = ngrams(tokenized, 2)

That will get all of the n-grams. Now you can go a step further and get the frequency of each bigram in the corpus:

esBigramFreq = collections.Counter(esBigrams)

Get the 10 most common n-grams in the text:

esBigramFreq.most_common(10)

And lots more. Here's the tutorial for the above code: https://www.kaggle.com/rtatman/tutorial-getting-n-grams
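For a list of tokens, `nltk.util.ngrams(tokens, n)` yields the same tuples as zipping n shifted views of the list, so the counting step above can be sketched with just the standard library (no NLTK install needed to try it):

```python
import collections

def count_ngrams(tokens, n):
    # zip over n shifted views of the token list; for n=2 this is
    # zip(tokens, tokens[1:]) -- the same tuples nltk.util.ngrams yields
    return collections.Counter(zip(*(tokens[i:] for i in range(n))))

tokens = "to be or not to be".split()
freq = count_ngrams(tokens, 2)
print(freq.most_common(1))  # [(('to', 'be'), 2)]
```

For the speech itself you would build `tokens` from `content.split()` as in the reply above.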
Using only your first 3 lines of code instead of my previous code that raised the error gives n-grams with values of 0. Not sure why that is.
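A likely cause, going by the getNgrams loop quoted earlier in the thread: a new n-gram is stored with the value 0 and only incremented on later repeats, so every count is one too low and any n-gram that occurs once stays at 0. A corrected version of just that counting step:

```python
def count(items):
    # original loop stored 0 on first sight and bumped only repeats;
    # here the first occurrence correctly counts as 1
    output = {}
    for item in items:
        if item not in output:
            output[item] = 1
        else:
            output[item] += 1
    return output

print(count(["a", "b", "a"]))  # {'a': 2, 'b': 1}
```

(`dict.get(item, 0) + 1` or `collections.Counter` would do the same job more idiomatically.)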
Kaggle is a data science place. Can we consider web scraping a part of data science, then? Will definitely take a look at the NLTK stuff, although I don't think I should spend too much time on this for now... And this code:

import nltk, re, string, collections
from nltk.util import ngrams
import requests

response = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt")
if response.status_code == 200:
    content = response.text
tokenized = content.split()
esBigrams = ngrams(tokenized, 2)

gives:

The mystery continues... What would happen if I added some real cleanInput and getNgrams functions? Will work on that tomorrow, after I study the nltk documentation.
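The output pasted after "gives" didn't survive the copy, but one guess (purely an assumption on my part): `nltk.util.ngrams` returns a lazy generator, so printing `esBigrams` directly shows a generator object rather than the pairs. A dependency-free illustration of the same behavior:

```python
def bigrams(tokens):
    # lazy, like nltk.util.ngrams(tokens, 2)
    yield from zip(tokens, tokens[1:])

g = bigrams("a b c".split())
print(g)        # a generator object, not the pairs themselves
print(list(g))  # [('a', 'b'), ('b', 'c')]
```

Materializing with `list()`, or feeding the generator straight into `collections.Counter`, makes the pairs visible.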
NLTK is great, but it can be a bit tricky at first. I use the O'Reilly book 'Natural Language Processing with Python' as a guide, but I'm sure there are better examples on the web now, as I purchased the book back in 2013 and it was published in 2009. If you wish, I'll take a look and see what's available now; and if someone else is reading this post and knows where to look, that would also help.
Found this snippet (the original used the Python 2 print statement; shown here with Python 3 syntax):

import nltk
from nltk.util import ngrams

def word_grams(words, min=1, max=4):
    s = []
    for n in range(min, max):
        for ngram in ngrams(words, n):
            s.append(' '.join(str(i) for i in ngram))
    return s

print(word_grams('one two three four'.split(' ')))
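The same idea works without NLTK at all, since an n-gram is just a length-n slice of the token list. A standard-library-only version of that snippet, for comparison:

```python
def word_grams(words, min=1, max=4):
    """All n-grams for n in [min, max), as space-joined strings."""
    s = []
    for n in range(min, max):
        for i in range(len(words) - n + 1):
            s.append(" ".join(words[i:i + n]))
    return s

print(word_grams("one two three four".split()))
```

For the four-word input this yields 9 grams: four unigrams, three bigrams, and two trigrams.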
Mar-06-2019, 01:57 PM
Thank you, no need to look for anything newer, as I'm not planning to focus that much on this field... at least not for now.
Will check the snippet.
Mar-07-2019, 12:49 AM
import nltk, re, string, collections
from nltk.util import ngrams

# function for making ngrams
with open("http://pythonscraping.com/files/inaugurationSpeech.txt", "r", encoding='latin-1') as file:
    text = file.read()
print(text[0:1000])

Any idea why this argument is invalid?
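The short answer is that `open()` expects a filesystem path, not a URL, so the URL is rejected (the exact exception text varies by OS, but it is always an `OSError`). A small sketch of the failure, with the fix being the `requests.get` approach from earlier in the thread:

```python
# open() resolves its argument against the local filesystem, so a URL is
# not a valid path; it raises an OSError (FileNotFoundError on some
# systems, "Invalid argument" on others).
try:
    open("http://pythonscraping.com/files/inaugurationSpeech.txt")
except OSError as exc:
    print(type(exc).__name__)

# To read the file from the web instead, fetch it first, e.g.:
#   import requests
#   text = requests.get("http://pythonscraping.com/files/inaugurationSpeech.txt").text
#   print(text[0:1000])
```

`open()` with a local path only works after the file has been downloaded to disk.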