Identifying items in a csv file that also appear in a Text extract

**nilamo** · Sep-21-2016, 09:21 PM

(Sep-21-2016, 09:11 PM)Jaynorth Wrote: Counter() just returns Counter() in the console when the script is run. country_codes is the name of the csv file that is read into the script, and codes is just a variable that I used to hold the relevant column in the csv file: country_codes['English short name lower case']

I am not matching on the Alpha-2 or Alpha-3 columns in the csv file which uses the 3 letter representation of the country like "CAN" XD

But, why use any sort of Counter() function at all? len() would do the exact same thing, wouldn't it?

>>> text = '''
... Once upon a time, there was the great country of Mexico.  Then there... blah blah blah'''
>>> [word for word in text.split()]
['Once', 'upon', 'a', 'time,', 'there', 'was', 'the', 'great', 'country', 'of', 'Mexico.', 'Then', 'there...', 'blah', 'blah', 'blah']

>>> import re
>>> [word for word in text.split() if re.sub(r'\W', '', word) in codes]
['Mexico.']

>>> len([word for word in text.split() if re.sub(r'\W', '', word) in codes])
1

Jaynorth · Sep-21-2016, 09:26 PM

I used Pandas to extract the text, so it is a DataFrame and not a string. So I cannot use .split(), or can I?

wavic · Sep-21-2016, 09:32 PM

(Sep-21-2016, 09:12 PM)nilamo Wrote:
(Sep-21-2016, 09:05 PM)wavic Wrote:
from collections import Counter

counter = Counter(iterable)

print(counter['item'])
How does this syntax highlighting works? :huh:
Use the python syntax highlighter, not the generic code one. (they're still working out the plugins)
Also, wouldn't your code just always give "0"?
>>> from collections import Counter
>>> cnt = Counter('Green eggs and spam')
>>> cnt['g']
2
>>> cnt['gg']
0
>>> cnt['eggs']
0

It gets the iterable from a CSV file. So if a row is "one,two,three,one,two,three", counter.keys() will return ['one', 'two', 'three'], as it is supposed to. The csv module will split the row into a list.
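
A minimal sketch of the idea above, assuming the row is read with the standard csv module. The `io.StringIO` buffer is only a hypothetical stand-in for a real open file, and the variable names are made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for an open CSV file containing a single row.
csv_file = io.StringIO("one,two,three,one,two,three\n")

# csv.reader splits each line into a list of strings.
row = next(csv.reader(csv_file))

# Counter tallies how often each item occurs in that list.
counter = Counter(row)

print(sorted(counter.keys()))  # ['one', 'three', 'two']
print(counter["one"])          # 2
```

Note that in Python 3, `counter.keys()` returns a view object rather than a plain list, which is why the sketch wraps it in `sorted()` before printing.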

***snippsat*** · (This post was last modified: Sep-21-2016, 09:48 PM by snippsat.)

Quote:But, why use any sort of Counter() function at all? len() would do the exact same thing, wouldn't it?

Because Counter is a better solution and more readable than a list comprehension with a regex inside.

Quote:I used Pandas to extract the text so it is a dataframe and not a string so I cannot use .split() or can I?

You can use all Python syntax with Pandas.
Split the text if you want to count whole words.

>>> from collections import Counter
>>> cnt = Counter('Green eggs and spam spam spam spam'.split())
>>> print(cnt)
Counter({'spam': 4, 'eggs': 1, 'Green': 1, 'and': 1})
>>> print(cnt.most_common(1))
[('spam', 4)]
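
As a rough sketch of how the two suggestions in this thread could be combined: strip punctuation from each word, keep only the words that are known country names, and let Counter do the tallying. The sample `text` and the `country_names` set are hypothetical; in the thread they would come from the text extract and from the country-name column of the csv file.

```python
import string
from collections import Counter

# Hypothetical sample data; in the thread these would come from the
# text extract and from the country-name column of the CSV file.
text = "Mexico borders Belize. Canada trades with Mexico, and Mexico trades back."
country_names = {"Mexico", "Canada", "Belize"}

# Split on whitespace and strip leading/trailing punctuation from each word.
words = [word.strip(string.punctuation) for word in text.split()]

# Count only the words that are known country names.
counts = Counter(word for word in words if word in country_names)

print(counts.most_common(1))  # [('Mexico', 3)]
print(sum(counts.values()))   # 5
```

This only matches single-word names. A name made of several words, such as "United States", would be split apart by `text.split()` and would need a different matching strategy.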

wavic · Sep-21-2016, 09:45 PM

>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]

Jaynorth · Sep-21-2016, 10:11 PM

(Sep-21-2016, 09:45 PM)wavic Wrote:
>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]

This gives the following error: AttributeError: 'Series' object has no attribute 'split'

wavic · Sep-21-2016, 10:32 PM

Hmm! It just splits regular text and removes the punctuation, so you get only the words. There is no Pandas involved here:

In [1]: import string

In [2]: text = "This gives the following error: AttributeError: 'Series' object has no attribute 'split'"

In [3]: [word.strip(string.punctuation) for word in text.split()]
Out[3]: 
['This',
 'gives',
 'the',
 'following',
 'error',
 'AttributeError',
 'Series',
 'object',
 'has',
 'no',
 'attribute',
 'split']

Jaynorth · Sep-21-2016, 10:51 PM

I just checked the Pandas documentation: to split a Pandas Series, use text.str.split(). This converts it to an object Series, but now I get "TypeError: unhashable type: 'list'".
This happens with the codes variable from the country_codes csv, because it is a list.
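
One possible explanation for this error, offered as a sketch rather than a diagnosis of the actual script: `Series.str.split()` turns every element into a list of words, and lists cannot be hashed, so they cannot be counted by Counter or looked up in a set or dict. The `text` Series and `codes` set below are hypothetical stand-ins for the variables in this thread, and the sketch assumes pandas is installed.

```python
import string
from collections import Counter

import pandas as pd

# Hypothetical stand-ins for the thread's variables.
text = pd.Series(["Once upon a time in Mexico.", "Then Canada and Mexico met."])
codes = {"Mexico", "Canada"}

# Each element of `split_rows` is a list of words, not a single string.
split_rows = text.str.split()

# Lists are unhashable, so counting the rows directly raises TypeError.
try:
    Counter(split_rows)
except TypeError as err:
    print("TypeError:", err)

# Flatten the lists into individual words before counting or matching.
words = [w.strip(string.punctuation) for row in split_rows for w in row]
matches = Counter(w for w in words if w in codes)
print(matches)
```

Flattening first means every item handed to Counter is a plain string, which is hashable. If the error comes from a different line in the real script, the same check applies: look for a place where a whole list is used where a single string is expected.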
