Mar-22-2017, 04:56 AM
I'm dealing with Twitter data. I have users in JSON format, and I'm trying to extract a location from each user's location field. Here is some sample data.
Sample data:
"location": "Georgia, USA",
"location": "El Centro, CA",
"location": "Barnaul",
"location": "heaven on earth",
The Problem:
The text in the location field is not in a consistent format and does not follow any standard. If it used ISO country codes, for example, one could easily separate city, state and country, but as it stands there is no clear indication of how to identify the text as a particular location.
For example, the text in the location field follows patterns like these:
1) Country (ex. Canada)
This is a country, but it could be anything else, since it is just text. One can match the text against a list of countries, but what if it's a city?
2) City (ex. Toronto)
Or it can be a city.
3) City, Country (ex. Toronto, Canada)
City and country separated by a comma or a space.
4) City, State (ex. Toronto, Ontario)
City and state separated by a comma or a space.
5) Meaningless text (ex. Worldwide)
Text which is not a city, country or state.
6) Different Language (ex. 广州)
Same patterns as listed above, but in a language other than English, for example Chinese.
7) Abbreviations and ISO codes
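- Sometimes countries are represented by ISO codes, such as CA or CAN for Canada.
- States are abbreviated, such as FL for Florida (a U.S. state).
- Cities sometimes appear only through a subdivision code such as US-MN (strictly the ISO 3166-2 code for the state of Minnesota, where Minneapolis is located).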
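To make the ambiguity concrete, here is a minimal sketch of the naive lookup approach the patterns above suggest. The lookup sets are tiny hypothetical stand-ins (a real attempt would need a full gazetteer), the function names are my own, and it only splits on commas, not spaces. It is not a solution, just an illustration of where the matching breaks down.

```python
import re

# Hypothetical, deliberately tiny lookup sets; a real gazetteer would be far larger.
COUNTRY_NAMES = {"canada", "georgia", "united states"}
COUNTRY_CODES = {"ca", "can", "us", "usa"}
STATE_NAMES = {"ontario", "georgia", "florida", "minnesota"}
STATE_CODES = {"fl", "ca", "ga", "mn"}
CITIES = {"toronto", "barnaul", "el centro", "minneapolis", "广州"}
NOISE = {"worldwide", "heaven on earth"}


def classify_part(part):
    """Return every label whose lookup set contains this piece of text."""
    key = part.strip().casefold()
    labels = []
    if key in COUNTRY_NAMES or key in COUNTRY_CODES:
        labels.append("country")
    if key in STATE_NAMES or key in STATE_CODES:
        labels.append("state")
    if key in CITIES:
        labels.append("city")
    return labels


def classify_location(text):
    """Split a location string on commas and label each piece.

    Returns an empty list for empty or known-meaningless text.
    A piece with more than one label is ambiguous; a piece with
    no labels was not found in any lookup set.
    """
    text = text.strip()
    if not text or text.casefold() in NOISE:
        return []
    parts = [p for p in re.split(r"\s*,\s*", text) if p]
    return [(p, classify_part(p)) for p in parts]


for sample in ["Georgia, USA", "El Centro, CA", "Barnaul", "heaven on earth"]:
    print(sample, "->", classify_location(sample))
```

With these sets, "CA" in "El Centro, CA" comes back labelled as both a country and a state, and "Georgia" does too, which is exactly the problem: the lookup alone cannot tell which reading is intended.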