Why is it so hard to parse text into lat/long coordinates?
It seems like such a simple thing- let your user type in whatever they want for their location (NYC, New York City, New York, NY, etc.), and just turn it into GPS coordinates for your own purposes. There’s even an entire category of tools dedicated to this: gazetteers.
Tools to help you turn unstructured text into latitude and longitude
- Unlock Places- they reference several different gazetteer resources, and provide you with many options. However, they were inaccurate for my main use cases.
- Google Maps API- this one worked really well! Google clearly has a lot of mapping and NLP expertise and puts it to use here. However, they limit you to 2500 API queries per day. And I’m working with a dataset with over 100k places.
- Google Earth- Generally quite accurate. Can do batch processing of a dataset and then export the resulting KML file. However, when I gave it a dataset of 35,000 places, it only returned 6,600 back to me. I have no idea what happened to the other 29,000. Hopefully they’ve just discovered an alternat universe that’s even better than ours (with easier geocoding!).
- Google Fusion Tables- They do a great job geocoding your data. They claim to only do 2500 per day, but they processed through far more than that in my dataset. However, they don’t let you export the lat/long data they’ve created for you. But they are useful later on below for converting from .kml files to .csv files once you already have gps coordinates.
- There are a ton of other options available here: http://en.wikipedia.org/wiki/List_of_geocoding_systems
- Some languages like Ruby and PHP have geocoding libraries built for them.
- Follow my process below- it’s surprisingly straightforward.
My method for turning location strings into geolocation:
- Download all the places data, and have available in a csv file.
- Sort by most active places to make sure we can focus our energies accurately.
- Create a pro membership to BatchGeo. You can cancel at anytime within your first 30 days. Or, you can do this in increments of 250 for free.
- Upload csv file to BatchGeo. The first row must be the header.
- Click on the Validate & Set Options button.
- Set Country to whichever is most accurate- in my case, the data set was worldwide, so I chose International up at the top of this dropdown list.
- Make sure that the ‘address’ field is the location field, even if it’s not an address.
- Click on the ‘Show Advanced Options’ button
- Put the ID field in the ‘Title’ column. This ensures it will appear in your exports. This is the only field that will show up by default using this process. Getting other fields to show up will require custom parsing. However, if you use the ID field, you can easily join this new data back in with the existing data.
- Make Map
- Save & Continue
- The options for saving the map don’t have any bearing on the outcome unless privacy and sharing matter to you.
- On the map page, scroll down to the bottom of the page. It’s not always obvious- use the scroll bar to the right of the page to avoid just scrolling to Antarctica.
- Click the Download KML file link.
- Now you have a file in KML format, with the ID and the new lat/long data! However, the .kml file extension is designed for mapping software, not for analytics or database software. So let’s convert it to a csv. You can manually change the file extension to ‘.xml’, but this only partway works. The much easier approach is below.
- Open Google Fusion Tables. You don’t need to know anything about Fusion Tables. If you haven’t used them before: A. Go to Google Drive B. Click the New button C. At the bottom of this dropdown list, click More D. Click on Connect More Apps E. Connect the Fusion Tables app F. Follow this same process again, and you’ll now find Fusion Table under the ‘More’ dropdown now.
- Select the file (from your computer)
- Wait a minute while it loads.
- File -> Download, then just make sure it saves as a csv.
- Open in your favorite (lightweight) text editor, and do a find/replace for all the xml-like formatting around the actual lat/long numbers.
- Voila! You now have a csv file with IDs and new lat/long coordinates!