In doing the trimming, we wanted to take care not to have our own decisions in some way bias the selection in a manner that might make the resulting TSP easier to solve.
So we used a crude approach that could be carried out with simple computer codes.
No doubt a team of humans would have made a nicer selection, favoring a really cool site for a less attractive one, but we were still left with an amazing sample of US history.
First, we excluded Alaska and Hawaii.
Both states have great historical sites, but including their locations would have resulted in a problem that had an obvious decomposition.
We then set up a computer code to gather the remaining locations, county by county, from the very nice Wikipedia United States National Register of Historic Places listings
The Wikipedia information for each entry includes the latitude and longitude (which are key for us) as well as an ID that we used to verify the entry with the National Park Service data base.
From the Wikipedia pages, we collected a total of 73,807 locations.
We immediately excluded 244 sites, mostly shipwrecks and lighthouses, that could not be reached with Google's walking directions.
We then removed 23,066 sites with "House" in the title and 500 sites with "District" in the title.
Finally, we added the White House, the Capitol, and the Supreme Court Building.
These last three are not actually in the National Register, but, breaking my rule about hand-tuning the data set, I thought they belonged in a US history tour.
I'd like to say that everything was hunky dory at this point.
But well into the solution process, I realized I had made two nasty mistakes.
Mistake number one.
The great cities of Baltimore and Chicago have data sets stored in individual directories on the Wikipedia site, rather than in the usual county structure.
I must apologize to the fans of these two cities for accidentally excluding their historic places from the big US tour.
To try to make up for this, I computed an optimal Baltimore side trip and an optimal Chicago side trip, covering all of their listings from the national register.
Mistake number two.
A number of listings in the data set have the same latitude and longitude.
This comes mainly from two states claiming the same location, such as the Statue of Liberty.
This sadly brings the TSP example down to a total 49,603 distinct points.
In our data sets we continue to use the full count of 50,000, since all of our references are to the initial list we created.
But clearly we can't say we solved a 50,000-point TSP.
Next time we will be more careful!
A good way to view the data is to have a look at a Google Maps drawing giving a marker for each of the 49,603 sites.
By zooming in, you can see the distribution of points around the country.
It is great have the US history context for our TSP challenge, but when it comes to the mathematics, all we need are the latitude and longitude coordinates for the stops.
This is the information that is used to query Google Maps for walking distances and for computing the geometric distance between pairs of historic sites.
Here is a link to a text file with the latitude and longitude (in decimal form) for all 50,000 locations.
In our problem, we assume the distance to walk from location A to location B is the same as the distance to walk back from B back to A.
This is quite reasonable for walking, although for innner-city travel it sometimes does not hold for driving (due to one-way streets).
We also make the assumption that the walking distance between two locations is no shorter than the geometric distance to fly from one site to the other.
To be precise, we define the distance from A to B to be the maximum of the geometric distance between the latitude and longitude coordinates for A and B and the walking distance from A to B provided by Google Maps.
Of course, this entirely what you would expect.
There is no way a walking path can be shorter than the route a crow would fly.
But, in making the assumpution, we can be precise aboout the problem we have solved, since we cannot actually ask Google for the travel distance between all 1,230,204,003 pairs of locations.
In our computation, the geometric distance between locations is specified by an approximation to the great circle distance on the Earth (treating the Earth as a ball).
This is a variation of the TSPLIB GEO-norm, adjusted to handle decimal degrees and scaled to provide the distance in meters rather than in kilometers.
The distance function is called the GEOM-norm;
a computer code to compute the GEOM distance between two points can be found on the World TSP page.
The US50K data set, in TSPLIB format, is given here.
This data set allows the geometric distances to be computed, for example, by the Concorde TSP code.
To fully specify the problem we solved, we need also to report the Google distances we used, since Google's algorithms adjust their routes over time, taking into consideration new construction and better data sets for the road networks.
Here is a gzipped list of the 1,105,953 pairs of points for which we queried Google Maps.
Each line gives the indices of two points, using the order specified in the TSPLIB data set, where the points are numbered from 0 up to 49,999.
Following the pair of numbers, on each line, is the reported walking distance in meters.
To reproduce the US50K problem, use the maximum of the GEOM distance and the walking distance for each of the 1,105,953 pairs in the file, and use the GEOM distance for each of the remaining 1,248,869,047 pairs.
We did not need to query Google for these additional pairs, since our final tour does not use any of these direct legs.
(Since the defined distance is the maximum, having the Google distance can only increase the length of any tour that uses one of the 1,248,869,047 pairs.)
Please be warned that the travel data was obtained in late 2015 and early 2016.
If you are planning a trip, be sure to check out Google Maps for up-to-date distances and directions.