Emacsen's Blog

Why Imports in OpenStreetMap Are Controversial

OpenStreetMap's goal is to map the entire world, so one might assume that anything that gives the project a leg up would be welcome. What many potential importers find instead is that the OSM community, and its more senior members in particular, is hostile to imports. Those strong feelings can be difficult to understand, so I'm going to look at the issues around imports in the community and why they're so controversial.

Before I go on, I need to make clear that the views in this post are my own and don't reflect the views of the DWG, the OSMF, OSM US or any other organization.

OSM's Construction

OpenStreetMap is a single unified geographic dataset. As a consequence of its design, it's far more integrated than even professional GIS maps are. For example, you can't simply remove all the road objects in OSM, because those road objects might be connected to other objects, such as political boundaries or natural features. Most professional GIS maps use a system of layers which are placed on top of one another. In contrast, OSM maps are more like a complex woven material, with geographic features intertwined with each other.

This makes some operations in OpenStreetMap very easy. If a political boundary runs along a river, for example, then as the river becomes more detailed, so does the political boundary. At the same time, it poses a set of challenges for imports.
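The river and boundary case can be sketched with a toy model. This is only an illustration, assuming a heavily simplified version of OSM's nodes, ways and relations; the class names, ids and coordinates are made up for the example and this is not the OSM API.

```python
from dataclasses import dataclass

@dataclass
class Way:
    node_ids: list  # ordered references to node ids, not copies of coordinates

@dataclass
class Relation:
    way_ids: list   # ordered references to way ids

# Hypothetical data: one river way, and one boundary relation that reuses it.
nodes = {1: (0.0, 0.0), 2: (2.0, 0.0)}
ways = {10: Way([1, 2])}            # the river
relations = {100: Relation([10])}   # a boundary that follows the river

def relation_geometry(rel_id):
    """Resolve a relation to coordinates by following its references."""
    coords = []
    for way_id in relations[rel_id].way_ids:
        coords.extend(nodes[n] for n in ways[way_id].node_ids)
    return coords

def relations_using_way(way_id):
    """List the relations that would be affected by changing or deleting a way."""
    return [rid for rid, rel in relations.items() if way_id in rel.way_ids]

before = relation_geometry(100)

# A mapper adds a bend to the river...
nodes[3] = (1.0, 0.4)
ways[10].node_ids.insert(1, 3)

# ...and the boundary picks up the extra detail without being edited itself.
after = relation_geometry(100)
dependents = relations_using_way(10)
```

The same references that make the boundary follow the river are what make removal hard: deleting way 10 here would leave relation 100 pointing at nothing.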

Licenses

The most common problem with a data import into OpenStreetMap is a licensing conflict. In an ideal world, government datasets would be released into the public domain. Unfortunately, this is the exception rather than the norm. In the United States, most states and municipalities do not release their geographic datasets to the public without restrictive licenses and expensive license fees.

Some municipalities are trying to make their data more available but do not use standard licenses. Instead, most US municipalities create their own licenses from scratch, oftentimes with conflicting terms. It's not uncommon for a municipal website to say that the data is in the public domain in one part of the website, but then another part of the website will say that the data is under copyright and may not be used or distributed without explicit permission.

The situation in the US mirrors much of the rest of the world, where national or local governments struggle between "openness" and "control".

The result for OpenStreetMap is that clarifying the license situation for a dataset is difficult and sometimes impossible. Because OSM data is so heavily integrated, removing imported data afterwards is nearly impossible, so the project needs to be extremely careful beforehand about the license situation for any imported data.

Official vs Authoritative

The second challenge of imports is the issue of official vs authoritative data. Most people assume that if a dataset is official, the data must be of extremely high quality. This can be true, but more often than not, the official dataset is not what one would expect.

When a government creates or maintains a dataset, it does so with a specific purpose in mind. For example, in the US TIGER dataset, the purpose of the data is to provide a mapping of roads in order for census employees to know where to look for houses. The TIGER dataset needs to be complete enough for the census takers, but that is not the same as being a comprehensive road network map. For example, if a house exists at the end of the road, it does not matter to a census taker if the road has a slight bend in it or not.

Similarly, other datasets are often created to meet a very specific need. In order to meet that need, a local government will often outsource the task of surveying, or will hire low-wage employees to do the data collection. The results of the survey need to be consistent, and governments often provide exhaustive procedures and checklists, but only for the specific objective they're trying to meet.

This difference in mapping objectives often comes out in address data. A government dataset may provide addresses as points on a map, but it's often unclear what these points correspond to. Do they correspond to the geometric center of a building, or to an entrance? If they correspond to an entrance, what happens in cases where the building has multiple entrances?

In practice we find that such datasets are often complete in that they show all the addresses, but are inconsistent in where the address points are placed across buildings.
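One rough way to see this inconsistency in a dataset is to check, for each address point, whether it sits nearer the building's centroid or nearer its outline. The sketch below is only an illustration: it assumes planar coordinates (for example, projected meters), simple non-degenerate polygons, and made-up sample data. The function names are hypothetical.

```python
import math
from collections import Counter

def edges(poly):
    """Pair each vertex with the next one, closing the ring."""
    return zip(poly, poly[1:] + poly[:1])

def centroid(poly):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in edges(poly):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)

def dist_to_segment(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp to the segment's endpoints
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def placement(point, poly):
    """Label a point 'center' or 'edge' by whichever it is closer to."""
    to_center = math.dist(point, centroid(poly))
    to_edge = min(dist_to_segment(point, a, b) for a, b in edges(poly))
    return "center" if to_center < to_edge else "edge"

# Hypothetical sample: two identical 10 x 10 buildings, two address points.
building = [(0, 0), (10, 0), (10, 10), (0, 10)]
address_points = [(5.0, 5.0), (5.0, 0.5)]
summary = Counter(placement(p, building) for p in address_points)
```

A summary that is split between "center" and "edge" suggests the source did not follow a single placement rule, which is worth knowing before deciding how, or whether, to attach the addresses to buildings.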

This issue of accuracy and consistency becomes more pronounced with larger, national datasets. A national dataset often consists of a collection of regional datasets, each of which has been collected by different individuals and organizations, leading to inconsistency. Paradoxically, the national datasets are the ones that OpenStreetMap members are most interested in, since they can provide the most information.

Doing it Right

Once the issues of license and data quality are addressed, there are questions of how to get the data into OSM. As mentioned earlier, one can't simply add external data onto OSM; the data needs to be integrated.

This integration consists of many highly technical, highly detail-oriented steps. Every new importer believes that the integration step will be easy, but the fact is that the process is difficult and tedious.

OpenStreetMap is full of very badly imported data, and suffers from it to this day.

Corrections and Updates

Unfortunately, even after nearly a decade, no one has found an ideal solution to the problem of keeping imported data corrected and up to date. The result is usually that data imported into OSM never gets updated, and the imported data in OSM becomes wrong very quickly.

When considering updates, as discussed previously, we cannot assume that an administrative database is correct. Ideally we want OSMers to map the areas themselves, and it's inevitable that they will do so, leading to a conflict between the two datasets.

OSMers have oscillated between trying to update data based on government datasets and always assuming that local changes are correct. Neither of these techniques will work in every case. Instead, a mapper will need to check the conflict manually, preferably by manually surveying the area.
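The manual check can at least be narrowed down to the features that actually disagree. Here is a minimal sketch of a three-way comparison, assuming we still have the tags as originally imported, the tags currently in OSM, and the tags in the latest government release. The function name, labels and sample tags are all hypothetical.

```python
def triage(imported, current, source):
    """Classify one feature by comparing its tags three ways.

    imported: tags as they were when first imported
    current:  tags in OSM today
    source:   tags in the latest external release
    """
    local_changed = current != imported
    source_changed = source != imported
    if local_changed and source_changed and current != source:
        # Both sides moved, in different directions: a person has to decide.
        return "conflict"
    if source_changed and not local_changed:
        return "source changed only"
    if local_changed and not source_changed:
        return "local edit only"
    return "no action"

# Hypothetical feature: a mapper fixed the name, the agency changed the speed.
imported = {"name": "Main St", "maxspeed": "30"}
current = {"name": "Main Street", "maxspeed": "30"}
source = {"name": "Main St", "maxspeed": "50"}
result = triage(imported, current, source)
```

This only sorts the work; it does not say who is right. Even a "source changed only" result is a candidate for review rather than an automatic overwrite, since the external dataset can't be assumed to be correct.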

This process of updating is slow and tedious and has only happened in a handful of cases.

Community Helping or Harming

Within OpenStreetMap there is a debate over whether imports help or harm the OpenStreetMap community.

OpenStreetMap works because its members are constantly adding and correcting information in the system. If there were no community members collecting this data, the map would quickly suffer bitrot, where old data sticks around and is not replaced with current, up-to-date geographic information.

Because of that, some members believe that efforts should be focused not on bringing data in from external sources, but on growing the community of mappers. Others believe that "seed data" provides a starting point for more mapping, especially in places that have not been very active in OpenStreetMap previously.

I'm not going to rehash these arguments here, but I will say that the role of imports in OpenStreetMap remains contentious, and an area of research that I would like to see explored is the rate of update between manually surveyed and imported data.

Conclusion

OpenStreetMap only works because of our community of many thousands of dedicated mappers who spend their time updating our map. Whatever the role of imports today or in the future, they can never be a substitute for on-the-ground mapping, which needs to remain a cornerstone of our efforts in OpenStreetMap.