Emacsen's Blog

The Maintenance of Imported Data in OpenStreetMap

Some of the feedback about my previous import post concerned my contention that imported data is harder to maintain. This conclusion is based on years of observation. In this post, I'll explain why I believe imported data is not the same as manually mapped data in terms of maintenance.

The standard disclaimer applies here: I am speaking for myself and not representing any other group. In addition, I want to say that this post is not meant to discourage imports in general, but to argue that if OpenStreetMap is to have imports, we need to have comprehensive discussions about the issues around them.

Adding vs Correcting

The most common issue cited for why imported data is harder to work with is that it's easier to fill in an empty map. This often leads to the debate about completeness vs accuracy: in other words, is it better to have a map that is full of partially accurate data, or one that is only partially filled but with completely accurate data? This debate misses the real issue, which is the relative difficulty of adding OpenStreetMap data versus correcting it.

On a map that is empty, it's easy to see what needs to be added. If a road or building is not present, it can be added. We see this pattern play out in OSM as larger, more prominent roads get filled in first, followed by secondary roads, and finally residential roads, buildings, POIs, etc.

The more difficult task is determining what data needs correction on a map that is visually complete. Looking at this map of Washington, DC, we see roads, buildings, and a variety of POIs. This map may be out of date: the speed limits on the roads may have changed, a building may have been torn down, or a store may have changed names.

It's not possible to determine what information needs correcting simply by looking at the map, nor is it always possible by looking at aerial imagery. Updating the map is far more difficult than creating it. Either the mapper must know the OSM data is wrong and re-survey the area, or they must work from newly collected data, which must then be reconciled with the old data. For example, we might know that a store is present on a particular street, but to be accurate, we must know not only that the new store exists, but that it replaced the old store in the same location.

When data is imported, the map appears complete, and the burden of the community effort shifts from collection to maintenance, a task most mappers are less familiar with. In addition, correcting data does not carry the same intrinsic motivation as first-time collection.

Complex Objects

The second issue that imports sometimes bring about is complex objects that are difficult or impossible to fix.

In this image, we see a section of New Jersey in OpenStreetMap, with the patchwork of color representing landuse data.

When we zoom in, we see that one of these objects is a multipolygon relation, an object composed of several other objects. This one object references five other objects. In addition, the outer boundary of this woods shares geometry with roughly ten other landuse objects.

If someone were to try to update the landuse data, they might unintentionally be affecting up to a half dozen other objects at the same time, meaning that a simple update becomes a difficult and time-consuming process of ungluing the multipolygons, splitting them, and trying to reconstruct them.
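To make that ripple effect concrete, here is a minimal Python sketch of the data model involved. The IDs, tags, and membership lists are invented for illustration, but the structure mirrors how OSM multipolygon relations reference shared ways:

```python
# Minimal sketch (invented IDs and tags, not real OSM tooling): ways are
# lists of node IDs, and multipolygon relations reference ways as members,
# so two landuse areas can share a single boundary way.

ways = {
    101: [1, 2, 3, 4],  # boundary way shared by both landuse areas below
    102: [4, 5, 6, 1],
    103: [3, 7, 8, 4],
}

relations = {
    201: {"tags": {"landuse": "forest"}, "members": [101, 102]},
    202: {"tags": {"landuse": "meadow"}, "members": [101, 103]},
}

def parents_of_way(way_id):
    """Every relation whose geometry changes if this one way is edited."""
    return [rid for rid, rel in relations.items() if way_id in rel["members"]]

# Editing the shared way 101 silently alters both landuse areas:
print(parents_of_way(101))  # -> [201, 202]
```

Deleting or reshaping way 101 in an editor means first deciding what should happen to every relation that references it, which is exactly the unglue-split-reconstruct work described above.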

For an experienced mapper, this is a tedious, time-consuming process. For a new mapper, it is a daunting and possibly confusing challenge. Most mappers simply avoid the work altogether, and the data is never updated.

Unverifiable Objects and External Unique Identifiers

The third, and in my view most common, problem we encounter with imported data is the issue of external identifiers in the imported dataset.

Let's start with a simple example of the problem in action: New York City bike racks.

As you can see, there are two bike racks on this street according to the data.

If one of those racks were found to be missing or in the wrong location, it would need to be removed. The problem is that it's not obvious by looking at a bike rack which object in the data it corresponds to. What, then, is the solution?
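To illustrate the ambiguity, here is a hypothetical sketch of the two racks as tagged objects. The tag names and IDs are invented for the example (the actual import used different tags), but the shape of the problem is the same:

```python
# Hypothetical tags and IDs: the only difference between the two OSM
# objects is an external identifier that is not visible on the rack itself.

rack_a = {"amenity": "bicycle_parking", "capacity": "2", "cityracks:rack_id": "10045"}
rack_b = {"amenity": "bicycle_parking", "capacity": "2", "cityracks:rack_id": "10046"}

def strip_external(tags):
    """Drop the external identifier, i.e. everything a surveyor can't see."""
    return {k: v for k, v in tags.items() if not k.startswith("cityracks:")}

# Without the external ID the two objects are identical, so a surveyor who
# finds one rack missing cannot tell which of the two to delete.
print(strip_external(rack_a) == strip_external(rack_b))  # -> True
```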

Taking another example, we see this building footprint from New York City:

Every imported building in New York has a Building Identification Number, or BIN. In OpenStreetMap we've separated out the garage from the building, whereas the city hasn't. When merging the city data with data from local mappers, do we apply the BIN to the garage? To the main building? To neither?

A more common example is data imported from GNIS. GNIS data was imported many years ago and has many known issues associated with it, including, but not limited to, the fact that many of the GNIS points are far from the actual location of the object, in some cases half a mile away.
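One way a mapper or reviewer might sanity-check such points is to compare the imported coordinates against a surveyed location. The following is a rough sketch; the coordinates and the distance threshold are invented for illustration:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical coordinates: an imported GNIS point vs a surveyed location.
gnis_point = (38.8977, -77.0365)
surveyed = (38.8920, -77.0300)

if haversine_m(*gnis_point, *surveyed) > 400:  # roughly a quarter mile
    print("GNIS point is far from the surveyed location; review before trusting it")
```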

Taking a look at this node: just a few weeks ago, it had a GNIS datapoint for a school. Korzun, a very active mapper, visited the location himself and found that it was a drug rehab center, but he was too concerned about the integrity of the GNIS import to remove the point, even though he knew it to be inaccurate.

If an experienced mapper like him is confused, imagine the confusion of less experienced mappers.

The end result in all of these situations is out-of-date or entirely inaccurate data in OpenStreetMap.

To address these issues, I have three practical suggestions.

  1. We should reduce or eliminate the use of external tags, including identifiers, during imports

For years, importers have been bringing in external data along with the imported data. This data might include collection dates, unique identifiers, etc.

As part of the import review process, OSM has generally been reducing the import of much of this data (for example, collection dates), but we continue to import external identifiers. The idea behind including external identifiers is the hope that at some future time, we'll be able to perform an update based on a new dataset and use the external identifier as part of that process.

Unfortunately, while this possibility has been discussed a number of times, in the nearly ten years of the project, no one has successfully made this happen. The reason is that even if an external identifier is consistent across revisions (which is not always the case), OSM activity does not lend itself well to using the external ID as the only key in the merge process.

OSMers may add features that are not present on the map between imports, or may decide that a feature with a single identifier in the external dataset should be represented in OSM by more than one object (e.g. splitting a road, or separating two buildings connected by a skybridge). It's even possible that a mapper will accidentally modify a tag they do not understand. This was quite common when bot-mode expanded names across the US: many times it was unable to fix a road simply because the TIGER tags had been edited by a mapper who did not understand them.

Because of this, a merge must take other factors into consideration, including the feature name and location. While slower, these methods prove superior, leading back to the original question: why use the unique identifier in the first place if it creates confusion (leading to stale data) and does not offer benefits in the long run?
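Here is a sketch of what such a merge might look like in practice, with the external ID demoted to a hint that must be confirmed by name and proximity. All function names, fields, and thresholds are assumptions for illustration, not any existing conflation tool:

```python
import math

def distance_m(a, b):
    """Rough planar distance in meters between two features with lat/lon."""
    dy = (a["lat"] - b["lat"]) * 111_320
    dx = (a["lon"] - b["lon"]) * 111_320 * math.cos(math.radians(a["lat"]))
    return math.hypot(dx, dy)

def conflate(new_feature, osm_features, max_dist_m=100):
    """Match a feature from a new dataset release against existing OSM objects."""
    # 1. Treat the external ID as a hint, not as truth: accept it only when
    #    the name still matches and the object hasn't moved far.
    for osm in osm_features:
        if osm.get("ext_id") == new_feature.get("ext_id"):
            if (osm["name"] == new_feature["name"]
                    and distance_m(osm, new_feature) <= max_dist_m):
                return osm
            break  # the ID matched, but the object has diverged; fall back

    # 2. Fall back to name plus proximity, which survives split ways,
    #    separated buildings, and IDs mangled by well-meaning edits.
    candidates = [o for o in osm_features
                  if o["name"] == new_feature["name"]
                  and distance_m(o, new_feature) <= max_dist_m]
    return min(candidates, key=lambda o: distance_m(o, new_feature), default=None)
```

Note that the name-and-location fallback does all the real work here; the external ID only saves a little time when nothing has changed, which is the weakest possible argument for carrying it in the database forever.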

  2. Imported data must always be imported by hand, and not by script

When imports began, the process for getting external data into OSM was to convert the data to the OSM format and simply upload it. As we discussed earlier, it is easier to add good data than to correct bad data, so we should take advantage of this in future imports.

We have tools to make this easier, including the OSM Tasking Manager, which is being used in such an import now. Updating OSM this way not only gives an import more review, but can also be used as a means of engaging users, as was shown in the Seattle import.

  3. An import proposal must be accompanied by a means of updating the data in the future

OpenStreetMap is only valuable if the data in it is not only complete, but current, and we've seen throughout OSM that data from imports is rarely updated. We need to change this, and the best way to do so would be to require that any import specifically address future updates.

If possible, that process will involve not only conflating the data from the external dataset with OSM, but also keeping OSM contributors engaged in OSM editing.

Conclusion

Imports have their place in OSM, but they are not without their perils. I've tried to outline specifically how the past and current import process serves neither data consumers nor the OSM community in general, and I've provided concrete suggestions for moving forward. I hope that we can continue to make OSM not only the freest, but the best geographic dataset in the world.