How we clean map data: Source Matching

2 min read · Aug 21, 2020

Collecting data from open sources, cleaning, fixing and mixing it is easy. You just need to practice for 10,000 hours.

Sao Paulo (OSM Buildings)

Our biggest source of building data is OpenStreetMap. OSM has a number of unique selling points:

  • Strong global coverage
  • Highest number of common attributes per building
  • Good accuracy due to ground truth mapping, or mapping from high-res imagery
  • Updated every minute

OSM’s 400 million buildings (as of July 2020) create a strong base for the subsequent processing work that we do.

Fixing AI geometry (Tanzania, Microsoft)

Our next biggest source of data is Microsoft US Buildings (2018). It’s our first source that involves AI. As is typical for this technology, false positives and invalid polygons are prevalent. We clean these up using in-house technology.
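The in-house cleanup is not described here, so the following is only a minimal sketch of the kind of sanity check that can catch some invalid polygons. It assumes coordinates in a projected, metre-based system; `is_plausible_building` and the `min_area` threshold are hypothetical names and values chosen for illustration. It flags degenerate and tiny rings only and does not detect self-intersections.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def ring_area(ring: List[Point]) -> float:
    """Signed area of a polygon ring via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_plausible_building(ring: List[Point], min_area: float = 4.0) -> bool:
    """Reject rings that are degenerate or too small to be a building.

    min_area is an assumed threshold in square metres, not a value
    taken from the article.
    """
    # Drop a repeated closing vertex if the ring is explicitly closed.
    if len(ring) > 1 and ring[0] == ring[-1]:
        ring = ring[:-1]
    # A polygon needs at least three distinct vertices.
    if len(set(ring)) < 3:
        return False
    return abs(ring_area(ring)) >= min_area
```

A 10 m by 10 m square passes this check, while a sliver of a fraction of a square metre, a typical false positive shape, does not.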

But we don’t stop here. There are far more data sets worldwide that we are integrating: 10 million buildings from the Netherlands, 20 million from the UK, 5 million from Estonia… and many more.

Creating a single data set from these and combining it with OpenStreetMap requires more than plain copying. All buildings have to be checked for duplication. This means a duplicate test of 260 million additional buildings against OSM’s data.
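To give a feel for what such a duplicate test involves, here is a simplified sketch. It is not our production code: it compares axis-aligned bounding boxes instead of real footprints, and the grid cell size and the overlap threshold are assumed values. The grid index means each candidate is only compared with base buildings in nearby cells, which is what keeps a test over hundreds of millions of objects tractable.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    inter = w * h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def cells(box: Box, size: float) -> Iterator[Tuple[int, int]]:
    """Yield every grid cell that the box touches."""
    for cx in range(int(box[0] // size), int(box[2] // size) + 1):
        for cy in range(int(box[1] // size), int(box[3] // size) + 1):
            yield (cx, cy)


def find_non_duplicates(
    base: List[Box],
    candidates: List[Box],
    cell_size: float = 100.0,  # assumed grid size in metres
    threshold: float = 0.5,    # assumed overlap that counts as a duplicate
) -> List[Box]:
    """Return the candidates that do not duplicate any base building."""
    grid: Dict[Tuple[int, int], List[Box]] = defaultdict(list)
    for box in base:
        for cell in cells(box, cell_size):
            grid[cell].append(box)

    kept = []
    for cand in candidates:
        # Only compare against base boxes sharing at least one grid cell.
        nearby = {b for cell in cells(cand, cell_size) for b in grid.get(cell, [])}
        if all(iou(cand, b) < threshold for b in nearby):
            kept.append(cand)
    return kept
```

Here `base` would hold the OSM buildings and `candidates` the buildings from another source; only the candidates that survive the test are added.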

Orlando, FL, mixing OSM (green) and alternatives (blue)

For multiple sources in a single region, we prioritise according to quality and capture date. OpenStreetMap is our highest ranking source.
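As a rough illustration, this prioritisation can be expressed as a sort key. The source names and their ranks below are made up for the example, apart from OSM ranking first, and applying quality before capture date is an assumption about the order of the two criteria.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Hypothetical ranking: lower number means higher priority.
SOURCE_RANK = {"osm": 0, "national_cadastre": 1, "ai_footprints": 2}


@dataclass
class Candidate:
    source: str
    captured: date


def pick_best(candidates: List[Candidate]) -> Candidate:
    """Pick the best-ranked source; break ties with the newest capture date."""
    return min(
        candidates,
        key=lambda c: (
            SOURCE_RANK.get(c.source, len(SOURCE_RANK)),  # unknown sources rank last
            -c.captured.toordinal(),                      # newer dates sort first
        ),
    )
```

With this key, an OSM building captured in 2015 still wins over an AI-derived footprint from 2020, while two objects from the same source are decided by capture date.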

Further passes merge attributes. Imagine a building from OSM that has no height value attached. If an overlapping object from another source has one, we take the height information from there.
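A minimal sketch of such a merge pass, assuming each building is a plain dictionary of attributes and that the overlapping objects have already been found and ordered by priority. The function name and the attribute keys are hypothetical.

```python
from typing import Dict, List


def merge_attributes(primary: Dict[str, object], overlapping: List[Dict[str, object]]) -> Dict[str, object]:
    """Fill attributes missing on the primary building from overlapping objects.

    Values already present on the primary building are never overwritten.
    """
    merged = dict(primary)
    for other in overlapping:
        for key, value in other.items():
            if merged.get(key) is None and value is not None:
                merged[key] = value
    return merged
```

For example, merging `{"height": None, "levels": 3}` with an overlapping `{"height": 12.5, "levels": 4}` fills in the height and keeps the original number of levels.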

What about tabular data with point-based information, or bitmap-based sources like terrain models? We do this too. Follow us on Medium or say hello via our website to find out more.

To experience the global coverage of 3dbuildings’ data, try out the map on our website.

Thank you for reading!