A guest post from Friedrich Lindenberg proposing some common fallacies of data standardization projects.
Developing open data standards is all the rage. IATI, EITI, OCDS, GTFS, XBRL, SDMX, BDP, HDX – if your sector doesn’t have a cryptic-sounding data initiative yet, it probably will soon.
In fact, chances are that you’re drawing one up right now (I am). In that case, here’s a list of things you may believe about your data standard. They are probably not true:
- Policy and tech people on your team mean the same thing when they say ‘standard’. There is a magical and unbreakable bond between conventions for government policy and column naming schemes.
- Different systems of government will produce data that should be expressed in the same format. Everybody’s mechanism for debating and making laws, or for handing out public contracts, or for managing public funds is basically the same, right?
- Standards are tools for publishing data. Use cases can be derived from the data structure available in your in-house database. For end users, put ‘researchers, journalists, NGOs’. Never put your own name there; your job is to empower others.
- Many people will develop tools and platforms to handle your data. You will not be stuck having to pay for your own ecosystem for the next fifteen years. The tools will actually work with data published by different sources.
- The economics of standardization always work out. This is true even when, to date, nobody has been using the data. They will be able to do that more effectively now. When evaluating gains, think about Nairobi startups, not your own organisation.
- Your committee is the centre of the known universe. It is your duty to specify what countries exist in the world, what currencies they use and what constitutes a company. Dublin Core is the maximum level of possible standards re-use.
- A standard is the best way of publishing data. Having a centralized API that actually works and has data quality assurance built in would be some sort of tech imperialism. The coordination cost of spreading data all over the web and collecting it upon use is lower.
- All data analysis problems are global. You need a standard before you can derive knowledge from data. As your data scales, the analytical questions people are working on will begin to apply to vastly different contexts.
Have I missed any?