Is Open Data At Risk From Poor Data Quality?
Open data is a concept that’s new to many people in the UK. The idea is fairly straightforward. When the public sector opens up the records it uses, the British public can see how their data is being used. The idea was introduced by the coalition government in May 2010, but it has yet to gain traction, and the British public have not yet had the opportunity to scrutinise government spending.
In theory, open data should make the public sector more efficient and effective, and it should give British citizens more confidence in the way their tax money is being spent. Local governments are supposed to publish records for all spending over £500, and we should all be able to see figures on crime, civil servant salaries and government contracts.
But in order to inspire confidence, people need to see two things. They must know that data is being published in the manner it should be published, and they need to know that data is accurate.
According to the press, data quality is becoming a serious risk that could derail the open data project.
What’s Wrong With Open Data?
Computer Weekly analysed 50 different data releases that have been issued since November 2010. These data releases included 42 million records compiled from 7,500 different spreadsheets.
The resulting dataset was analysed by Spend Network. It stated that the quality of the data in these releases was so poor that the public would have no chance of understanding it unless they had “advanced computer programming skills”.
Computer Weekly specifically mentioned “dirty data” that is hampering the entire project. Some government departments deny this, but the evidence points to a dataset with a low level of data quality.
Data Problems Identified
The analysis identified various problems with the data that contributed to data quality issues:
- Data is being bundled and released in a raw format which has not been processed, making it very difficult for the public to use it
- Encoding is introducing characters that cause data quality problems. Different encodings are used inconsistently, including ASCII, ISO-8859 and Windows formats, amongst others. Official requirements to use UTF-8 encoding are being ignored by many departments, including the Cabinet Office, which did not publish a single release in the format it recommended
- Software such as Microsoft Excel is producing non-compliant UTF-8 exports, introducing flaws into the data where they need not exist
- Formatting of common information, such as date fields, varied between records, making it impossible to cross-reference; this may be because the guidelines for correct data formatting were changed three times over the period in question
- Over the same period, 22 different filename conventions were in use
- Fields were continually added or removed on 20 separate occasions, meaning that headings and labels do not match
- Commas appear inside records, causing problems with processing and analysis, since many systems use commas to separate data fields
- Analysis suggested a large amount of human error, causing dirty data which was not cleaned prior to release
- Guidelines issued by HM Treasury made assumptions about data distribution, including the use of Microsoft products
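Two of the problems listed above, mixed encodings and stray commas, can be illustrated with a minimal Python sketch. It is not the method used in the analysis. The list of candidate encodings and the function names are assumptions made for illustration, and a real pipeline would confirm the encoding with each publisher rather than guess.

```python
import csv
import io

# Assumed fallback order, for illustration only.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding_used), trying each candidate encoding in turn."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte, so this is only reached if the list is changed.
    raise ValueError("no candidate encoding could decode the input")


def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialise rows as CSV; the csv module quotes any field containing a comma."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def csv_to_rows(text: str) -> list[list[str]]:
    """Parse CSV text back into rows, honouring quoted fields."""
    return list(csv.reader(io.StringIO(text)))
```

Once text has been decoded, it can be written back out as UTF-8 with `text.encode("utf-8")`. Writing through a CSV library rather than joining fields by hand means a supplier name that contains a comma stays in one column when the file is read again.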
The raw exports produced for the open data project apparently have many data quality problems. These common issues – encoding problems, date formats and human error – can be resolved with data quality software. But when released raw and uncorrected, the data is not fit for purpose, because it cannot be turned into information.
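As a rough illustration of the kind of clean-up meant here, the sketch below normalises dates written in several styles to a single ISO 8601 form. The list of accepted formats is an assumption, since the article does not say which formats appeared in the releases, and day-first order is assumed for slash-separated dates because the data is British.

```python
from datetime import datetime
from typing import Optional

# Hypothetical list of formats; extend it to match what the releases contain.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%d %B %Y")


def normalise_date(value: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = value.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning `None` instead of guessing keeps unparseable values visible, so they can be sent back for correction rather than silently altered.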
Without this transition from data to information, it is impossible for humans to understand what they are looking at and interpret the true meaning of data. According to the experts in this case, cross-referencing data was so difficult that a professional analyst could not make sense of the figures, and one went back 8 years before making a successful match.
The Ministry of Justice denied that the data it issued was inaccurate, and said its data was reliable, accurate and open to scrutiny. And it may be true that some departments are making progress towards clean data. But without a cohesive data quality strategy, the whole dataset is still flawed.
Can Open Data Be Rescued?
In some camps, there is a suggestion that data quality has been intentionally sidelined because government bodies do not want their affairs open to scrutiny. Some departments have declined to issue data for various reasons. But taking things at face value, it is likely that data quality is simply an alien concept for many people.
Additionally, topics such as encoding may be foreign to the administrators tasked with exporting the data for distribution, hence the UTF-8 requirement being ignored so routinely. In some cases, staff apparently followed guidelines, but Microsoft Excel’s UTF-8 format was non-standard, and nobody realised.
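The article does not say exactly how the Excel output departed from the standard. One quirk often seen in UTF-8 files written by Windows software is a leading byte order mark (BOM), which can end up glued to the first column heading. Assuming that is the kind of flaw involved, a short Python sketch shows how it can be detected and removed. The function names are illustrative only.

```python
import codecs


def has_bom(raw: bytes) -> bool:
    """Report whether the bytes start with the UTF-8 byte order mark."""
    return raw.startswith(codecs.BOM_UTF8)


def read_utf8_text(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # The "utf-8-sig" codec accepts input with or without a BOM.
    return raw.decode("utf-8-sig")
```

A check like this can run automatically before publication, so nobody exporting the data needs to know what a BOM is.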
It follows that data quality is seen as someone else’s problem – a common misconception in businesses worldwide. The Cabinet Office suggested that human error is the largest cause of data quality problems, yet all departments must take ownership to ensure clean and healthy data is exported.
If the government is to succeed with its open data programme, it clearly needs to invest more money in data quality. Specifically:
- Investing in standards that make data consistent
- Ensuring the required encoding is used consistently and checked before release
- Ensuring duplicate data is always removed during frequent data quality checks
- Removing dependency on software that produces inconsistent or proprietary results
- Ensuring governance that avoids confusion
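Some of these checks are simple to automate. The sketch below is a minimal example, not a description of any government process: it confirms that a release uses an agreed set of column headings and drops rows that repeat an earlier row once case and surrounding spaces are ignored. The column names and the matching rule are assumptions made for illustration.

```python
import csv
import io


def check_release(text: str, expected_header: list[str]) -> dict:
    """Check the header row and remove duplicate rows from CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    seen = set()
    unique_rows = []
    duplicate_count = 0
    for row in reader:
        # Assumed matching rule: ignore case and leading or trailing spaces.
        key = tuple(cell.strip().lower() for cell in row)
        if key in seen:
            duplicate_count += 1
            continue
        seen.add(key)
        unique_rows.append(row)
    return {
        "header_ok": header == expected_header,
        "duplicate_count": duplicate_count,
        "rows": unique_rows,
    }
```

Run against every file before release, a check of this kind would flag added or removed fields as soon as they appear, rather than leaving analysts to discover them later.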
The government has taken the right step by implementing open standards. It has also decided to adopt Open Document Format (ODF) in preference to Microsoft’s formats, a sensible move that should make high quality data easier to publish.
The Importance of Data Quality
The Cabinet Office says that transparency is of utmost importance, and data quality is a “firm objective”. Clearly its open data project is in its infancy. Yet the scale of error experienced proves that high quality data is no accident and requires effort and commitment to achieve.
As we become more connected, businesses and the public sector are realising that investment is needed to make sure our data is fit for purpose. High profile failures should only cement our willingness to succeed.