The following glossary defines a range of terms related to data quality, processing, and management. These concepts are central to the field of data management and essential for anyone working with data professionally.
**Accuracy**: The degree to which data correctly describes the real-world object or event being described. To be accurate, data must be free from error and must conform to agreed standards.
**API**: An application programming interface (API) is a set of defined rules and protocols that governs how applications talk to one another.
**Batch or Bulk Processing**: The execution of high-volume, repetitive data processing jobs that can run without manual intervention and are typically scheduled to run as resources permit. A batch process has a defined beginning and end.
**Completeness**: The proportion of stored data measured against the potential of "100% complete".
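Completeness can be measured directly as the share of populated values. A minimal sketch in Python (the record and field names are illustrative, not from any particular system):

```python
def completeness(records, fields):
    """Proportion of populated values across the given fields (1.0 = 100% complete)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    filled = sum(1 for r in records for f in fields if r.get(f) not in (None, ""))
    return filled / total

contacts = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "phone": ""},
    {"name": "Alan Turing", "email": None, "phone": "+44 20 7946 0000"},
]
print(completeness(contacts, ["name", "email", "phone"]))  # 4 of 6 values populated
```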
**Congruence**: The absence of difference when comparing two or more representations of a thing against a definition.
**Consistent**: The absence of difference when comparing two or more representations of a thing against a definition.
**Data Authentication**: Authentication is the process of determining whether someone or something is, in fact, who or what it claims to be. In the context of data quality, automated checks are performed in real time or in batch to ensure the thing being checked is real at the time of checking. Common use cases include silently testing a phone line to see whether a number will dial, or checking whether an email will deliver.
**Data Assessment, Data Audit or Data Profile**: An assessment of data quality is like an MRI scan for data: it often uncovers hidden facts and insights about the values stored in every data field.
**Data Cleansing**: The process of preparing data for analysis by amending or removing incorrect, corrupted, improperly formatted, duplicated, irrelevant, or incomplete data within a dataset.
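As an illustration, a minimal cleansing pass might trim whitespace, then drop incomplete rows and exact duplicates (the field names here are assumptions for the sketch):

```python
def cleanse(rows, required=("name", "email")):
    """Trim stray whitespace, then drop rows that are incomplete or exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        row = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        if any(not row.get(f) for f in required):
            continue  # incomplete: a required field is missing or empty
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of a row we have already kept
        seen.add(key)
        cleaned.append(row)
    return cleaned

rows = [
    {"name": " Ada ", "email": "ada@example.com"},
    {"name": "Ada", "email": "ada@example.com"},  # duplicate once trimmed
    {"name": "", "email": "x@example.com"},       # incomplete
]
print(cleanse(rows))  # one clean row survives
```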
**Data Congruence**: The process of comparing elements of a record to assess their congruence. In the context of data quality, this might involve analysing two or more data field values to ensure they are congruent, e.g. that a dialling code relates to a country, or that a first or last name appears in an email's local part.
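The email example can be sketched as a simple check (a hypothetical helper for illustration, not a production rule):

```python
def name_email_congruent(first_name, last_name, email):
    """True if the first or last name appears in the email's local part."""
    local = email.split("@", 1)[0].lower()
    return first_name.lower() in local or last_name.lower() in local

print(name_email_congruent("Jane", "Smith", "j.smith@example.com"))  # True
print(name_email_congruent("Jane", "Smith", "info@example.com"))     # False
```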
**Data Deduplication**: The process of eliminating duplicate copies of repeated data. This can be achieved with deterministic or probabilistic matching techniques that accurately identify duplicates. In the context of data quality, this often relates to the deduplication of organisations (accounts or companies), people (employees or contacts) or addresses (locations).
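A deterministic match might build a normalised key from name and email and keep one record per key. A sketch, assuming flat dictionaries with those two fields:

```python
def dedupe(contacts):
    """Deterministic matching: records sharing a normalised name + email key are duplicates."""
    seen = {}
    for contact in contacts:
        key = (contact["name"].strip().lower(), contact["email"].strip().lower())
        seen.setdefault(key, contact)  # keep the first record seen for each key
    return list(seen.values())

people = [
    {"name": "Jane Smith", "email": "JANE.SMITH@EXAMPLE.COM"},
    {"name": " jane smith ", "email": "jane.smith@example.com"},
    {"name": "John Doe", "email": "john@example.com"},
]
print(len(dedupe(people)))  # 2 unique people
```

Probabilistic matching, by contrast, scores fuzzy similarity between records and is needed when no exact key exists.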
**Data Derivation**: The process of obtaining a new piece of data from another by analysing its structure, pattern, or values. In the context of data quality, a country's ISO code might be derived from a telephone number prefix, or a country might be derived from an email domain's suffix.
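The telephone-prefix example can be sketched as a lookup. The prefix table below is a tiny illustrative sample; a real implementation would use a full ITU dialling-code dataset:

```python
# Illustrative sample only, not a complete dialling-code table.
DIALLING_PREFIXES = {"+44": "GB", "+33": "FR", "+49": "DE", "+1": "US"}

def country_from_phone(number):
    """Derive an ISO 3166 country code from a phone number's international prefix."""
    digits = number.replace(" ", "")
    # Check the longest prefixes first so that, e.g., "+44" beats a shorter "+4".
    for prefix in sorted(DIALLING_PREFIXES, key=len, reverse=True):
        if digits.startswith(prefix):
            return DIALLING_PREFIXES[prefix]
    return None  # unknown prefix, or a number in national format

print(country_from_phone("+44 20 7946 0000"))  # GB
```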
**Data Enhancement**: An increase or improvement in quality, value, or extent. In the context of data quality, enhancement might be as simple as correcting the casing of a data field value.
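Casing correction can be sketched as below; note that real names such as "McDonald" or "van der Berg" need dedicated rules beyond this naive word-by-word approach:

```python
def fix_casing(name):
    """Convert shouty or lower-case names to title case, word by word."""
    return " ".join(word.capitalize() for word in name.split())

print(fix_casing("JANE SMITH"))  # Jane Smith
print(fix_casing("jane smith"))  # Jane Smith
```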
**Data Enrichment**: The process of adding data field values that improve the quality of data by adding or correcting something that was incorrect or missing. This can include enhancing, appending, refining, and improving collected data with relevant third-party data.
**Data Formatting**: The process of transforming data into a correct or specific format. This can include transforming phone numbers, email addresses, URLs, etc. into an agreed format.
**Data Migration**: Moving data from one system (the source) to another (the target), i.e. in one direction. Migration is often a one-time process: once data has been migrated, it is not moved back, and the migration is not repeated.
**Data Integration**: The meshing of two systems that do not already talk to each other. Unlike a migration, an integration is repeatable. Often it means creating a two-way link so that users see a more complete picture of a record or contact. Integration is common in cloud applications: for example, your Customer Relationship Management (CRM) system and your accounting tool may be linked so that you can see invoices, contact details and payment history in both.
**Data Parsing**: Manipulating data by splitting it into its constituent parts. This can include phone numbers and email addresses.
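For example, an email address can be parsed into a local part, a domain, and a domain suffix. A minimal sketch:

```python
def parse_email(address):
    """Split an email address into its constituent parts."""
    local, _, domain = address.partition("@")
    _, _, suffix = domain.rpartition(".")
    return {"local": local, "domain": domain, "suffix": suffix}

print(parse_email("myname@mydomain.com"))
# {'local': 'myname', 'domain': 'mydomain.com', 'suffix': 'com'}
```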
**Data Quality**: A measure of how reliably a data set serves the specific needs of an organisation, based on factors such as accuracy, completeness, consistency and reliability.
**Data Standardisation**: The process of creating standards and transforming data taken from different sources into a consistent format that adheres to those standards.
**Data Suppression**: The process of identifying whether people are deceased, have gone away, or have expressed a preference not to be contacted.
**Data Transformation**: The process of editing data by abbreviating, elaborating, normalising, etc. Examples include abbreviating "United Kingdom" to "UK", elaborating "Rd" to "Road", or normalising "Johnny" and "Jonathan" to "John".
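All three examples reduce to rule-table lookups. A sketch, with deliberately tiny illustrative tables (a real system would hold many more entries):

```python
ABBREVIATIONS = {"United Kingdom": "UK"}
ELABORATIONS = {"Rd": "Road", "St": "Street"}
NICKNAMES = {"Johnny": "John", "Jonathan": "John"}  # normalise variants to one canonical form

def transform(value):
    """Apply abbreviation, elaboration and normalisation rules in turn."""
    for table in (ABBREVIATIONS, ELABORATIONS, NICKNAMES):
        if value in table:
            return table[value]
    return value  # no rule applies: pass the value through unchanged

print(transform("United Kingdom"), transform("Rd"), transform("Johnny"))  # UK Road John
```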
**Data Validation**: Syntactically validating that values are in the correct format. This means ensuring they look correct: for example, an address following the correct format by having a house number and a postcode.
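Syntactic validation is typically done with patterns. The regexes below are deliberately simplified for illustration; production-grade validation (e.g. full RFC 5322 email syntax) is considerably more involved:

```python
import re

UK_POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$", re.IGNORECASE)
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # rough shape check only

def is_valid_postcode(value):
    """True if the value looks like a UK postcode."""
    return bool(UK_POSTCODE.match(value.strip()))

def is_valid_email(value):
    """True if the value has the rough shape local@domain.suffix."""
    return bool(EMAIL.match(value.strip()))

print(is_valid_postcode("PO1 3AX"))           # True
print(is_valid_email("myname@mydomain.com"))  # True
```

Note that validation only says a value *looks* correct; verification (below) checks it against the real world.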
**Data Verification**: Checking whether the values or record match a proxy for the real-world entity they are supposed to represent. This would be the equivalent of checking address details against the Yellow Pages to confirm that the address matches the name given.
**Database**: An organised collection of structured information, or data, typically stored electronically in a computer system, that can be easily accessed, managed and updated.
**First-Party Data**: Data collected directly from your own sources, commonly concerning your audience or customers.
**Golden Record**: Also known as the "Single Customer View (SCV)", the golden record is a consistent and comprehensive view of all the data an organisation holds about a customer, consolidated into one record in a business application. Organisations may hold multiple records for the same contact across various business applications; once these records are made duplicate-free, complete and accurate, the result is a Golden Record.
**Master Data Management**: The technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, consistency and accountability of official shared master data assets.
**Single Customer View (SCV)**: Also known as the "Golden Record", a consistent and comprehensive view of all the data an organisation holds about a customer, consolidated into one record in a business application. Organisations may hold multiple records for the same contact across various business applications; once these records are made duplicate-free, complete and accurate, the result is a Single Customer View.
**System of Record**: The term for an information storage system that is the authoritative data source for a given data element or piece of information.
**Third-Party Data**: Data collected by a party that does not have a direct relationship with the user the data is being collected on.
**Timeliness**: The degree to which (a) data represents reality from the required point in time, and (b) consumers have the data they need at the right time.
**Uniqueness**: Nothing is recorded more than once, based upon how that thing is identified.
**Validity**: Data is valid if it conforms to the syntax (format, type, range) of its definition.
A data type is a particular kind of data item, defined by the values it can take, the programming language used, or the operations that can be performed on it.
| Type | Definition | Example |
| --- | --- | --- |
| Integer (int) | Numeric data type for whole numbers without fractions | -909, 0, 909 |
| Floating point (float) | Numeric data type for numbers with fractional parts | 909.09, 0.9, 909.00 |
| Character (char) | A single letter, digit, punctuation mark, symbol, or blank space | a, A, 9, !, ? |
| String (str or text) | A sequence of characters, digits, or symbols, always treated as text | Hello World, 0044-(0)2392-988303 Ext. 123, Straße, Zoë, Soufflé, myname@mydomain.com |
| Boolean (bool) | True or false values | 0 (false), 1 (true) |
| Enumerated type (enum) | A small set of predefined unique values (elements or enumerators) that can be text-based or numerical | blue (0), black (1), red (2), green (3) |
| Array | A list of elements in a specific order, typically all of the same type | blue (0), black (1), red (2), green (3) |
| Date | A date in the YYYY-MM-DD format (ISO 8601 syntax) | 2022-09-15 |
| Time | A time in the hh:mm:ss format, for the time of day, the time since an event, or a time interval between events | 10:00:29 |
| Datetime | A date and time together in the YYYY-MM-DD hh:mm:ss format | 2022-09-15 10:00:29 |
| Timestamp | The number of seconds elapsed since midnight (00:00:00 UTC) on 1 January 1970 (Unix time) | 1561956700 |
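
The Datetime and Timestamp types describe the same instant in different notations; a Unix timestamp converts to an ISO 8601-style datetime and back. A short Python sketch:

```python
from datetime import datetime, timezone

# A Unix timestamp counts seconds since 1970-01-01 00:00:00 UTC.
ts = 1561956700
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # the same instant as a Datetime value
print(int(dt.timestamp()))               # and back to the original Timestamp
```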
The well-known DIKW model refers to the pyramid of data, information, knowledge and wisdom, with data as the foundation. From raw data we obtain information; from that information we gain knowledge; and from that knowledge we achieve wisdom. Below is an example of how an airplane pilot might interpret data, information, knowledge and wisdom.
| Stage | Example |
| --- | --- |
| Data | The number 10,000 flashes on your display. No label, no description, no units. It is data, but it means nothing to you. |
| Information | If the display reads "10,000 feet above sea level", it is information. |
| Knowledge | If we are aware of mountains soaring to 12,000 feet, that is knowledge. |
| Wisdom | Wisdom is to climb another 2,000 feet to be safe. |
As data is the very foundation of the pyramid, it needs to be high quality and clean. The more we enrich our data with meaning and context, the more knowledge and insight we get from it, enabling us to make better-informed, data-driven business decisions.