Data Readiness Report

Manish Kesarwani
Jul 14, 2021

Data exploration and quality analysis are essential yet tedious steps in the AI pipeline. The journey of data, from its collection to serving as training data for an AI system, mostly takes place in an ad hoc manner and is barely documented. Consumers have no insight into the quality of incoming data or its readiness for a machine learning task, and may spend many precious cycles exploring such properties.

According to the 2018 Kaggle Machine Learning & Data Science Survey [1], data scientists spend about 80% of their valuable time gathering, cleaning, visualising, and improving the data using inputs from various stakeholders.

2018 Kaggle Machine Learning & Data Science Survey

Anomalies and inconsistencies may creep into the data at any of the ingestion, aggregation, or annotation stages. Machine learning contexts also introduce additional quality requirements, such as ensuring adequate representation of target classes, checking for collinearity, detecting discriminatory features and bias, identifying mislabeled samples, and various other data validation checks. These issues affect the quality, or readiness, of a dataset for building reliable and efficient machine learning models.
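
To make these checks concrete, here is a minimal sketch of a few of them, assuming the data lives in a pandas DataFrame with a categorical target column. The check names, the 0.9 collinearity threshold, and the output format are illustrative choices of ours, not the scoring scheme defined in our paper.

```python
import pandas as pd

def basic_readiness_checks(df: pd.DataFrame, label: str) -> dict:
    """Illustrative data-quality checks of the kind listed above."""
    report = {}

    # Completeness: fraction of missing values per column.
    report["missing_fraction"] = df.isna().mean().to_dict()

    # Duplicate rows that may have crept in during aggregation.
    report["duplicate_rows"] = int(df.duplicated().sum())

    # Class representation: relative frequency of each target class.
    report["class_balance"] = df[label].value_counts(normalize=True).to_dict()

    # Collinearity: pairs of numeric features with high absolute correlation.
    corr = df.select_dtypes("number").corr().abs()
    report["collinear_pairs"] = [
        (a, b, round(corr.loc[a, b], 3))
        for a in corr.columns
        for b in corr.columns
        if a < b and corr.loc[a, b] > 0.9  # illustrative threshold
    ]

    return report
```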

The figure below illustrates how various stakeholders, such as data stewards, subject matter experts, and data scientists, are involved at different stages of the data ingestion and processing workflow. Each persona's interaction with the data is determined by their role, and each provides a different kind of input for a different purpose.

Various personas in an AI pipeline

The focus on standardised documentation to accompany AI assets is a relatively recent phenomenon. One of the earliest significant works to highlight this practice introduced the concept of Datasheets [2], proposed as accompanying records for datasets that document their motivation, composition, collection process, recommended uses, distribution, maintenance, and so on, thereby encouraging transparency and accountability for data used by the machine learning community.

Model Cards [3] were developed on the same principle: short documents that accompany trained machine learning models and provide information such as general details, intended use cases, evaluation data and metrics, and model performance measures. Similar to model cards, but oriented towards AI services, are FactSheets [4]. A FactSheet for an AI service captures attributes such as intended use, performance, safety, and security. It aims to bridge the expertise gap between the producer and consumer of an AI service by communicating these attributes in a standardised way.

The primary focus of these works has been on highlighting data characteristics and data quality issues in specific ways. They do not cover remediation of the identified quality issues or explanations for them, nor do they capture the lineage of data assessment operations and the role of various personas in a collaborative data preparation environment.

In our recent paper, accepted at the 2021 IEEE International Conference on Smart Data Services (SMDS), we propose the Data Readiness Report, an artifact that serves both as a certification of the baseline quality of ingested data and as a record of the operations and remediations performed on it. The readiness report becomes a one-stop reference for understanding the quality and readiness of the data, including the lineage of transformations applied.
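
Concretely, one can imagine each check in such a report being recorded together with its score, any remediation applied, and the lineage of operations that produced the current version of the data. The sketch below is a hypothetical, simplified record; all field names and values are illustrative and do not reproduce the actual report schema from the paper.

```python
# A hypothetical record for one readiness check; the real report schema
# is defined in the paper and is richer than this sketch.
readiness_entry = {
    "dimension": "label_purity",              # quality dimension checked
    "score": 0.94,                            # baseline quality score
    "details": "312 of 5,000 samples flagged as potentially mislabeled",
    "remediation": {
        "operation": "relabel_flagged_samples",
        "performed_by": "subject_matter_expert",
        "score_after": 0.99,                  # score after remediation
    },
    "lineage": ["ingest_v1", "dedup_v1", "relabel_v1"],  # ordered operations
}
```

Keeping the before and after scores alongside the remediation is what lets, say, a subject matter expert see how a given data issue and its fix moved the quality of the dataset.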

The following figure shows some of the ways the report can benefit the personas discussed earlier. For instance, data scientist A can quickly understand the challenges and quality issues in the data from the report without having to repeat an analysis already conducted by, say, data scientist B. A subject matter expert can review the remediations applied to the data and learn how data issues affect quality scores.

Generation of Data Readiness Report and its Utility Across Personas

The figure below shows a sample Data Readiness Report for the Adult dataset from the UCI Machine Learning Repository [5]. In this example, we use only a limited set of features to illustrate how essential information from the quality analysis process can be represented in the report.

A Sample Data Readiness Report — Adult Dataset
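
As a rough, hypothetical illustration (not the report shown in the figure), the checks sketched earlier could be run on the Adult data directly. The download URL and column names below follow the UCI repository documentation [5], and `basic_readiness_checks` refers to the sketch given above.

```python
import pandas as pd

# Standard UCI location of the Adult training split; column names follow
# the dataset documentation at the UCI repository [5].
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
COLS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Missing values in Adult are encoded as "?"; fields have leading spaces.
adult = pd.read_csv(URL, header=None, names=COLS,
                    skipinitialspace=True, na_values="?")

report = basic_readiness_checks(adult, label="income")
print(report["class_balance"])                  # heavy skew towards "<=50K"
print(report["missing_fraction"]["workclass"])  # missingness from "?" entries
```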

For more details on the Data Readiness Report, see our paper: https://arxiv.org/pdf/2010.07213.pdf

Authors: Shazia Afzal, Rajmohan C, Manish Kesarwani, Sameep Mehta and Hima Patel

Affiliation: IBM Research, India

References:

[1] 2018 Kaggle Machine Learning & Data Science Survey. [Online]. Available: https://www.kaggle.com/paultimothymooney/2018-kaggle-machine-learning-data-science-survey/

[2] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford, “Datasheets for datasets,” arXiv preprint arXiv:1803.09010, 2018.

[3] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model cards for model reporting,” in Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019, pp. 220–229.

[4] M. Arnold, R. K. E. Bellamy, M. Hind, S. Houde, S. Mehta, A. Mojsilović, R. Nair, et al., “FactSheets: Increasing trust in AI services through supplier’s declarations of conformity,” IBM Journal of Research and Development, vol. 63, no. 4/5, pp. 6:1–6:13, 2019.

[5] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
