Why data diversity is important for AI development

On May 20, 2019

DEVELOPING artificial intelligence (AI) algorithms requires vast amounts of data, and with new, more sophisticated iterations, the technology demands more data to deliver the results expected.

AI depends on data: it studies patterns and trends, ‘learns’ from them, and uses what it has learned to interpolate and automatically execute certain functions.

It is therefore crucial that the data on which AI algorithms are built is not homogeneous or biased towards certain elements.

For example, a facial recognition algorithm trained solely on the physical characteristics of Western populations may fail to identify Asian faces or, worse, miscategorize them.

Similarly, a hypothetical system deployed to recruit potential candidates for a job may be partial to one gender or ethnicity if the data it was fed was not varied.

In other words, the effectiveness of the technology relies heavily on the data, and this presents a new problem for AI developers, who must address data bias before anything else.

Failing to do so will result in sub-standard AI products and harm enterprise use cases.

The solution lies with diversity

While AI by itself does not have built-in biases, data and its sources do, which could lead the technology to establish an inaccurate relationship between two variables, and amplify the mistake by making more of the same misguided inferences.

To solve this issue, developers have to start at the data collection and curation phase.

Experts recommend that procedures be put in place to ensure the data is sufficiently diverse and proportionally accounts for all variables.
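One way such a procedure might look in practice is a simple check that compares each group's share of a dataset with a target share. The sketch below is illustrative only: the function name `representation_gaps`, the region labels, the target shares, and the 5 percent tolerance are all assumptions made for this example and do not come from any particular tool or standard.

```python
from collections import Counter


def representation_gaps(groups, reference_shares, tolerance=0.05):
    """Return {group: observed_share - expected_share} for every group in
    reference_shares whose share in the data is off by more than tolerance."""
    total = len(groups)
    if total == 0:
        raise ValueError("groups must not be empty")
    counts = Counter(groups)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 3)
    return gaps


# Hypothetical sample: region labels attached to ten training images.
sample = ["west"] * 8 + ["asia"] * 1 + ["africa"] * 1
# Hypothetical target shares the team wants the dataset to reflect.
target = {"west": 0.4, "asia": 0.4, "africa": 0.2}

# Positive values mean over-represented, negative values under-represented.
print(representation_gaps(sample, target))
```

Note that this sketch only checks groups named in the target; labels that appear in the data but not in the target are ignored, and choosing the target shares is itself a judgment call.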

Companies with a global presence should make it a point to analyze data from all their operations when developing a new process, before integrating it with an AI solution, so that every element and aspect is accounted for.

Companies developing computer vision software or natural language processing (NLP) systems should pay particular attention to this advice, as it will not only improve the quality of their products but also widen their market access.

Admittedly, de-biasing the data completely may not be possible given the challenges and resources required, but minimizing bias by way of diversifying the data is very much within the realm of possibility.

Data scientists should develop ways to analyze data distributions more thoroughly, and fix abnormal correlations between variables.
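As a rough illustration of that kind of analysis, the sketch below computes the Pearson correlation between each input column and the outcome, and flags columns whose correlation is suspiciously strong. Everything in it is assumed for the example: the helper names `pearson` and `flag_correlated`, the tiny hiring dataset, the 0/1 encoding of gender, and the 0.5 threshold. A flagged column is a prompt for human review, not proof of bias.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def flag_correlated(columns, outcome, threshold=0.5):
    """Return {column_name: r} for columns with |r| >= threshold against outcome."""
    flagged = {}
    for name, values in columns.items():
        r = pearson(values, outcome)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


# Hypothetical hiring records: 1 = hired, 0 = rejected.
hired = [1, 1, 1, 0, 0, 0, 0, 1]
columns = {
    "gender": [1, 1, 1, 1, 0, 0, 0, 0],      # assumed 0/1 encoding
    "experience": [3, 5, 2, 4, 3, 5, 2, 4],  # years of experience
}

# Columns listed here track the hiring outcome closely and deserve a closer look.
print(flag_correlated(columns, hired))
```

Pearson correlation only captures linear relationships between numeric columns, so a real audit would likely need other measures as well.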

In short, for AI to realize its full potential, continuous improvement with regard to data optimization is necessary.