The entire goal of using Artificial Intelligence (AI) or Machine Learning is to train a program to look for and analyze patterns in a set of data and then, based on those patterns, to make certain decisions automatically or notify a human through a report or application warning. Naturally, if you have bad data, then you will get bad analysis and results. So, good data is critical to making an AI system work correctly.
In this Blog Article we will discuss this in more depth, as well as what steps you can take to ensure your data is as realistic and clean as possible.
Garbage In = Garbage Out (GIGO)
This has been a problem facing Computer professionals since the dawn of computing. If the raw data is not accurate or not formatted correctly, then the output will not be correct either. This is even more true for an AI Program or System, as you may not have a Human evaluating both the input and the output of the system. Clearly all of us want to avoid a “Do you want to play a game?” scenario (1).
Here are several of the challenges with Data that you need to watch for.
In many ways, getting a good sample of data is very similar to conducting a political poll before an election. And as we have seen during the last couple of elections in the US, this is not always accurate or easy to do. Some of the challenges that you need to look for include:
- Demographics in the Sample
There is a fairly famous example where a Software Company that specialized in facial recognition trained its AI programs on the faces of its staff, who were mostly Caucasian Men, with a sprinkling of women and other cultural and ethnic groups. So, the system was very accurate in recognizing Caucasian Men between 20 and 35, but failed when trying to recognize women, children, the elderly, and people who were Black, Asian, Polynesian, Native American, Hispanic, etc.
They just didn’t have a large enough sample to represent everyone they were trying to train the AI to recognize. So, their first attempt could recognize White Men of a certain age group very well, but everyone else… Nope.
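A quick check like the following can catch this kind of gap before training starts. It is only a minimal sketch: the group labels, the 10% threshold, and the function name are assumptions made up for illustration, and a real project would pick groups and thresholds that fit its own population.

```python
from collections import Counter


def underrepresented_groups(labels, expected_groups, min_share=0.10):
    """Return each expected group whose share of the sample is below min_share.

    Groups that never appear in the sample are reported with a share of 0.0.
    """
    if not labels:
        raise ValueError("the sample is empty")
    counts = Counter(labels)
    total = len(labels)
    shares = {group: counts.get(group, 0) / total for group in expected_groups}
    return {group: share for group, share in shares.items() if share < min_share}


# Hypothetical demographic labels for a 100-image training set.
sample = ["man_20_35"] * 85 + ["woman_20_35"] * 10 + ["man_65_plus"] * 5
expected = ["man_20_35", "woman_20_35", "man_65_plus", "child"]

flagged = underrepresented_groups(sample, expected, min_share=0.10)
print(flagged)
```

Passing in the list of groups you expect to serve matters here, because a group with no records at all would otherwise never show up in the counts.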
- Sample Size
A common problem in any form of statistical analysis is either not getting a large enough sample or not getting a sample that represents the population well. If your AI or machine learning program doesn’t have a large, high-quality sample, it can and will come to the wrong conclusions. After all, it only looks for patterns in the data you provide.
For example, if you run a survey in the United States and only ask Gun Owners what they think about the 2nd Amendment and Gun Control Efforts, you are going to get a very different response than if you asked a broader population.
However, the reverse is also true. If I am designing an AI program for Remington to help market and sell a newly designed rifle, then I specifically want to target individuals who are likely to purchase it. Even if I provide the AI with an extremely large data sample, it should be trained to disregard individuals who will most likely never purchase the new rifle.
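To put a rough number on "large enough", pollsters often use the standard sample-size formula for estimating a proportion, n = z² · p · (1 − p) / e². The sketch below assumes a simple random sample from a large population; the function name and default values are my own choices for illustration. Note that it says nothing about who is in the sample, so it will not rescue a survey that only asked Gun Owners.

```python
import math


def required_sample_size(margin_of_error, z_score=1.96, proportion=0.5):
    """Sample size needed to estimate a proportion within a margin of error.

    z_score=1.96 corresponds to roughly 95% confidence, and proportion=0.5
    is the most conservative (largest sample) assumption.
    """
    n = (z_score ** 2) * proportion * (1 - proportion) / (margin_of_error ** 2)
    return math.ceil(n)


n_for_5_percent = required_sample_size(0.05)
n_for_3_percent = required_sample_size(0.03)
print(n_for_5_percent, n_for_3_percent)
```

Tightening the margin of error from 5% to 3% nearly triples the number of responses you need, which is why small samples are so tempting and so risky.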
- Sampling Error / Questions
One of the old adages in Statistics is that if you “Ask the Right Question, then you will get the Answer you are Looking For.”
Meaning, if you accidentally or purposefully phrase a question a certain way, you can get the result you are looking for, but it might not be the true answer. The same thing can occur with AI and machine learning. If the data or answers being produced are due to a flawed question or input, then the output will also be flawed.
- Sampling Error / Methods
We also need to look at how we are collecting the Data that the AI or machine learning program is using. For example, if today we conducted a Political Poll calling only Land Line Telephones, who would we reach? Primarily individuals who are elderly and retired. If we expanded this to include Mobile Phone Numbers, many people simply wouldn’t answer.
The reason is the sheer number of automated Sales Calls and Phishing / Scam Calls; people no longer answer calls from numbers they don’t know.
For example, how many times have you gotten a call from someone claiming to be the Internal Revenue Service, saying that you owe back taxes and that if you don’t pay immediately, the Sheriff will be called to come and arrest you?
With any data set you will have issues with duplicate data or data that is simply incorrect. And unfortunately, the larger your data set is, the greater the chance is that you will have some errors in it. Here are some common issues that you have to consider, look for, and most importantly fix:
Traditionally in the United States, an individual will have a First Name, Middle Name, and Last (or family) Name. For example: David Ray Annis
Occasionally you may also have a Junior, Senior, III, or IV designation. But this is not always the case, especially in other countries. For example, is Luis Angel Lopez Chairez the same person as Luis Lopez? In this case, yes he is. But an AI might have trouble recognizing that unless you trained it on the patterns of Hispanic Names.
It gets even more challenging for married individuals, as they may hyphenate their names or take their spouse's last name as their own. Plus many individuals around the world will use a variation of their actual name.
Using an eMail address is often the best approach to identifying a person as a unique entity. However, this also doesn’t always work. I think I am personally at about 7 different eMail addresses that are active, each used for a specific purpose.
So, in those cases we need to train the AI to look for broader patterns, or what is known as a Composite Key to identify a unique person.
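Here is a minimal sketch of what such a Composite Key might look like. The choice of fields (first given name, birth date, postal code) is an assumption for illustration, not a recommendation, and the birth dates, postal codes, and eMail addresses in the sample records are invented.

```python
import unicodedata


def normalize(text):
    """Lowercase, strip accents, and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def composite_key(record):
    """Build a key from first given name, birth date, and postal code.

    The eMail address is deliberately left out, since one person may use many.
    """
    name_parts = normalize(record["name"]).split()
    first_name = name_parts[0] if name_parts else ""
    return (first_name, record["birth_date"], normalize(record["postal_code"]))


# Hypothetical records: same person, different name variant and eMail address.
record_a = {"name": "Luis Angel Lopez Chairez", "birth_date": "1980-01-15",
            "postal_code": "64000", "email": "luis.work@example.com"}
record_b = {"name": "Luis  LOPEZ", "birth_date": "1980-01-15",
            "postal_code": "64000 ", "email": "luis.home@example.com"}

key_a = composite_key(record_a)
key_b = composite_key(record_b)
print(key_a == key_b)
```

A matching key should be treated as a candidate duplicate for further checks rather than as proof, since two different people named Luis can share a birth date and a postal code.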
It is important to remember that date formats follow different patterns around the world. The system needs to note which country’s date format is used for each record; otherwise the AI can get as confused as you and I.
For example, is my birthdate on 7/11 or 11/7? Both can be correct, but they mean two totally different things depending on whether you are using the US standard (month/day) or the European standard (day/month).
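One way to remove the ambiguity is to store the date convention with each record and convert everything to a single unambiguous form, such as ISO 8601, before the AI ever sees it. The sketch below assumes just two conventions, labeled "US" and "EU" for illustration; the year is arbitrary.

```python
from datetime import datetime

# Hypothetical convention labels mapped to strptime format strings.
DATE_FORMATS = {
    "US": "%m/%d/%Y",  # month/day/year
    "EU": "%d/%m/%Y",  # day/month/year
}


def parse_local_date(text, convention):
    """Parse a date string using the convention recorded for that record."""
    return datetime.strptime(text, DATE_FORMATS[convention]).date()


us_date = parse_local_date("7/11/2024", "US")  # July 11
eu_date = parse_local_date("7/11/2024", "EU")  # November 7

print(us_date.isoformat(), eu_date.isoformat())
```

The same input string produces two different dates, which is exactly why the convention has to travel with the data rather than be guessed afterwards.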
In the US, everyone is familiar with a currency that uses two decimal places, such as $ 1,560.89. However, a number of currencies do not follow this format. Several have no decimal places, while others may have 4 or even 6. In addition, many countries swap the separators, using a period for thousands and a comma for decimals, so the same amount is written $ 1.560,89.
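As with dates, the fix is to record which numeric convention each amount uses and convert to one internal representation. This is a minimal sketch assuming only two conventions, labeled "US" and "EU" for illustration, and a leading dollar sign; a production system would more likely lean on a locale-aware library.

```python
from decimal import Decimal

# Hypothetical convention labels: (thousands separator, decimal separator).
SEPARATORS = {
    "US": (",", "."),
    "EU": (".", ","),
}


def parse_amount(text, convention):
    """Convert a formatted currency string to a Decimal."""
    thousands_sep, decimal_sep = SEPARATORS[convention]
    digits = text.replace("$", "").strip()
    # Drop the thousands separator first, then standardize the decimal mark.
    digits = digits.replace(thousands_sep, "").replace(decimal_sep, ".")
    return Decimal(digits)


us_amount = parse_amount("$ 1,560.89", "US")
eu_amount = parse_amount("$ 1.560,89", "EU")
print(us_amount, eu_amount)
```

Using Decimal rather than a floating point number keeps the cents exact, which matters once amounts are summed or compared.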
Whenever possible, you want to use an Address Service that verifies that the Address entered by a user or system is correct and valid. Otherwise you can encounter all sorts of strange or duplicate entries, which will cause bad data. Probably the most important thing is to train the AI on standard abbreviations, what they mean, and in what context.
For example, “Street” and “Saint” mean very different things. However, both use the abbreviation “ST”. And of course, the AI needs to be trained to recognize International Postal Codes, as most countries do not use the US format of 5 + 4 digits, and many Postal Codes include letters as well as numbers.
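A simple per-country pattern check can catch the most obvious Postal Code errors before they reach the AI. The sketch below covers only the US and Canada, and the Canadian pattern is deliberately simplified (the real rules exclude certain letters); the table and function name are assumptions for illustration.

```python
import re

# Simplified, illustrative patterns; real postal rules are stricter.
POSTAL_PATTERNS = {
    "US": re.compile(r"\d{5}(-\d{4})?"),           # 12345 or 12345-6789
    "CA": re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d"),  # A1A 1A1
}


def is_valid_postal_code(code, country):
    """Return True or False for known countries, None when we cannot judge."""
    pattern = POSTAL_PATTERNS.get(country)
    if pattern is None:
        return None
    return pattern.fullmatch(code.strip().upper()) is not None


print(is_valid_postal_code("90210-1234", "US"))
print(is_valid_postal_code("K1A 0B1", "CA"))
print(is_valid_postal_code("K1A 0B1", "US"))
```

Returning None for an unknown country, instead of False, keeps the system from rejecting a perfectly good code just because nobody has added that country's format yet.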
I strongly recommend using an Address Service to validate and modify addresses that are entered into the system by users. Naturally, you should allow the User to override or update the suggested Address, but this will cut down on data errors.
So, in this first Blog Article on Artificial Intelligence, we’ve discussed the Data issues and what to look for to make sure you have a large enough Data sample and that it is as clean as possible. In the next Article, we will discuss How to Train Your AI, and what Logic Errors to avoid.
We hope that you’ve enjoyed this article.
Thank you, David Annis.