This content originally appeared on DEV Community and was authored by Suleyman Sade
Picture this: you just finished an amazing movie on Netflix, and a suggestion pops up for another, very similar movie. Or you are shopping online and suddenly come across a product you never knew you needed. Have you ever wondered how AI makes these spot-on suggestions? What makes an AI so good that it can predict what you want before you even ask for it?
In this blog post, I am going to cover how important data is for the decisions AI makes, and how to prepare good data.
There is an important principle for AI that everybody should know about:
Garbage In, Garbage Out
If you feed in messy or wrong data, there is no way you are going to end up with a good AI. Almost every time, the result is going to be horrific.
Well then what should you do?
If you want your AI to be accurate and logical, you first need to decide what data it needs in order to make its decisions. Try to find relationships between what your AI is trying to achieve and the information you can actually get from the user or your database.
For example, let’s say you are trying to decide which genre a user likes the most, so you can give more suggestions from that genre. You could look at the genres of the movies the user watched previously, how much of each movie they finished, whether they rated it, and similar parameters.
Feature Selection
But you need to be careful and picky when choosing, because using unrelated parameters can actually damage the accuracy of your model. For example, knowing the user’s favorite color might have a negative impact on the algorithm. This concept is called “feature selection,” and it is all about identifying the most impactful and necessary parameters for the ML model.
Feature selection is not only about what to include, but also what to exclude. In some cases there might be a feature that would, in theory, enhance the algorithm’s accuracy, but if there is no reliable way to access that information, or the information is so flawed that it cannot be used, the feature should be excluded.
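To make that concrete, here is a minimal sketch of a simple, filter-style feature selection pass in pandas. The DataFrame, its column names, and the 0.3 correlation threshold are all made up for illustration; real projects often use more principled methods, but the idea of measuring how related each feature is to the target is the same.

```python
import pandas as pd

# Hypothetical viewing-history data; all column names and values are invented.
df = pd.DataFrame({
    "genre_action_pct":   [0.8, 0.1, 0.5, 0.9, 0.2],    # share of watch time spent on action
    "avg_watch_fraction": [0.9, 0.4, 0.7, 0.95, 0.3],   # how much of each movie was finished
    "avg_user_rating":    [4.5, 2.0, 3.5, 5.0, 2.5],    # ratings the user gave
    "favorite_color_id":  [2, 1, 1, 2, 3],              # a feature we suspect is just noise
    "liked_next_action":  [1, 0, 1, 1, 0],              # target: liked the next action movie?
})

# A quick sanity check of relevance: correlation of each feature with the target.
correlations = df.corr(numeric_only=True)["liked_next_action"].drop("liked_next_action")
print(correlations.sort_values(ascending=False))

# Keep only features above a hand-picked correlation threshold.
selected = correlations[correlations.abs() > 0.3].index.tolist()
print("Selected features:", selected)
```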
Well, does this mean that even one flawed row makes the whole dataset unusable?
No (well, maybe in some extreme cases), because most of the time the data is going to have some sort of error or something that needs to be changed. So you have to know how to do data cleaning, or data preprocessing, to account for these issues. In this step, messy, raw data is converted into a clean, structured format that is easier for AI to work with.
Data Cleaning
Data cleaning involves many steps and considerations, one of which is consistency of format. If you want to use age as a parameter, you can’t have “five”, 5, and “5” at the same time; you need to be consistent.
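Here is a rough sketch of what fixing that kind of inconsistency might look like in pandas. The column values and the word-to-number lookup are hypothetical; the point is simply that every entry ends up in one numeric format.

```python
import pandas as pd

# Hypothetical messy "age" column mixing words, numbers, and numeric strings.
raw = pd.Series(["five", 5, "5", "twenty", "30", None])

# Small lookup for number words; only covers what we expect to see here.
words_to_numbers = {"five": 5, "twenty": 20}

def normalize_age(value):
    if value is None:
        return None
    if isinstance(value, str):
        value = value.strip().lower()
        if value in words_to_numbers:
            return words_to_numbers[value]
    # pd.to_numeric turns "5" into 5 and anything unparseable into NaN.
    return pd.to_numeric(value, errors="coerce")

ages = raw.map(normalize_age).astype("Float64")
print(ages)  # every entry is now a number (or missing), in one consistent format
```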
Another issue data cleaning deals with is missing data. When you are missing a value in your data, there are a couple of options:
- Removal: If a column or a row is missing too much critical information, it is often better to remove the entire column or row rather than making up random values that would clutter the data.
- Imputation: This involves filling in the blanks, usually with the median or mean, or for text, the most frequent value. Avoid making up numbers unless you have enough prior knowledge. For example, say a row in the “tip amount” column is missing: logically, you can assume that an empty tip value means no tip was given, so you can fill the blank with 0. Both approaches are sketched below.
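A minimal pandas sketch of both options, using a hypothetical restaurant-bills table (all column names and values are made up):

```python
import pandas as pd
import numpy as np

# Hypothetical restaurant data with some gaps; column names are illustrative.
bills = pd.DataFrame({
    "total":      [22.5, 14.0, np.nan, 30.0, 18.5],
    "tip":        [3.0, np.nan, 4.5, np.nan, 2.0],
    "party_size": [2, 1, 3, np.nan, 2],
})

# Removal: drop rows where a critical value (the bill total) is missing.
bills = bills.dropna(subset=["total"])

# Imputation with domain logic: a missing tip most plausibly means no tip was given.
bills["tip"] = bills["tip"].fillna(0)

# Imputation with a statistic: fill missing party sizes with the median.
bills["party_size"] = bills["party_size"].fillna(bills["party_size"].median())

print(bills)
```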
All these methods help ensure that your dataset is correct, but that alone is not enough to make a successful AI. You also want to avoid giving an unfair advantage to certain values. There are a couple of ways that can happen:
Duplicate Data: Let’s say we have a dataset full of different articles, but for some reason one of those articles is duplicated. The AI will then find a 100% match while searching for patterns, which might lead it to overemphasize the contents of that article, and that is something we want to avoid.
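Catching exact duplicates is usually straightforward; here is a small pandas sketch with a made-up article table:

```python
import pandas as pd

# Hypothetical article corpus where one article was ingested twice.
articles = pd.DataFrame({
    "title": ["Intro to ML", "Data Cleaning 101", "Intro to ML"],
    "body":  ["Machine learning is...", "Cleaning data means...", "Machine learning is..."],
})

# Exact duplicates would let one article dominate pattern matching, so drop them.
deduped = articles.drop_duplicates(subset=["title", "body"], keep="first")
print(deduped)
```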
Outliers: Outliers are data points that deviate significantly from the majority. Some outliers are the result of an error in the system, for example a 100-star rating in a system that only goes up to 5 stars; these should definitely be cleaned out, as they would heavily skew the AI’s decisions. But outliers can also be real data, and in those cases it is up to you whether to include them.
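As a sketch, here is one common way to handle both cases in pandas: a hard validity rule for impossible values, plus the IQR rule to flag statistical outliers that you can then keep or drop. The data is invented for illustration.

```python
import pandas as pd

ratings = pd.Series([4, 5, 3, 100, 2, 4, 5, 1])  # 100 is clearly a system error on a 5-star scale

# Hard rule: ratings outside the valid 1-5 range are errors and get removed.
valid = ratings[(ratings >= 1) & (ratings <= 5)]

# Statistical rule (IQR): flag points that sit far outside the bulk of the data.
q1, q3 = valid.quantile(0.25), valid.quantile(0.75)
iqr = q3 - q1
outliers = valid[(valid < q1 - 1.5 * iqr) | (valid > q3 + 1.5 * iqr)]
print("Flagged outliers:", outliers.tolist())  # keeping or dropping these is a judgment call
```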
Data transformation and scaling: Let’s say we are writing an algorithm that tries to predict house prices, and we have the number of bedrooms (1–5) and the square footage (500–5000) as two different features. These features have very different ranges, and the magnitude difference might cause the AI to weight square footage more heavily than the number of bedrooms. That’s why you would scale both ranges to 0–1, so they carry comparable weight.
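Here is a minimal sketch of min-max scaling with scikit-learn, using made-up house data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical house data: bedrooms span 1-5, square footage spans 500-5000.
houses = pd.DataFrame({
    "bedrooms": [1, 2, 3, 4, 5],
    "sqft":     [500, 1200, 2400, 3600, 5000],
})

# Min-max scaling maps both columns onto the 0-1 range so neither
# feature dominates purely because of its larger magnitude.
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(houses), columns=houses.columns)
print(scaled)
```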
You may be wondering how this works with categorical data. Categories first need to be converted into numerical values. One common technique is “one-hot encoding,” which marks the presence or absence of each category with a binary value.
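A small sketch of one-hot encoding with pandas, using a hypothetical genre column:

```python
import pandas as pd

# Hypothetical categorical column listing each movie's genre.
movies = pd.DataFrame({"genre": ["action", "comedy", "drama", "action"]})

# One-hot encoding: one binary column per category, 1 marks presence.
encoded = pd.get_dummies(movies, columns=["genre"], dtype=int)
print(encoded)
#    genre_action  genre_comedy  genre_drama
# 0             1             0            0
# 1             0             1            0
# 2             0             0            1
# 3             1             0            0
```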
All these techniques ensure the clean and consistent data that is the backbone of AI and everything related to Machine Learning. It is important to remember that data cleaning does, in fact, take a lot of time, but it is a necessity for the level of AI we see today. Always remember the principle: Garbage In, Garbage Out.
So next time you watch Netflix and get that spot-on recommendation, or stumble on that one perfect product while shopping online, remember that it is not some magical AI reading your mind, but rather the painstaking journey of hammering the data into shape to find exactly what you want.