What is data pre-processing?
Data pre-processing is an important step in the data mining process. It covers any processing performed on raw data to prepare it for a subsequent processing step. Data pre-processing transforms the data into a format that can be processed more easily and effectively for the user's purpose.
Importance of data pre-processing.
Real-world data is usually incomplete (it may contain missing values), noisy (it may contain errors introduced during transmission, or other dirty data), and inconsistent (it may contain duplicate or unexpected values). Data pre-processing is a proven way of addressing such problems.
No quality data, no quality mining results! In other words, if analysis is performed on low-quality data, the results will also be of low quality, which is undesirable in the decision-making process. For a quality result, this dirty data must first be cleaned, and converting dirty data into quality data requires data pre-processing techniques.
Major Tasks in data pre-processing.
- Data Cleaning.
- Data Integration.
- Data Transformation.
- Data Reduction.

Data Cleaning:
Data cleaning (or data cleansing) is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Data cleaning techniques attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Tasks in data cleaning:
- Fill in missing values
- Identify outliers and smooth noisy data
- Correct inconsistent data
1. Fill in missing values:
- Ignore the tuple.
- Fill in the missing values manually
- Use a global constant to fill in the missing value.
- Use the most probable value
- Use the attribute mean or median for all the samples belonging to the same class as the given tuple.
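Several of the fill-in strategies above can be sketched with pandas. This is a minimal illustration on a hypothetical toy table (the column names and values are invented for the example):

```python
import pandas as pd
import numpy as np

# Hypothetical toy table with a missing "income" value.
df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "income": [30000.0, np.nan, 52000.0, 48000.0],
})

# Strategy: use a global constant to fill in the missing value.
filled_const = df["income"].fillna(0)

# Strategy: use the overall attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# Strategy: use the attribute mean of samples in the same class.
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
```

The class-based fill usually gives a more plausible estimate than a global constant, because it conditions on the tuple's class.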
2. Identify outliers and smooth noisy data
- Binning
- Regression
- Outlier analysis.
Data Integration:
Data Integration is the process of combining data from multiple sources to provide a single unified view over all of them. Data integration can be physical or virtual.
Tasks in data integration:
- Data Integration-Combines data from multiple sources into a single data store.
- Schema integration-Integrate metadata from different sources
- Entity identification problem-Identify real-world entities from multiple data sources
- Detecting and resolving data value conflicts-For the same real-world entity, attribute values from different sources are different
- Handling Redundancy in Data Integration
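The schema integration and entity identification tasks above can be illustrated with a pandas join. In this hypothetical sketch, two sources describe the same customers but name the key column differently, so the key names must be reconciled before combining:

```python
import pandas as pd

# Two hypothetical sources with different schemas for the same entity:
# source A calls the key "cust_id", source B calls it "customer_id".
sales = pd.DataFrame({"cust_id": [1, 2, 3], "amount": [100, 250, 80]})
profiles = pd.DataFrame({"customer_id": [1, 2, 3],
                         "city": ["Pune", "Delhi", "Mumbai"]})

# Schema integration: reconcile the differing key names, then combine
# the two sources into a single view and drop the redundant key column.
merged = sales.merge(profiles, left_on="cust_id",
                     right_on="customer_id").drop(columns="customer_id")
```

Dropping the redundant key column after the merge is a small instance of handling redundancy in data integration.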
Data Transformation:
Data transformation is the process of converting data from one format or structure into another format or structure. In this preprocessing step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand.
Data Transformation Strategies:
- Smoothing: which works to remove noise from the data
- Attribute construction (Feature construction)-where new attributes are constructed and added from the given set of attributes to help the mining process.
- Aggregation-where summary or aggregation operations are applied to the data
- Normalization-where the attribute data are scaled so as to fall within a smaller range.
- Discretization-where the raw values of a numeric attribute are replaced by interval labels.
- Concept hierarchy generation for nominal data-where the attributes such as street can be generalized to higher level concepts like city or country.
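Of the strategies above, normalization is easy to make concrete. A minimal min-max normalization sketch (the income values are invented for illustration):

```python
import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    """Min-max normalization: linearly rescale attribute values so they
    fall within [new_min, new_max]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

incomes = [12000, 73600, 98000]
scaled = min_max_normalize(incomes)
# The smallest value maps to 0.0 and the largest to 1.0.
```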
Data Reduction:
A database or data warehouse may store terabytes of data, and performing complex analysis on such voluminous data may take a very long time on the complete data set. Therefore, data reduction is used to obtain a reduced representation of the data set that is much smaller in volume yet produces (almost) the same analytical results. Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form.
Data reduction Strategies:
- Data Compression
- Dimensionality reduction
- Discretization and concept hierarchy generation
- Numerosity reduction
- Data cube aggregation.
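Dimensionality reduction, one of the strategies above, is often done with principal component analysis (PCA). This is a minimal PCA sketch, not the only way to reduce dimensions; the data here are random and only illustrate the shape change:

```python
import numpy as np

def pca_reduce(X, k):
    """Minimal PCA sketch: center the data, eigendecompose the
    covariance matrix, and project onto the top-k eigenvectors."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 samples, 5 attributes
reduced = pca_reduce(X, 2)      # same samples, reduced to 2 attributes
```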