
Process Data from Dirty to Clean



Here are some other things to watch out for:

  • Data replication compromising data integrity: Continuing with the example, imagine you ask your international counterparts to verify dates and stick to one format. One analyst copies a large dataset to check the dates. But because of memory issues, only part of the dataset is actually copied. The analyst would be verifying and standardizing incomplete data. That partial dataset would be certified as compliant but the full dataset would still contain dates that weren't verified. Two versions of a dataset can introduce inconsistent results. A final audit of results would be essential to reveal what happened and correct all dates.

  • Data transfer compromising data integrity: Another analyst checks the dates in a spreadsheet and chooses to import the validated and standardized data back to the database. But suppose the date field from the spreadsheet was incorrectly classified as a text field during the data import (transfer) process. Now some of the dates in the database are stored as text strings. At this point, the data needs to be cleaned to restore its integrity. 

  • Data manipulation compromising data integrity: When checking dates, another analyst notices what appears to be a duplicate record in the database and removes it. But it turns out that the analyst removed a unique record for a company’s subsidiary and not a duplicate record for the company. Your dataset is now missing data and the data must be restored for completeness.

Reference: Data constraints and examples

As you progress in your data journey, you'll come across many types of data constraints (or criteria that determine validity). Below are definitions and examples of data constraint terms you might come across.

Data type
Definition: Values must be of a certain type: date, number, percentage, Boolean, etc.
Example: If the data type is a date, a single number like 30 would fail the constraint and be invalid.

Data range
Definition: Values must fall between predefined maximum and minimum values.
Example: If the data range is 10-20, a value of 30 would fail the constraint and be invalid.

Mandatory
Definition: Values can’t be left blank or empty.
Example: If age is mandatory, that value must be filled in.

Unique
Definition: Values can’t have a duplicate.
Example: Two people can’t have the same mobile phone number within the same service area.

Regular expression (regex) patterns
Definition: Values must match a prescribed pattern.
Example: A phone number must match ###-###-#### (no other characters allowed).

Cross-field validation
Definition: Certain conditions for multiple fields must be satisfied.
Example: Values are percentages, and values from multiple fields must add up to 100%.

Primary-key
Definition: (Databases only) Value must be unique per column.
Example: A database table can’t have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program.

Set-membership
Definition: (Databases only) Values for a column must come from a set of discrete values.
Example: The value for a column must be set to Yes, No, or Not Applicable.

Foreign-key
Definition: (Databases only) Values for a column must match values that appear in a column of another table.
Example: In a U.S. taxpayer database, the State column must be a valid state or territory, with the set of acceptable values defined in a separate States table.

Accuracy
Definition: The degree to which the data conforms to the actual entity being measured or described.
Example: If values for zip codes are validated by street location, the accuracy of the data goes up.

Completeness
Definition: The degree to which the data contains all desired components or measures.
Example: If data for personal profiles required hair and eye color, and both are collected, the data is complete.

Consistency
Definition: The degree to which the data is repeatable from different points of entry or collection.
Example: If a customer has the same address in the sales and repair databases, the data is consistent.
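
A few of these constraints can be spot-checked directly in a spreadsheet. As a minimal sketch, assuming the value to test is in cell A2 (and noting that REGEXMATCH is available in Google Sheets but not in Excel):

=AND(A2 >= 10, A2 <= 20) returns TRUE only when A2 satisfies the 10-20 data range constraint
=REGEXMATCH(TO_TEXT(A2), "^[0-9]{3}-[0-9]{3}-[0-9]{4}$") returns TRUE only when A2 matches the ###-###-#### phone number pattern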

Conclusion

Fortunately, with a standard date format and compliance by all people and systems that work with the data, data integrity can be maintained. But no matter where your data comes from, always be sure to check that it is valid, complete, and clean before you begin any analysis. 

Well-aligned objectives and data

You can gain powerful insights and make accurate conclusions when data is well-aligned to business objectives. As a data analyst, alignment is something you will need to judge. Good alignment means that the data is relevant and can help you solve a business problem or determine a course of action to achieve a given business objective.

In this reading, you will review the business objectives associated with three scenarios. You will explore how clean data and well-aligned business objectives can help you come up with accurate conclusions. On top of that, you will learn how new variables discovered during data analysis can cause you to set up data constraints so you can keep the data aligned to a business objective.  

Clean data + alignment to business objective = accurate conclusions

Business objective

Account managers at Impress Me, an online content subscription service, want to know how soon users view content after their subscriptions are activated. 


To start off, the data analyst verifies that the data exported to spreadsheets is clean and confirms that the data needed (when users access content) is available. Knowing this, the analyst decides there is good alignment of the data to the business objective. All that is missing is figuring out exactly how long it takes each user to view content after their subscription has been activated.

Here are the data processing steps the analyst takes for a user from an account called V&L Consulting. (These steps would be repeated for each subscribing account, and for each user associated with that account.)

Step 1
Data-processing step: Look up the activation date for V&L Consulting.
Source of data: Account spreadsheet.
Relevant data in spreadsheet: [Screenshot of two columns, Account Name and Activation Date, with the date 10/21/2019 highlighted]
Result: October 21, 2019

Step 2
Data-processing step: Look up the name of a user belonging to the V&L Consulting account.
Source of data: Account spreadsheet (users tab).
Relevant data in spreadsheet: [Screenshot of two columns, Account Name and Users, with the name Maria Ballantyne highlighted]
Result: Maria Ballantyne

Step 3
Data-processing step: Find the first content access date for Maria B.
Source of data: Content usage spreadsheet.
Relevant data in spreadsheet: [Screenshot of two columns, Users and Access Dates, with the date 10/31/2019 highlighted]
Result: October 31, 2019

Step 4
Data-processing step: Calculate the time between activation and first content usage for Maria B.
Source of data: New spreadsheet calculation.
Relevant data in spreadsheet: [Screenshot of a spreadsheet with columns for Account, Users, Activation Date, 1st Access Date, and Number of Days, with the number 10 highlighted]
Result: 10 days

Pro tip 1

In the above process, the analyst could use VLOOKUP to look up the data in Steps 1, 2, and 3 to populate the values in the spreadsheet in Step 4. VLOOKUP is a spreadsheet function that searches for a certain value in a column to return a related piece of information. Using VLOOKUP can save a lot of time; without it, you have to look up dates and names manually.

Refer to the VLOOKUP page in the Google Help Center for how to use the function in Google Sheets.
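
As a sketch of how this might look, suppose the account name is typed in cell A2 and a separate Accounts tab holds account names in column A and activation dates in column B (the tab and cell references here are illustrative, not part of the original exercise):

=VLOOKUP(A2, Accounts!A:B, 2, FALSE)

The 2 asks for the value from the second column of the range (the activation date), and FALSE requests an exact match on the account name.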

Pro tip 2

In Step 4 of the above process, the analyst could use the DATEDIF function to automatically calculate the difference between the dates in column C and column D. The function can calculate the number of days between two dates. 

Refer to the Microsoft Support DATEDIF page for how to use the function in Excel. The DAYS360 function does the same thing in accounting spreadsheets that use a 360-day year (twelve 30-day months).

Refer to the DATEDIF page in the Google Help Center for how to use the function in Google Sheets.
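
For example, with the activation date in cell C2 and the first access date in cell D2, the number of days between them could be calculated with:

=DATEDIF(C2, D2, "D")

The "D" unit returns the difference in whole days; DATEDIF expects the earlier date as its first argument.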

Alignment to business objective + additional data cleaning = accurate conclusions 

Business objective

Cloud Gate, a software company, recently hosted a series of public webinars as free product introductions. The data analyst and webinar program manager want to identify companies that had five or more people attend these sessions. They want to give this list of companies to sales managers who can follow up for potential sales.  


The webinar attendance data includes the fields and data shown below.

  • Name: <First name> <Last name> (required information attendees had to submit)

  • Email Address: xxxxx@company.com (required information attendees had to submit)

  • Company: <Company name> (optional information attendees could provide)

Data cleaning

The webinar attendance data seems to align with the business objective. But the data analyst and program manager decide that some data cleaning is needed before the analysis. They think data cleaning is required because:

  • The company name wasn’t a mandatory field. If the company name is blank, it might be found from the email address. For example, if the email address is username@google.com, the company field could be filled in with Google for the data analysis. This data cleaning step assumes that people with company-assigned email addresses attended a webinar for business purposes. (A formula sketch for this step appears after this list.)

  • Attendees could enter any name. Since attendance across a series of webinars is being looked at, they need to validate names against unique email addresses. For example, if Joe Cox attended two webinars but signed in as Joe Cox for one and Joseph Cox for the other, he would be counted as two different people. To prevent this, they need to check his unique email address to determine that he was the same person. After the validation, Joseph Cox could be changed to Joe Cox to match the other instance.
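
As a rough sketch of the first cleaning step, suppose the email addresses are in column B and column C is used as a helper for the extracted domain (these cell references are illustrative):

=MID(B2, FIND("@", B2) + 1, LEN(B2)) placed in C2, extracts everything after the @ (for example, google.com)
=PROPER(LEFT(C2, FIND(".", C2) - 1)) placed in D2, keeps the part before the first dot and capitalizes it (Google)

This only handles simple domains; addresses from personal email providers or multi-part domains like name@mail.example.co.uk would still need manual review.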

Alignment to business objective + newly discovered variables + constraints = accurate conclusions 

Business objective

An after-school tutoring company, A+ Education, wants to know if there is a minimum number of tutoring hours needed before students have at least a 10% improvement in their assessment scores.


The data analyst thinks there is good alignment between the data available and the business objective because:

  • Students log in and out of a system for each tutoring session, and the number of hours is tracked

  • Assessment scores are regularly recorded  

Data constraints for new variables

After looking at the data, the data analyst discovers that there are other variables to consider. Some students had consistent weekly sessions while other students had scheduled sessions more randomly even though their total number of tutoring hours was the same. The data doesn’t align as well with the original business objective as first thought, so the analyst adds a data constraint to focus only on the students with consistent weekly sessions. This modification helps to get a more accurate picture about the enrollment time needed to achieve a 10% improvement in assessment scores. 

Duplicate data
Description: Any data record that shows up more than once.
Possible causes: Manual data entry, batch data imports, or data migration.
Potential harm to businesses: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval.

Outdated data
Description: Any data that is old and should be replaced with newer, more accurate information.
Possible causes: People changing roles or companies, or software and systems becoming obsolete.
Potential harm to businesses: Inaccurate insights, decision-making, and analytics.

Incomplete data
Description: Any data that is missing important fields.
Possible causes: Improper data collection or incorrect data entry.
Potential harm to businesses: Decreased productivity, inaccurate insights, or inability to complete essential services.

Incorrect/inaccurate data
Description: Any data that is complete but inaccurate.
Possible causes: Human error during data input, fake information, or mock data.
Potential harm to businesses: Inaccurate insights or decision-making based on bad information, resulting in revenue loss.

Inconsistent data
Description: Any data that uses different formats to represent the same thing.
Possible causes: Data stored incorrectly or errors inserted during data transfer.
Potential harm to businesses: Contradictory data points leading to confusion, or an inability to classify or segment customers.
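
In a spreadsheet, one quick way to surface the first of these problems, duplicate data, is a helper column with COUNTIF. A minimal sketch, assuming the values being checked for duplicates are in column A:

=COUNTIF(A:A, A2) > 1 returns TRUE whenever the value in A2 appears more than once in column A

Copying the formula down the helper column flags every duplicated row so you can review it before deleting anything.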

Business impact of dirty data

For further reading on the business impact of dirty data, enter the term “dirty data” into your preferred browser’s search bar to bring up numerous articles on the topic. Here are a few impacts cited for certain industries from a previous search:

  • Banking: Inaccuracies cost companies between 15% and 25% of revenue (source). 
  • Digital commerce: Up to 25% of B2B database contacts contain inaccuracies (source).
  • Marketing and sales: 8 out of 10 companies have said that dirty data hinders sales campaigns (source). 
  • Healthcare: Duplicate records can make up 10%, and even up to 20%, of a hospital’s electronic health records (source).


Key takeaways

Hopefully these examples give you a sense of what to look for to know if your data aligns with your business objective. 

  • When there is clean data and good alignment, you can get accurate insights and make conclusions the data supports.

  • If there is good alignment but the data needs to be cleaned, clean the data before you perform your analysis. 

  • If the data only partially aligns with an objective, think about how you could modify the objective, or use data constraints to make sure that the subset of data better aligns with the business objective.



    Common mistakes to avoid

    • Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most of the time the wrong spelling or common grammatical errors can be detected, but it gets harder with things like names or addresses. For example, if you are working with a spreadsheet table of customer data, you might come across a customer named “John” whose name has been input incorrectly as “Jon” in some places. The spreadsheet’s spellcheck probably won’t flag this, so if you don’t double-check for spelling errors and catch this, your analysis will have mistakes in it. 
    • Forgetting to document errors: Documenting your errors can be a big time saver, as it helps you avoid those errors in the future by showing you how you resolved them. For example, you might find an error in a formula in your spreadsheet. You discover that some of the dates in one of your columns haven’t been formatted correctly. If you make a note of this fix, you can reference it the next time your formula is broken, and get a head start on troubleshooting. Documenting your errors also helps you keep track of changes in your work, so that you can backtrack if a fix didn’t work. 
    • Not checking for misfielded values: A misfielded value happens when the values are entered into the wrong field. These values might still be formatted correctly, which makes them harder to catch if you aren’t careful. For example, you might have a dataset with columns for cities and countries. These are the same type of data, so they are easy to mix up. But if you were trying to find all of the instances of Spain in the country column, and Spain had mistakenly been entered into the city column, you would miss key data points. Making sure your data has been entered correctly is key to accurate, complete analysis. 
    • Overlooking missing values: Missing values in your dataset can create errors and give you inaccurate conclusions. For example, if you were trying to get the total number of sales from the last three months, but a week of transactions was missing, your calculations would be inaccurate. As a best practice, try to keep your data as clean as possible by maintaining completeness and consistency.
    • Only looking at a subset of the data: It is important to think about all of the relevant data when you are cleaning. This helps make sure you understand the whole story the data is telling, and that you are paying attention to all possible errors. For example, if you are working with data about bird migration patterns from different sources, but you only clean one source, you might not realize that some of the data is being repeated. This will cause problems in your analysis later on. If you want to avoid common errors like duplicates, each field of your data requires equal attention.
    • Losing track of business objectives: When you are cleaning data, you might make new and interesting discoveries about your dataset, but you don’t want those discoveries to distract you from the task at hand. For example, if you were working with weather data to find the average number of rainy days in your city, you might notice some interesting patterns about snowfall, too. That is really interesting, but it isn’t related to the question you are trying to answer right now. Being curious is great! But try not to let it sidetrack your analysis.
    • Not fixing the source of the error: Fixing the error itself is important. But if that error is actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will have to keep fixing that same error over and over again. For example, imagine you have a team spreadsheet that tracks everyone’s progress. The table keeps breaking because different people are entering different values. You can keep fixing all of these problems one by one, or you can set up your table to streamline data entry so everyone is on the same page. Addressing the source of the errors in your data will save you a lot of time in the long run. 
    • Not analyzing the system prior to data cleaning: If you want to clean your data and avoid future errors, you need to understand the root cause of your dirty data. Imagine you are an auto mechanic. You would find the cause of the problem before you started fixing the car, right? The same goes for data. First, figure out where the errors come from. Maybe they come from data entry mistakes, the lack of a spell check, inconsistent formats, or duplicates. Once you understand where bad data comes from, you can control it and keep your data clean.
    • Not backing up your data prior to data cleaning: It is always good to be proactive and create a backup of your data before you start your data clean-up. If your program crashes, or if your changes cause a problem in your dataset, you can always go back to the saved version and restore it. The simple procedure of backing up your data can save you hours of work, and most importantly, a headache.
    • Not accounting for data cleaning in your deadlines/process: All good things take time, and that includes data cleaning. It is important to keep that in mind when you plan your process and set your deadlines. Setting aside time for data cleaning helps you give stakeholders more accurate ETAs and know when to request an adjusted ETA.

    Additional resources

    Refer to these "top ten" lists for data cleaning in Microsoft Excel and Google Sheets to help you avoid the most common mistakes:

    Example:

    What you will need

    To get started, first access the data spreadsheet.

    To use the spreadsheet for this course item, click the link below and select “Use Template.”

    Link to data spreadsheet: Cleaning with spreadsheets

    OR

    If you don’t have a Google account, you can download the template directly from the attachment below.

    Data Spreadsheet for Cleaning with Spreadsheets
    XLSX

    Select and remove blank cells

    The first technique we’ll use is to select and eliminate rows containing blank cells by using filters. To eliminate rows with blank cells:

    1. Highlight all cells in the spreadsheet. You can highlight Columns A-H by clicking on the header of Column A, holding Shift, and clicking on the header of Column H.

    2. Click on the Data tab and pick the Create a filter option. In Microsoft Excel, this is called Filter.


    3. Every column now shows a green triangle in the first row next to the column title. Click the green triangle in Column B to access a new menu.

    4. On that new menu, click Filter by condition and open the dropdown menu to select Is empty. Click OK.

    In Excel, click the dropdown, then Filter... then make sure only (Blanks) is checked. Click OK.


    You can then review a list of all the rows with blank cells in that column.

    5. Select all these cells and delete the rows except the row of column headers.

    6. Return to the Filter by condition and return it to None. In Excel, click Clear Filter from ‘Column’.

    • Note: You will now notice that any row that had an empty cell in Column A will be removed (including the extra empty rows after the data).

    7. Repeat this for Columns B-H.

    All the rows that had blank cells are now removed from the spreadsheet.
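
    If you want to confirm that no blank cells remain, one option is a quick COUNTBLANK check in an empty cell. This is a minimal sketch that assumes your remaining data occupies A1:H45 (adjust the range to match your actual data):

    =COUNTBLANK(A1:H45)

    A result of 0 confirms there are no blank cells left in that range. Delete the formula afterward so it doesn’t linger in your cleaned data.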

    Transpose the data

    The second technique you will practice will help you convert the data from the current long format (more rows than columns) to the wide format (more columns than rows). This action is called transposing. To transpose your data:

    1. Highlight and copy the data that you want to transpose including the column labels. You can do this by highlighting Columns A-H. In Excel, highlight only the relevant cells (A1-H45) instead of the headers.

    2. Right-click on cell I1. This is where you want the transposed data to start.

    3. Hover over Paste Special from the right-click menu. Select the Transposed option. In Excel, select the Transpose icon under the paste options.


    You should now find the data transformed into the new wide format. At this point, you should remove the original long data from the spreadsheet.

    4. Delete the previous long data. The easiest way to do this is to click on Column A, so the entire column is highlighted. Then, hold down the Shift key and click on Column H. You should find these columns highlighted. Right-click on the highlighted area and select Delete Columns A - H.

    The spreadsheet now contains only the transposed data in the wide format.
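
    As an alternative to Paste Special, both Google Sheets and Excel also offer a TRANSPOSE function. Entered in an empty area, and assuming the original data sits in A1:H45, a sketch would be:

    =TRANSPOSE(A1:H45)

    One caution: the formula output is a live reference to the original cells, so you would need to copy the result and paste it back as values before deleting the original Columns A-H. The Paste Special approach above avoids that extra step.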

    Get rid of extra spaces in cells with string data

    Now that you have transposed the data, eliminate the extra spaces in the values of the cells.

    1. Highlight the data in the spreadsheet.

    2. Click on the Data tab, then hover over Data cleanup and select Trim whitespace.

    In Excel, you can use the TRIM function to get rid of white spaces. In any space beneath your data (such as cell A10), type =TRIM(A1). Then, drag the fill handle at the bottom-right corner of the cell across and down to apply the formula to the rest of the data, which fills in the values without the white spaces.

    Now all the extra spaces in the cells have been removed.

    Change text to lowercase, uppercase, or proper case

    Next, you’ll process string data. The easiest way to clean up string data will depend on the spreadsheet program you are using. If you are using Excel, you’ll use a simple formula. If you are using Google Sheets, you can use an Add-On to do this with a few clicks. Follow the steps in the relevant section below.

    Microsoft Excel

    If you are using Microsoft Excel, this documentation explains how to use a formula to change the case of a text string. Follow these instructions to clean the string text and then move on to the confirmation and reflection section of this activity.
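
    For reference, the underlying case formulas are short and work in both Excel and Google Sheets. Assuming the text you want to convert is in cell A1:

    =UPPER(A1) returns the text in all uppercase
    =LOWER(A1) returns the text in all lowercase
    =PROPER(A1) capitalizes the first letter of each word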

    Google Sheets

    If you’re completing this exercise using Google Sheets, you’ll need to install an add-on that gives you the functionality needed to easily clean string data and change cases.

    Google Sheets Add-on Instructions:

    1. Click on the Add-Ons option at the top of Google Sheets.
    2. Click on Get add-ons.
    3. Search for ChangeCase.

    4. Click on Install to install the add-on. It may ask you to login or verify the installation permissions.

    Once you have installed the add-on successfully, you can access it by clicking on the Add-ons menu again.

    Now, you can change the case of text data that shows up. To change the text in Column C to all uppercase:

    1. Click on Column C. Be sure to deselect the column header, unless you want to change the case of that as well (which you don't).

    2. Click on the Add-Ons tab and select ChangeCase. Select the option All uppercase. Notice the other options that you could have chosen if needed.

    Delete all formatting

    If you want to clear the formatting for any or all cells, you can find the command in the Format tab. To clear formatting:

    1. Select the data for which you want to delete the formatting. In this case, highlight all the data in the spreadsheet by clicking and dragging over Rows 1-8.

    2. Click the Format tab and select the Clear Formatting option.

    In Excel, go to the Home tab, then hover over Clear and select Clear Formats.

    You will notice that all the cells have had their formatting removed.

    Confirmation and reflection

