
Process Data from Dirty to Clean



Here are some other things to watch out for:

  • Data replication compromising data integrity: Continuing with the example, imagine you ask your international counterparts to verify dates and stick to one format. One analyst copies a large dataset to check the dates. But because of memory issues, only part of the dataset is actually copied. The analyst would be verifying and standardizing incomplete data. That partial dataset would be certified as compliant but the full dataset would still contain dates that weren't verified. Two versions of a dataset can introduce inconsistent results. A final audit of results would be essential to reveal what happened and correct all dates.

  • Data transfer compromising data integrity: Another analyst checks the dates in a spreadsheet and chooses to import the validated and standardized data back to the database. But suppose the date field from the spreadsheet was incorrectly classified as a text field during the data import (transfer) process. Now some of the dates in the database are stored as text strings. At this point, the data needs to be cleaned to restore its integrity. 

  • Data manipulation compromising data integrity: When checking dates, another analyst notices what appears to be a duplicate record in the database and removes it. But it turns out that the analyst removed a unique record for a company’s subsidiary and not a duplicate record for the company. Your dataset is now missing data and the data must be restored for completeness.

Reference: Data constraints and examples

As you progress in your data journey, you'll come across many types of data constraints (or criteria that determine validity). Below are definitions and examples of data constraint terms you might come across.

Data type
Definition: Values must be of a certain type: date, number, percentage, Boolean, etc.
Example: If the data type is a date, a single number like 30 would fail the constraint and be invalid.

Data range
Definition: Values must fall between predefined maximum and minimum values.
Example: If the data range is 10-20, a value of 30 would fail the constraint and be invalid.

Mandatory
Definition: Values can’t be left blank or empty.
Example: If age is mandatory, that value must be filled in.

Unique
Definition: Values can’t have a duplicate.
Example: Two people can’t have the same mobile phone number within the same service area.

Regular expression (regex) patterns
Definition: Values must match a prescribed pattern.
Example: A phone number must match ###-###-#### (no other characters allowed).

Cross-field validation
Definition: Certain conditions for multiple fields must be satisfied.
Example: Values are percentages, and values from multiple fields must add up to 100%.

Primary-key
Definition: (Databases only) Value must be unique per column.
Example: A database table can’t have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program.

Set-membership
Definition: (Databases only) Values for a column must come from a set of discrete values.
Example: The value for a column must be set to Yes, No, or Not Applicable.

Foreign-key
Definition: (Databases only) Values for a column must match values that appear in a column of another table.
Example: In a U.S. taxpayer database, the State column must be a valid state or territory, with the set of acceptable values defined in a separate States table.

Accuracy
Definition: The degree to which the data conforms to the actual entity being measured or described.
Example: If values for zip codes are validated by street location, the accuracy of the data goes up.

Completeness
Definition: The degree to which the data contains all desired components or measures.
Example: If data for personal profiles required hair and eye color, and both are collected, the data is complete.

Consistency
Definition: The degree to which the data is repeatable from different points of entry or collection.
Example: If a customer has the same address in the sales and repair databases, the data is consistent.
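
A few of these constraints can be spot-checked directly in a spreadsheet. As a minimal sketch, assuming the value to test is in cell A2 (and noting that REGEXMATCH is available in Google Sheets but not in Excel):

=AND(A2 >= 10, A2 <= 20) returns TRUE only when A2 satisfies the 10-20 data range constraint
=REGEXMATCH(TO_TEXT(A2), "^[0-9]{3}-[0-9]{3}-[0-9]{4}$") returns TRUE only when A2 matches the ###-###-#### phone number pattern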

Conclusion

Fortunately, with a standard date format and compliance by all people and systems that work with the data, data integrity can be maintained. But no matter where your data comes from, always be sure to check that it is valid, complete, and clean before you begin any analysis. 

Well-aligned objectives and data

You can gain powerful insights and make accurate conclusions when data is well-aligned to business objectives. As a data analyst, alignment is something you will need to judge. Good alignment means that the data is relevant and can help you solve a business problem or determine a course of action to achieve a given business objective.

In this reading, you will review the business objectives associated with three scenarios. You will explore how clean data and well-aligned business objectives can help you come up with accurate conclusions. On top of that, you will learn how new variables discovered during data analysis can cause you to set up data constraints so you can keep the data aligned to a business objective.  

Clean data + alignment to business objective = accurate conclusions

Business objective

Account managers at Impress Me, an online content subscription service, want to know how soon users view content after their subscriptions are activated. 


To start off, the data analyst verifies that the data exported to spreadsheets is clean and confirms that the data needed (when users access content) is available. Knowing this, the analyst decides there is good alignment of the data to the business objective. All that is missing is figuring out exactly how long it takes each user to view content after their subscription has been activated.

Here are the data processing steps the analyst takes for a user from an account called V&L Consulting. (These steps would be repeated for each subscribing account, and for each user associated with that account.)

Step 1
Data-processing step: Look up the activation date for V&L Consulting.
Source of data: Account spreadsheet.
Relevant data in spreadsheet: [Screenshot of two columns, Account Name and Activation Date, with the date 10/21/2019 highlighted]
Result: October 21, 2019

Step 2
Data-processing step: Look up the name of a user belonging to the V&L Consulting account.
Source of data: Account spreadsheet (users tab).
Relevant data in spreadsheet: [Screenshot of two columns, Account Name and Users, with the name Maria Ballantyne highlighted]
Result: Maria Ballantyne

Step 3
Data-processing step: Find the first content access date for Maria B.
Source of data: Content usage spreadsheet.
Relevant data in spreadsheet: [Screenshot of two columns, Users and Access Dates, with the date 10/31/2019 highlighted]
Result: October 31, 2019

Step 4
Data-processing step: Calculate the time between activation and first content usage for Maria B.
Source of data: New spreadsheet calculation.
Relevant data in spreadsheet: [Screenshot of a spreadsheet with columns for Account, Users, Activation Date, 1st Access Date, and Number of Days, with the number 10 highlighted]
Result: 10 days

Pro tip 1

In the above process, the analyst could use VLOOKUP to look up the data in Steps 1, 2, and 3 to populate the values in the spreadsheet in Step 4. VLOOKUP is a spreadsheet function that searches for a certain value in a column to return a related piece of information. Using VLOOKUP can save a lot of time; without it, you have to look up dates and names manually.

Refer to the VLOOKUP page in the Google Help Center for how to use the function in Google Sheets.
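
As a sketch of how this might look, suppose the account name is typed in cell A2 and a separate Accounts tab holds account names in column A and activation dates in column B (the tab and cell references here are illustrative, not part of the original exercise):

=VLOOKUP(A2, Accounts!A:B, 2, FALSE)

The 2 asks for the value from the second column of the range (the activation date), and FALSE requests an exact match on the account name.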

Pro tip 2

In Step 4 of the above process, the analyst could use the DATEDIF function to automatically calculate the difference between the dates in column C and column D. The function can calculate the number of days between two dates. 

Refer to the Microsoft Support DATEDIF page for how to use the function in Excel. The DAYS360 function does the same thing in accounting spreadsheets that use a 360-day year (twelve 30-day months).

Refer to the DATEDIF page in the Google Help Center for how to use the function in Google Sheets.
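
For example, with the activation date in cell C2 and the first access date in cell D2, the number of days between them could be calculated with:

=DATEDIF(C2, D2, "D")

The "D" unit returns the difference in whole days; DATEDIF expects the earlier date as its first argument.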

Alignment to business objective + additional data cleaning = accurate conclusions 

Business objective

Cloud Gate, a software company, recently hosted a series of public webinars as free product introductions. The data analyst and webinar program manager want to identify companies that had five or more people attend these sessions. They want to give this list of companies to sales managers who can follow up for potential sales.  


The webinar attendance data includes the fields and data shown below.

  • Name: <First name> <Last name> (required information attendees had to submit)

  • Email Address: xxxxx@company.com (required information attendees had to submit)

  • Company: <Company name> (optional information attendees could provide)

Data cleaning

The webinar attendance data seems to align with the business objective. But the data analyst and program manager decide that some data cleaning is needed before the analysis. They think data cleaning is required because:

  • The company name wasn’t a mandatory field. If the company name is blank, it might be found from the email address. For example, if the email address is username@google.com, the company field could be filled in with Google for the data analysis. This data cleaning step assumes that people with company-assigned email addresses attended a webinar for business purposes. (A formula sketch for this step appears after this list.)

  • Attendees could enter any name. Since attendance across a series of webinars is being looked at, they need to validate names against unique email addresses. For example, if Joe Cox attended two webinars but signed in as Joe Cox for one and Joseph Cox for the other, he would be counted as two different people. To prevent this, they need to check his unique email address to determine that he was the same person. After the validation, Joseph Cox could be changed to Joe Cox to match the other instance.
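
As a rough sketch of the first cleaning step, suppose the email addresses are in column B and column C is used as a helper for the extracted domain (these cell references are illustrative):

=MID(B2, FIND("@", B2) + 1, LEN(B2)) placed in C2, extracts everything after the @ (for example, google.com)
=PROPER(LEFT(C2, FIND(".", C2) - 1)) placed in D2, keeps the part before the first dot and capitalizes it (Google)

This only handles simple domains; addresses from personal email providers or multi-part domains like name@mail.example.co.uk would still need manual review.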

Alignment to business objective + newly discovered variables + constraints = accurate conclusions 

Business objective

An after-school tutoring company, A+ Education, wants to know if there is a minimum number of tutoring hours needed before students have at least a 10% improvement in their assessment scores.


The data analyst thinks there is good alignment between the data available and the business objective because:

  • Students log in and out of a system for each tutoring session, and the number of hours is tracked

  • Assessment scores are regularly recorded  

Data constraints for new variables

After looking at the data, the data analyst discovers that there are other variables to consider. Some students had consistent weekly sessions while other students had scheduled sessions more randomly even though their total number of tutoring hours was the same. The data doesn’t align as well with the original business objective as first thought, so the analyst adds a data constraint to focus only on the students with consistent weekly sessions. This modification helps to get a more accurate picture about the enrollment time needed to achieve a 10% improvement in assessment scores. 

Duplicate data
Description: Any data record that shows up more than once.
Possible causes: Manual data entry, batch data imports, or data migration.
Potential harm to businesses: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval.

Outdated data
Description: Any data that is old and should be replaced with newer, more accurate information.
Possible causes: People changing roles or companies, or software and systems becoming obsolete.
Potential harm to businesses: Inaccurate insights, decision-making, and analytics.

Incomplete data
Description: Any data that is missing important fields.
Possible causes: Improper data collection or incorrect data entry.
Potential harm to businesses: Decreased productivity, inaccurate insights, or inability to complete essential services.

Incorrect/inaccurate data
Description: Any data that is complete but inaccurate.
Possible causes: Human error during data input, fake information, or mock data.
Potential harm to businesses: Inaccurate insights or decision-making based on bad information, resulting in revenue loss.

Inconsistent data
Description: Any data that uses different formats to represent the same thing.
Possible causes: Data stored incorrectly or errors inserted during data transfer.
Potential harm to businesses: Contradictory data points leading to confusion, or an inability to classify or segment customers.
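
In a spreadsheet, one quick way to surface the first of these problems, duplicate data, is a helper column with COUNTIF. A minimal sketch, assuming the values being checked for duplicates are in column A:

=COUNTIF(A:A, A2) > 1 returns TRUE whenever the value in A2 appears more than once in column A

Copying the formula down the helper column flags every duplicated row so you can review it before deleting anything.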

Business impact of dirty data

For further reading on the business impact of dirty data, enter the term “dirty data” into your preferred browser’s search bar to bring up numerous articles on the topic. Here are a few impacts cited for certain industries from a previous search:

  • Banking: Inaccuracies cost companies between 15% and 25% of revenue (source). 
  • Digital commerce: Up to 25% of B2B database contacts contain inaccuracies (source).
  • Marketing and sales: 8 out of 10 companies have said that dirty data hinders sales campaigns (source). 
  • Healthcare: Duplicate records can make up 10%, and even up to 20%, of a hospital’s electronic health records (source).


Key takeaways

Hopefully these examples give you a sense of what to look for to know if your data aligns with your business objective. 

  • When there is clean data and good alignment, you can get accurate insights and make conclusions the data supports.

  • If there is good alignment but the data needs to be cleaned, clean the data before you perform your analysis. 

  • If the data only partially aligns with an objective, think about how you could modify the objective, or use data constraints to make sure that the subset of data better aligns with the business objective.



    Common mistakes to avoid

    • Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most of the time the wrong spelling or common grammatical errors can be detected, but it gets harder with things like names or addresses. For example, if you are working with a spreadsheet table of customer data, you might come across a customer named “John” whose name has been input incorrectly as “Jon” in some places. The spreadsheet’s spellcheck probably won’t flag this, so if you don’t double-check for spelling errors and catch this, your analysis will have mistakes in it. 
    • Forgetting to document errors: Documenting your errors can be a big time saver, as it helps you avoid those errors in the future by showing you how you resolved them. For example, you might find an error in a formula in your spreadsheet. You discover that some of the dates in one of your columns haven’t been formatted correctly. If you make a note of this fix, you can reference it the next time your formula is broken, and get a head start on troubleshooting. Documenting your errors also helps you keep track of changes in your work, so that you can backtrack if a fix didn’t work. 
    • Not checking for misfielded values: A misfielded value happens when the values are entered into the wrong field. These values might still be formatted correctly, which makes them harder to catch if you aren’t careful. For example, you might have a dataset with columns for cities and countries. These are the same type of data, so they are easy to mix up. But if you were trying to find all of the instances of Spain in the country column, and Spain had mistakenly been entered into the city column, you would miss key data points. Making sure your data has been entered correctly is key to accurate, complete analysis. 
    • Overlooking missing values: Missing values in your dataset can create errors and give you inaccurate conclusions. For example, if you were trying to get the total number of sales from the last three months, but a week of transactions was missing, your calculations would be inaccurate. As a best practice, try to keep your data as clean as possible by maintaining completeness and consistency.
    • Only looking at a subset of the data: It is important to think about all of the relevant data when you are cleaning. This helps make sure you understand the whole story the data is telling, and that you are paying attention to all possible errors. For example, if you are working with data about bird migration patterns from different sources, but you only clean one source, you might not realize that some of the data is being repeated. This will cause problems in your analysis later on. If you want to avoid common errors like duplicates, each field of your data requires equal attention.
    • Losing track of business objectives: When you are cleaning data, you might make new and interesting discoveries about your dataset, but you don’t want those discoveries to distract you from the task at hand. For example, if you were working with weather data to find the average number of rainy days in your city, you might notice some interesting patterns about snowfall, too. That is really interesting, but it isn’t related to the question you are trying to answer right now. Being curious is great! But try not to let it sidetrack your analysis.
    • Not fixing the source of the error: Fixing the error itself is important. But if that error is actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will have to keep fixing that same error over and over again. For example, imagine you have a team spreadsheet that tracks everyone’s progress. The table keeps breaking because different people are entering different values. You can keep fixing all of these problems one by one, or you can set up your table to streamline data entry so everyone is on the same page. Addressing the source of the errors in your data will save you a lot of time in the long run. 
    • Not analyzing the system prior to data cleaning: If you want to clean your data and avoid future errors, you need to understand the root cause of your dirty data. Imagine you are an auto mechanic. You would find the cause of the problem before you started fixing the car, right? The same goes for data. First, figure out where the errors come from. Maybe they come from data entry mistakes, the lack of a spell check, inconsistent formats, or duplicates. Once you understand where bad data comes from, you can control it and keep your data clean.
    • Not backing up your data prior to data cleaning: It is always good to be proactive and create a backup of your data before you start your data clean-up. If your program crashes, or if your changes cause a problem in your dataset, you can always go back to the saved version and restore it. The simple procedure of backing up your data can save you hours of work, and most importantly, a headache.
    • Not accounting for data cleaning in your deadlines/process: All good things take time, and that includes data cleaning. It is important to keep that in mind when you plan your process and set your deadlines. Setting aside time for data cleaning helps you give stakeholders more accurate ETAs and know when to request an adjusted ETA.

    Additional resources

    Refer to these "top ten" lists for data cleaning in Microsoft Excel and Google Sheets to help you avoid the most common mistakes:

    Example:

    What you will need

    To get started, first access the data spreadsheet.

    To use the spreadsheet for this course item, click the link below and select “Use Template.”

    Link to data spreadsheet: Cleaning with spreadsheets

    OR

    If you don’t have a Google account, you can download the template directly from the attachment below.

    Data Spreadsheet for Cleaning with Spreadsheets
    XLSX

    Select and remove blank cells

    The first technique we’ll use is to select and eliminate rows containing blank cells by using filters. To eliminate rows with blank cells:

    1. Highlight all cells in the spreadsheet. You can highlight Columns A-H by clicking on the header of Column A, holding Shift, and clicking on the header of Column H.

    2. Click on the Data tab and pick the Create a filter option. In Microsoft Excel, this is called Filter.


    3. Every column now shows a green triangle in the first row next to the column title. Click the green triangle in Column B to access a new menu.

    4. On that new menu, click Filter by condition and open the dropdown menu to select Is empty. Click OK.

    In Excel, click the dropdown, then Filter... then make sure only (Blanks) is checked. Click OK.


    You can then review a list of all the rows with blank cells in that column.

    5. Select all these cells and delete the rows except the row of column headers.

    6. Return to the Filter by condition and return it to None. In Excel, click Clear Filter from ‘Column’.

    • Note: You will now notice that any row that had an empty cell in Column A will be removed (including the extra empty rows after the data).

    7. Repeat this for Columns B-H.

    All the rows that had blank cells are now removed from the spreadsheet.
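
    If you want to confirm that no blank cells remain, one option is a quick COUNTBLANK check in an empty cell. This is a minimal sketch that assumes your remaining data occupies A1:H45 (adjust the range to match your actual data):

    =COUNTBLANK(A1:H45)

    A result of 0 confirms there are no blank cells left in that range. Delete the formula afterward so it doesn’t linger in your cleaned data.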

    Transpose the data

    The second technique you will practice will help you convert the data from the current long format (more rows than columns) to the wide format (more columns than rows). This action is called transposing. To transpose your data:

    1. Highlight and copy the data that you want to transpose including the column labels. You can do this by highlighting Columns A-H. In Excel, highlight only the relevant cells (A1-H45) instead of the headers.

    2. Right-click on cell I1. This is where you want the transposed data to start.

    3. Hover over Paste Special from the right-click menu. Select the Transposed option. In Excel, select the Transpose icon under the paste options.


    You should now find the data transformed into the new wide format. At this point, you should remove the original long data from the spreadsheet.

    4. Delete the previous long data. The easiest way to do this is to click on Column A, so the entire column is highlighted. Then, hold down the Shift key and click on Column H. You should find these columns highlighted. Right-click on the highlighted area and select Delete Columns A - H.

    The spreadsheet now contains only the transposed data in the wide format.
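
    As an alternative to Paste Special, both Google Sheets and Excel also offer a TRANSPOSE function. Entered in an empty area, and assuming the original data sits in A1:H45, a sketch would be:

    =TRANSPOSE(A1:H45)

    One caution: the formula output is a live reference to the original cells, so you would need to copy the result and paste it back as values before deleting the original Columns A-H. The Paste Special approach above avoids that extra step.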

    Get rid of extra spaces in cells with string data

    Now that you have transposed the data, eliminate the extra spaces in the values of the cells.

    1. Highlight the data in the spreadsheet.

    2. Click on the Data tab, then hover over Data cleanup and select Trim whitespace.

    In Excel, you can use the TRIM function to get rid of white spaces. In any space beneath your data (such as cell A10), type =TRIM(A1). Then, drag the fill handle at the bottom-right corner of the cell across and down to apply the formula to the rest of the data, which fills in the values without the white spaces.

    Now all the extra spaces in the cells have been removed.

    Change text to lowercase, uppercase, or proper case

    Next, you’ll process string data. The easiest way to clean up string data will depend on the spreadsheet program you are using. If you are using Excel, you’ll use a simple formula. If you are using Google Sheets, you can use an Add-On to do this with a few clicks. Follow the steps in the relevant section below.

    Microsoft Excel

    If you are using Microsoft Excel, this documentation explains how to use a formula to change the case of a text string. Follow these instructions to clean the string text and then move on to the confirmation and reflection section of this activity.
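
    For reference, the underlying case formulas are short and work in both Excel and Google Sheets. Assuming the text you want to convert is in cell A1:

    =UPPER(A1) returns the text in all uppercase
    =LOWER(A1) returns the text in all lowercase
    =PROPER(A1) capitalizes the first letter of each word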

    Google Sheets

    If you’re completing this exercise using Google Sheets, you’ll need to install an add-on that gives you the functionality needed to easily clean string data and change cases.

    Google Sheets Add-on Instructions:

    1. Click on the Add-Ons option at the top of Google Sheets.
    2. Click on Get add-ons.
    3. Search for ChangeCase.

    4. Click on Install to install the add-on. It may ask you to login or verify the installation permissions.

    Once you have installed the add-on successfully, you can access it by clicking on the Add-ons menu again.

    Now, you can change the case of text data that shows up. To change the text in Column C to all uppercase:

    1. Click on Column C. Be sure to deselect the column header, unless you want to change the case of that as well (which you don't).

    2. Click on the Add-Ons tab and select ChangeCase. Select the option All uppercase. Notice the other options that you could have chosen if needed.

    Delete all formatting

    If you want to clear the formatting for any or all cells, you can find the command in the Format tab. To clear formatting:

    1. Select the data for which you want to delete the formatting. In this case, highlight all the data in the spreadsheet by clicking and dragging over Rows 1-8.

    2. Click the Format tab and select the Clear Formatting option.

    In Excel, go to the Home tab, then hover over Clear and select Clear Formats.

    You will notice that all the cells have had their formatting removed.

    Confirmation and reflection

