SQL : PostgreSQL

  Aggregate Functions Like most other relational database products,  PostgreSQL  supports  aggregate functions . An aggregate function computes a single result from multiple input rows. For example, there are aggregates to compute the  count ,  sum ,  avg  (average),  max  (maximum) and  min  (minimum) over a set of rows. As an example, we can find the highest low-temperature reading anywhere with: SELECT max(temp_lo) FROM weather; max ----- 46 (1 row) If we wanted to know what city (or cities) that reading occurred in, we might try: SELECT city FROM weather WHERE temp_lo = max(temp_lo); WRONG but this will not work since the aggregate  max  cannot be used in the  WHERE  clause. (This restriction exists because the  WHERE  clause determines which rows will be included in the aggregate calculation; so obviously it has to be evaluated before aggregate functions are computed.) However, as is o...

Calculating sample size

 


Calculating sample size

Before you dig deeper into sample size, familiarize yourself with these terms and definitions:

Terminology

Definitions

Population 

The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.

Sample 

A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population.

Margin of error

Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population. 

Confidence level

How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study. 

Confidence interval

The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error.

Statistical significance

The determination of whether your result could be due to random chance or not. The greater the significance, the less due to chance.

Things to remember when determining the size of your sample

When figuring out a sample size, here are things to keep in mind:

  • Don’t use a sample size less than 30. It has been statistically proven that 30 is the smallest sample size where an average result of a sample starts to represent the average result of a population.

  • The confidence level most commonly used is 95%, but 90% can work in some cases. 

Increase the sample size to meet specific needs of your project:

  • For a higher confidence level, use a larger sample size

  • To decrease the margin of error, use a larger sample size

  • For greater statistical significance, use a larger sample size

Note: Sample size calculators use statistical formulas to determine a sample size. More about these are coming up in the course! Stay tuned.

Why a minimum sample of 30?

This recommendation is based on the Central Limit Theorem (CLT) in the field of probability and statistics. As sample size increases, the results more closely resemble the normal (bell-shaped) distribution from a large number of samples. A sample of 30 is the smallest sample size for which the CLT is still valid. Researchers who rely on regression analysis statistical methods to determine the relationships between controlled and dependent variables – also prefer a minimum sample of 30.

Still curious? Without getting too much into the math, check out these articles:

  • Central Limit Theorem (CLT): This article by Investopedia explains the Central Limit Theorem and briefly describes how it can apply to an analysis of a stock index.

  • Sample Size Formula: This article by Statistics Solutions provides a little more detail about why some researchers use 30 as a minimum sample size.

Sample sizes vary by business problem

Sample size will vary based on the type of business problem you are trying to solve. 

For example, if you live in a city with a population of 200,000 and get 180,000 people to respond to a survey, that is a large sample size. But without actually doing that, what would an acceptable, smaller sample size look like? 

Would 200 be alright if the people surveyed represented every district in the city?

Answer: It depends on the stakes. 

  • A sample size of 200 might be large enough if your business problem is to find out how residents felt about the new library

  • A sample size of 200 might not be large enough if your business problem is to determine how residents would vote to fund the library

You could probably accept a larger margin of error surveying how residents feel about the new library versus surveying residents about how they would vote to fund it. For that reason, you would most likely use a larger sample size for the voter survey.

An icon of a person scratching their head with a question mark floating above them

Larger sample sizes have a higher cost

You also have to weigh the cost against the benefits of more accurate results with a larger sample size. Someone who is trying to understand consumer preferences for a new line of products wouldn’t need as large a sample size as someone who is trying to understand the effects of a new drug. For drug safety, the benefits outweigh the cost of using a larger sample size. But for consumer preferences, a smaller sample size at a lower cost could provide good enough results. 

An icon of a magnifying glass with a dollar sign

Knowing the basics is helpful

Knowing the basics will help you make the right choices when it comes to sample size. You can always raise concerns if you come across a sample size that is too small. A sample size calculator is also a great tool for this. Sample size calculators let you enter a desired confidence level and margin of error for a given population size. They then calculate the sample size needed to statistically achieve those results. 

Refer to the Determine the Best Sample Size video for a demonstration of a sample size calculator, or refer to the Sample Size Calculator reading for additional information.

An icon of a check list with the third box unchecked

Sample size calculator

In this reading, you will learn the basics of sample size calculators, how to use them, and how to understand the results. A sample size calculator tells you how many people you need to interview (or things you need to test) to get results that represent the target population. Let’s review some terms you will come across when using a sample size calculator:

  • Confidence level: The probability that your sample size accurately reflects the greater population.

  • Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population.

  • Population: This is the total number you hope to pull your sample from.

  • Sample: A part of a population that is representative of the population.

  • Estimated response rate: If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received the survey.

How to use a sample size calculator

In order to use a sample size calculator, you need to have the population size, confidence level, and the acceptable margin of error already decided so you can input them into the tool. If this information is ready to go, check out these sample size calculators below:

What to do with the results

After you have plugged your information into one of these calculators, it will give you a recommended sample size. Keep in mind, the calculated sample size is the minimum number to achieve what you input for confidence level and margin of error. If you are working with a survey, you will also need to think about the estimated response rate to figure out how many surveys you will need to send out. For example, if you need a sample size of 100 individuals and your estimated response rate is 10%, you will need to send your survey to 1,000 individuals to get the 100 responses you need for your analysis.

Now that you have the basics, try some calculations using the sample size calculators and refer back to this reading if you need a refresher on the definitions. 


What to do when there is no data

Earlier, you learned how you can still do an analysis using proxy data if you have no data. You might have some questions about proxy data, so this reading will give you a few more examples of the types of datasets that can serve as alternate data sources.

Proxy data examples

Sometimes the data to support a business objective isn’t readily available. This is when proxy data is useful. Take a look at the following scenarios and where proxy data comes in for each example:

Business scenario

How proxy data can be used

A new car model was just launched a few days ago and the auto dealership can’t wait until the end of the month for sales data to come in. They want sales projections now.

The analyst proxies the number of clicks to the car specifications on the dealership’s website as an estimate of potential sales at the dealership.

A brand new plant-based meat product was only recently stocked in grocery stores and the supplier needs to estimate the demand over the next four years. 

The analyst proxies the sales data for a turkey substitute made out of tofu that has been on the market for several years.

The Chamber of Commerce wants to know how a tourism campaign is going to impact travel to their city, but the results from the campaign aren’t publicly available yet.

The analyst proxies the historical data for airline bookings to the city one to three months after a similar campaign was run six months earlier.

Open (public) datasets

If you are part of a large organization, you might have access to lots of sources of data. But if you are looking for something specific or a little outside your line of business, you can also make use of open or public datasets. (You can refer to this Towards Data Science article for a brief explanation of the difference between open and public data.)

Here's an example. A nasal version of a vaccine was recently made available. A clinic wants to know what to expect for contraindications, but just started collecting first-party data from its patients. A contraindication is a condition that may cause a patient not to take a vaccine due to the harm it would cause them if taken. To estimate the number of possible contraindications, a data analyst proxies an open dataset from a trial of the injection version of the vaccine. The analyst selects a subset of the data with patient profiles most closely matching the makeup of the patients at the clinic. 

There are plenty of ways to share and collaborate on data within a community. Kaggle (kaggle.com) which we previously introduced, has datasets in a variety of formats including the most basic type, Comma Separated Values (CSV) files.  

An image of a magnifying glass and an image of a puzzle piece being added to a puzzle

CSV, J​SON, SQLite, and BigQuery datasets

  • CSV: Check out this Credit card customers dataset, which has information from 10,000 customers including age, salary, marital status, credit card limit, credit card category, etc. (CC0: Public Domain, Sakshi Goyal).

  • JSON: Check out this JSON dataset for trending YouTube videos (CC0: Public Domain, Mitchell J).

  • SQLite: Check out this SQLite dataset for 24 years worth of U.S. wildfire data (CC0: Public Domain, Rachael Tatman).

  • BigQuery: Check out this Google Analytics 360 sample dataset from the Google Merchandise Store (CC0 Public Domain, Google BigQuery).

Refer to the Kaggle documentation for datasets for more information and search for and explore datasets on your own at kaggle.com/datasets.

As with all other kinds of datasets, be on the lookout for duplicate data and ‘Null’ in open datasets. Null most often means that a data field was unassigned (left empty), but sometimes Null can be interpreted as the value, 0. It is important to understand how Null was used before you start analyzing a dataset with Null data.



Comments

Popular posts from this blog

SQL : PostgreSQL

Using BigQuery / MySQL / other SQL

The R-versus-Python debate