SQL : PostgreSQL

Aggregate Functions

Like most other relational database products, PostgreSQL supports aggregate functions. An aggregate function computes a single result from multiple input rows. For example, there are aggregates to compute the count, sum, avg (average), max (maximum) and min (minimum) over a set of rows.

As an example, we can find the highest low-temperature reading anywhere with:

SELECT max(temp_lo) FROM weather;

 max
-----
  46
(1 row)

If we wanted to know what city (or cities) that reading occurred in, we might try:

SELECT city FROM weather WHERE temp_lo = max(temp_lo);     -- WRONG

but this will not work since the aggregate max cannot be used in the WHERE clause. (This restriction exists because the WHERE clause determines which rows will be included in the aggregate calculation; so obviously it has to be evaluated before aggregate functions are computed.) However, as is often the case, the query can be restated to accomplish the desired result, here by using a subquery:

SELECT city FROM weather
    WHERE temp_lo = (SELECT max(temp_lo) FROM weather);

     city
---------------
 San Francisco
(1 row)

This is OK because the subquery is an independent computation that computes its own aggregate separately from what is happening in the outer query.

Using BigQuery / MySQL / other SQL

 


https://azure.microsoft.com/en-us/products/azure-sql/#product-overview

https://cloud.google.com/bigquery/docs



Using BigQuery

BigQuery is a data warehouse on Google Cloud that data analysts can use to query and filter large datasets, aggregate results, and perform complex operations.

An upcoming activity is performed in BigQuery. This reading provides instructions to create your own BigQuery account, select public datasets, and upload CSV files. At the end of this reading, you can confirm your access to the BigQuery console before you move on to the activity.

Note: Additional getting started resources for a few other SQL database platforms are also provided at the end of this reading if you choose to work with them instead of BigQuery.

Types of BigQuery accounts

There are two different types of accounts: sandbox and free trial. A sandbox account allows you to practice queries and explore public datasets for free, but has additional restrictions on top of the standard quotas and limits. If you prefer to use BigQuery with the standard limits, you can set up a free trial account instead. More details:

  • A free sandbox account doesn’t ask for a method of payment. It does, however, limit you to 12 projects. It also doesn't allow you to insert new records into a database or update the field values of existing records; these data manipulation language (DML) operations aren't supported in the sandbox.

  • A free trial account requires a method of payment to establish a billable account, but offers full functionality during the trial period.

With either type of account, you can upgrade to a paid account at any time and retain all of your existing projects. If you set up a free trial account but choose not to upgrade to a paid account when your trial period ends, you can still set up a free sandbox account at that time. However, projects from your trial account won't transfer to your sandbox account. It would be like starting from scratch again.

Set up a free sandbox account for use in this program

Set up a free trial account instead (if you prefer)

If you prefer not to have the sandbox limitations in BigQuery, you can set up a free trial account for use in this program.

  • Follow these step-by-step instructions or watch the video, Setting up BigQuery, including sandbox and billing options. The free trial offers $300 in credit over the next 90 days. You won’t get anywhere near that spending limit if you just use the BigQuery console to practice SQL queries. After you spend the $300 credit (or after 90 days), your free trial will expire, and you will need to explicitly choose to upgrade to a paid account to keep using Google Cloud Platform services, including BigQuery. Your method of payment will never be automatically charged after your free trial ends. Only if you choose to upgrade your account will you begin to be billed.

  • After you set up your account, you will see My First Project in the banner. Above the banner, your account status shows your credit balance and the number of days remaining in your trial period.

How to get to the BigQuery console

In your browser, go to console.cloud.google.com/bigquery.

Note: Going to console.cloud.google.com in your browser takes you to the main dashboard for the Google Cloud Platform. To navigate to BigQuery from the dashboard, do the following:

  • Click the Navigation menu icon (Hamburger icon) in the banner.

  • Scroll down to the BIG DATA section.

  • Click BigQuery and select SQL workspace.

Watch the How to use BigQuery video for an introduction to each part of the BigQuery SQL workspace.

(Optional) Explore a BigQuery public dataset 

You will be exploring a public dataset in an upcoming activity, so you can perform these steps later if you prefer.

(Optional) Upload a CSV file to BigQuery

These steps are provided so you can work with a dataset on your own at this time. You will upload CSV files to BigQuery later in the program.

Getting started with other databases (if not using BigQuery)

It is easier to follow along with the course activities if you use BigQuery, but if you are connecting to and practicing SQL queries on another database platform instead, here are similar getting started resources:

In-depth guide: SQL best practices

You can save this reading for future reference. Feel free to download a PDF version of this reading below:

DAC3 In-depth guide_ SQL best practices.pdf

These best practices include guidelines for writing SQL queries, developing documentation, and examples that demonstrate these practices. This is a great resource to have handy when you are using SQL yourself; you can just go straight to the relevant section to review these practices. Think of it like a SQL field guide!

Capitalization and case sensitivity

With SQL, capitalization usually doesn’t matter. You could write SELECT or select or SeLeCT. They all work! But if you use capitalization as part of a consistent style, your queries will look more professional.

To write SQL queries like a pro, it is always a good idea to use all caps for clause starters (e.g., SELECT, FROM, WHERE, etc.). Functions should also be in all caps (e.g., SUM()). Column names should be all lowercase (refer to the section on snake_case later in this guide). Table names should be in CamelCase (refer to the section on CamelCase later in this guide). This helps keep your queries consistent and easier to read while not impacting the data that will be pulled when you run them. The only time that capitalization does matter is when it is inside quotes (more on quotes below).
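Putting these conventions together, a query might look like this (the TicketsByOccasion table and its columns are illustrative):

SELECT
    SUM(tickets) AS total_tickets
FROM
    TicketsByOccasion
WHERE
    occasion = 'birthday'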

Vendors of SQL databases may use slightly different variations of SQL. These variations are called SQL dialects. Some SQL dialects are case sensitive for string comparisons. BigQuery is one of them. Vertica is another. But many, like MySQL and SQL Server (with their default settings), aren’t case sensitive. This means if you searched for country_code = 'us', they would return all entries that have 'us', 'uS', 'Us', and 'US'. This isn’t the case with BigQuery. BigQuery is case sensitive, so that same search would only return entries where the country_code is exactly 'us'. If the country_code is 'US', BigQuery wouldn’t return those entries as part of your result.
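For example, consider this search (the customers table and its columns are illustrative):

-- In a case-insensitive dialect this matches 'us', 'uS', 'Us', and 'US';
-- in BigQuery it matches only rows where country_code is exactly 'us'
SELECT
    customer_id
FROM
    customers
WHERE
    country_code = 'us'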

Single or double quotes: '' or " "

For the most part, it also doesn’t matter if you use single quotes ' ' or double quotes " " when referring to strings. For example, SELECT is a clause starter. If you put SELECT in quotes like 'SELECT' or "SELECT", then SQL will treat it as a text string. Your query will return an error because your query needs a SELECT clause.

But there are two situations where it does matter what kind of quotes you use:

  1. When you want strings to be identifiable in any SQL dialect 

  2. When your string contains an apostrophe or quotation marks

Within each SQL dialect there are rules for what is accepted and what isn’t. But a general rule across almost all SQL dialects is to use single quotes for strings. This helps get rid of a lot of confusion. So if we want to reference the country US in a WHERE clause (e.g., country_code = 'US'), then use single quotes around the string 'US'.
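For instance, a minimal sketch using a hypothetical Customers table:

SELECT
    customer_name
FROM
    Customers
WHERE
    country_code = 'US'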

The second situation is when your string has quotes inside it. Suppose you have a table called FavoriteFoods with one column for each friend and another column for their favorite food:

Friend             Favorite_food
Rachel DeSantos    Shepherd's pie
Sujin Lee          Tacos
Najil Okoro        Spanish paella

You might notice how Rachel’s favorite food contains an apostrophe. If you were to use single quotes in a WHERE clause to find the friend who has this favorite food, it would look like this:

SELECT Friend FROM FavoriteFoods WHERE Favorite_food = 'Shepherd's pie'

This won’t work. If you run this query, you will get an error in return. This is because SQL recognizes a text string as something that starts with a quote (') and ends with another quote ('). So in the bad query above, SQL thinks that the Favorite_food you are looking for is just 'Shepherd', because the apostrophe in Shepherd's ends the string.

Generally speaking, this should be the only time you would use double quotes instead of single quotes. So your query would look like this instead:

SELECT Friend FROM FavoriteFoods WHERE Favorite_food = "Shepherd's pie"

SQL understands text strings as starting with either a single quote ' or a double quote ". Since this string starts with a double quote, SQL will expect another double quote to signal the end of the string. This keeps the apostrophe safe, so it will return "Shepherd's pie" and not 'Shepherd'.

Comments as reminders

As you get more comfortable with SQL, you will be able to read and understand queries at a glance. But it never hurts to have comments in the query to remind yourself of what you are trying to do. And if you share your query, it also helps others understand it. 

For example:
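A minimal sketch, reusing the purchases table that appears in the snake_case example later in this guide:

-- Calculate the total tickets sold
SELECT
    SUM(tickets) AS total_tickets
FROM
    purchases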

You can use # in place of the two dashes, --, in the query above, but keep in mind that # isn’t recognized in all SQL dialects (SQL Server and PostgreSQL, for example, don’t recognize #). So it is best to use -- and be consistent with it. When you add a comment to a query using --, the database query engine will ignore everything in the same line after --. It will continue to process the query starting on the next line.

Snake_case names for columns 

It is important to always make sure that the output of your query has easy-to-understand names. If you create a new column (say from a calculation or from concatenating new fields), the new column will receive a generic default name (e.g., f0). For example:

SELECT
    SUM(tickets),
    COUNT(tickets),
    SUM(tickets) AS total_tickets,
    COUNT(tickets) AS number_of_purchases
FROM
    purchases

Results are:

f0    f1    total_tickets    number_of_purchases
8     4     8                4

The first two columns are named f0 and f1 because they weren’t named in the query above; SQL defaults to f0, f1, f2, f3, and so on. We named the last two columns total_tickets and number_of_purchases, so those names show up in the query results. This is why it is always good to give your columns useful names, especially when using functions: after running your query, you want to be able to quickly understand your results.

On top of that, you might notice how the column names have an underscore between the words. Names should never have spaces in them. If 'total_tickets' had a space and looked like 'total tickets' then SQL would rename SUM(tickets) as just 'total'. Because of the space, SQL will use 'total' as the name and won’t understand what you mean by 'tickets'. So, spaces are bad in SQL names. Never use spaces.

The best practice is to use snake_case. This means that 'total tickets', which has a space between the two words, should be written as 'total_tickets' with an underscore instead of a space. 

CamelCase names for tables

You can also use CamelCase capitalization when naming your table. CamelCase capitalization means that you capitalize the start of each word, like a two-humped (Bactrian) camel. So the table TicketsByOccasion uses CamelCase capitalization. Note that capitalizing the first word is optional; camelCase is also used. Some people differentiate between the two styles by calling the first one PascalCase and reserving camelCase for when the first word isn’t capitalized, like a one-humped (Dromedary) camel; for example, ticketsByOccasion.

At the end of the day, CamelCase is a style choice. There are other ways you can name your tables, including: 

  • All lower or upper case, like ticketsbyoccasion or TICKETSBYOCCASION

  • With snake_case, like tickets_by_occasion

Keep in mind, the option with all lowercase or uppercase letters can make it difficult to read your table name, so it isn’t recommended for professional use.

The second option, snake_case, is technically okay. With words separated by underscores, your table name is easy to read, but it can get very long because you are adding the underscores. It also takes more time to write. If you use this table a lot, it can become a chore.

In summary, it is up to you to use snake_case or CamelCase when creating table names. Just make sure your table name is easy to read and consistent. Also be sure to find out if your company has a preferred way of naming their tables. If they do, always go with their naming convention for consistency.

Indentation

As a general rule, you want to keep the length of each line in a query to 100 characters or less; this makes your queries easy to read. For example, check out this query with a line that is more than 100 characters long:

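Here is a hypothetical example (the movie_rentals table and its columns are made up for illustration), crammed onto one line of well over 100 characters:

SELECT CASE WHEN genre = 'horror' THEN 'Will not watch' WHEN genre = 'documentary' THEN 'Will watch alone' ELSE 'Watch with others' END AS watch_category FROM movie_rentals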

This query is hard to read and just as hard to troubleshoot or edit. Now, here is a query where we stick to the <= 100 character rule:
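The same hypothetical query, with each clause on its own line:

SELECT
    CASE
        WHEN genre = 'horror' THEN 'Will not watch'
        WHEN genre = 'documentary' THEN 'Will watch alone'
        ELSE 'Watch with others'
    END AS watch_category
FROM
    movie_rentals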

Now it is much easier to understand what you are trying to do in the SELECT clause. Sure, both queries will run without a problem because indentation doesn’t matter in SQL. But proper indentation is still important to keep lines short. And it will be valued by anyone reading your query, including yourself!

Multi-line comments

If you make comments that take up multiple lines, you can use -- for each line. Or, if you have more than two lines of comments, it might be cleaner and easier to use /* to start the comment and */ to close the comment. For example, you can use the -- method like below:

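A sketch using the purchases table again:

-- Calculate the total tickets sold
-- and the number of purchases made
SELECT
    SUM(tickets) AS total_tickets,
    COUNT(tickets) AS number_of_purchases
FROM
    purchases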

Or, you can use the /* */ method like below:

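The same sketch, with the two comment lines replaced by one block comment:

/* Calculate the total tickets sold
   and the number of purchases made */
SELECT
    SUM(tickets) AS total_tickets,
    COUNT(tickets) AS number_of_purchases
FROM
    purchases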

In SQL, it doesn’t matter which method you use. SQL ignores comments regardless of what you use: #, --, or /* and */. So it is up to you and your personal preference. The /* and */ method for multi-line comments usually looks cleaner and helps separate the comments from the query. But there isn’t one right or wrong method.

SQL text editors

When you join a company, you can expect it to use its own SQL platform and SQL dialect. The SQL platform it uses (e.g., BigQuery, MySQL, or SQL Server) is where you will write and run your SQL queries. But keep in mind that not all SQL platforms provide native script editors to write SQL code. SQL text editors give you an interface where you can write your SQL queries in an easier, color-coded way. In fact, all of the code we have been working with so far was written with an SQL text editor!

Examples with Sublime Text

If your SQL platform doesn’t have color coding, you might want to think about using a text editor like Sublime Text or Atom. This section shows how SQL is displayed in Sublime Text. Here is a query in Sublime Text:

Screenshot of query in sublime text

With Sublime Text, you can also do advanced editing like deleting indents across multiple lines at the same time. For example, suppose your query somehow had indents in the wrong places and looked like this:

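For illustration, here is a hypothetical query (made-up table and column names) with indents scattered in the wrong places:

        SELECT
field_1
    FROM
            table_1
WHERE
        field_1 = 'value'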

This is really hard to read, so you will want to eliminate those indents and start over. In a regular SQL platform, you would have to go into each line and press BACKSPACE to delete each indent, line by line. But in Sublime Text, you can get rid of all the indents at the same time by selecting all lines and pressing Command + [ (or CTRL + [ on Windows). This eliminates indents from every line. Then you can select the lines that you want to indent (i.e., lines 2, 4, and 6) by holding down the Command key (or the CTRL key on Windows) while selecting those lines, and then press Command + ] (or CTRL + ]) to indent lines 2, 4, and 6 at the same time. This will clean up your query so it looks like this instead:

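After those two shortcuts, the hypothetical query looks like this:

SELECT
    field_1
FROM
    table_1
WHERE
    field_1 = 'value'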

Sublime Text also supports regular expressions. Regular expressions (or regex) can be used to search for and replace string patterns in queries. We won’t cover regular expressions here, but you might want to learn more about them on your own because they are a very powerful tool. 

You can begin with these resources:

Explore BigQuery

For this activity, you will need a BigQuery account. If you haven’t made one already, you can follow the instructions from the Using BigQuery reading. Once you have your account, you can start exploring!

Open your console

1. Log in to BigQuery

2. Then, click the Go to console button on the BigQuery homepage. This will open a new tab with your console. 

3.  Take a moment to explore your console. On the left side, you will find the Explorer menu; this includes a search bar you can use to find resources, pinned projects, and the + ADD DATA button. On the right side, you will find the Query Editor. This is where you will input queries and view datasets. You can also find your Job History, Query History, and Saved Queries here.

Access public data in BigQuery

In order to actually start writing queries, you will need some data to work with. Once you’re familiar with the BigQuery interface, you can access a public dataset directly from your console. 

1. Click on the + ADD DATA button in the Explorer menu pane and select Explore public datasets. This will open a new menu where you can search public datasets that are already available through Google Cloud.

2. In the dataset menu you just opened, type london bicycle in the search box at the top; this will return the London Bicycle Hires dataset from the Greater London Authority. Click the dataset for more details. 

3. From the dataset information page, click the blue VIEW DATASET button. This will open your console in a new tab with this dataset loaded. 

You'll notice that bigquery-public-data is now pinned in your Explorer pane. You can now explore and query these public datasets. 

4. Click on the arrow next to bigquery-public-data and scroll down the list of public datasets until you find the london_bicycles data. When you click on the dataset, it will list two tables. Click on cycle_hire.

This will pull up a new tab in your Query Editor with information about the table schema. 

5. After checking out the table schema, you can take a peek into what data the cycle_hire table contains by clicking on the Preview tab. This will give you a better idea of what kind of data you’ll be working with.

Once you have finished previewing the data, you can write a query!

Query your data

So far, you’ve learned three basic parts of a query: SELECT, FROM, and WHERE. As a refresher, here are what those basic parts represent in the query: 

  • SELECT is the section of a query that indicates what data you want SQL to return to you.

  • FROM is the section of a query that indicates which table the desired data comes from.

  • WHERE is the section of a query that indicates any filters you’d like to apply to your dataset.

Write a basic query

Now, construct a simple command using the basic parts of a query you have already learned! For example, you can select a specific column from the cycle_hire table, such as the end_station_name column.  

1. Start your query with a SELECT clause and indicate which column you want to select from the table; in this case, you’ll input end_station_name.

2. After you have indicated which column you are selecting, write your FROM clause. You will need to specify the table you are querying from by inputting the following location: `bigquery-public-data.london_bicycles.cycle_hire`

The completed query should appear like this:

SELECT
    end_station_name
FROM
    `bigquery-public-data.london_bicycles.cycle_hire`;

3. Run your completed query by clicking on the blue RUN button. 

This query may take a few seconds to execute. Once it has finished, you will find the list of station names you requested under the Query Results console pane. 

Write a query to answer a question

After running the first basic query, try answering a specific question about the data. For example, how many bike trips lasted for 20 minutes or longer?

1. Click COMPOSE NEW QUERY to start a new query. Start with your SELECT statement again. This time, instead of returning individual columns, you will count rows: COUNT with an asterisk counts every row in the table, and AS gives the result a readable name. You can write it like this: SELECT COUNT(*) AS num_of_trips

2. Then you will add your FROM statement. You will be using the same dataset as the previous query: FROM `bigquery-public-data.london_bicycles.cycle_hire`.

3. Finally, you’ll add a WHERE statement to specify that you want to filter for only bike rides 20 minutes or longer. If you check the preview of this data, you might notice that the duration is recorded in seconds, so 20 minutes works out to 20 × 60 = 1,200 seconds. You can write that as WHERE duration >= 1200;

Your completed query should be written like this:

SELECT
    COUNT(*) AS num_of_trips
FROM
    `bigquery-public-data.london_bicycles.cycle_hire`
WHERE
    duration >= 1200;

4. Run your completed query by clicking on the blue RUN button.

This query may take a few seconds to execute. Once it has finished, you will find the count of rides that fit your criteria: more than 7 million bike trips lasted 20 minutes or longer!

Up for a challenge?

If you’re comfortable using queries to answer questions, try creating and running queries to answer any of the questions below: 

  • What are the names of the stations that bike_id 1710 started from?

  • How many bike_ids have ended at "Moor Street, Soho"?

  • What is the station_id for "Canton Street, Poplar"?

  • What is the name of the station whose ID is 111?

  • How many distinct bike_ids had trip durations greater than 2400 seconds (or 40 minutes)?
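If you want a starting point before checking your work, here is one possible approach to the first question (it uses the bike_id and start_station_name columns from the cycle_hire schema):

SELECT DISTINCT
    start_station_name
FROM
    `bigquery-public-data.london_bicycles.cycle_hire`
WHERE
    bike_id = 1710;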

You can use the solutions doc to check your work: Intro to BigQuery Solutions


Confirmation and reflection

Run another query on your table:

SELECT
    end_station_name
FROM
    `bigquery-public-data.london_bicycles.cycle_hire`
WHERE
    rental_id = 57635395;

At what station did the bike trip with rental_id 57635395 end?

