Mastering Aggregate Functions in SQL
Today we will look into aggregate functions to get a better understanding of how to obtain some data from integer columns. This is a topic you must know and master in order to be more proficient and obtain better data in SQL.
An aggregate function performs a calculation over a set of values and returns a single value. Except for COUNT(*), all aggregate functions ignore NULL values. They usually can be used with the GROUP BY clause.
All aggregate functions are deterministic. This means that aggregate functions return the same value each time that they are called, always with a specific set of input values. You can search for "Deterministic and Nondeterministic Functions" to better understand this concept.
There are a good number of aggregate functions, and we will go over some of the most commonly used ones. There are functions such as COUNT(), SUM() , MIN() , MAX() , and AVG(). Then we will take a look into how we can separate our data into different groups for analysis.
COUNT() Function
This one is really intuitive and works as expected, the COUNT() function does exactly what it sounds like. It counts the number of matches in the selection performed. For example, if we used the COUNT() function on the entire data set, the function would return the total number of observations within the dataset.
SELECT COUNT(*) FROM table_name;
This can be used in more useful queries with the WHERE statement.
Let's suppose that we have a table of books, with information about the author, genre, year, title, and sales. By using the WHERE clause with the COUNT() function, we can obtain more interesting data. We can count how many books are novels, or how many books were published in a given year.
But let's say that we have an author that has written lots of books, he will appear several times at our table. If we wanted the count of authors, we would only need each author to be counted once. In this case, we can use the COUNT(DISTINCT author)
and it would return the count of unique authors in our data.
These are just some of the possibilities for which the COUNT()function can be used.
MAX() Function
The MAX() function is also very intuitive. Given a selection of data, this function returns the max value found in that specific set of data. A simple use for this function in our previous dataset of books would be to find the book with the highest amount of sales.
SELECT MAX(sales) FROM books;
Or we can be more specific about the data that we want, and write a query like this.
SELECT MAX(sales) FROM books WHERE genre = 'fantasy' AND year = '2022';
MIN() Function
The MIN() function is very similar to the previous function, we will call it its little brother. However, instead of returning the max number, the function returns the opposite: the minimum number.
One simple example would be the following, but the logic is the same as the MAX() function.
SELECT MIN(sales) FROM books;
To find out the least sellingo book, and probably then we will want to know why it was so unpopular.
AVG() Function
The AVG() function is useful to obtain the average of a selected column in a given dataset. It can also be considered the mean or the sum of the observations divided by the number of observations*.
There are infinite uses for this function and you can probably think of multiple ways of how it can be applied.
For example, we could use our books table. If we wanted to know the average selling of a book for a given year we could execute SELECT AVG(sales) FROM books WHERE year = '2022';
.
We could also use the AVG() function in a different way, to only obtain the books that have a selling count higher than the average of all books.
SELECT * FROM books WHERE sales > (SELECT AVG(sales) FROM books);
SUM() Function
The SUM() function adds together all the observations in a given column, and it's the last aggregate function that we will cover today. A simple example would be using the SUM() function to find the total sales of all books. But probably we would want to know something more specific, like the total book selling for a given author.
SELECT SUM(sales) FROM books WHERE author = 'Sthephen King'
The GROUP BY Statement
Finally, we will see the GROUP BY statement. Which groups different observations together by similar values into several summary rows. A quick and simple example of this would be using the books table from the COUNT() function. There are books from several different genres. The GROUP BY statement could give an easy summary of the number of books that were written for each genre. The query would look like the following:
SELECT genre, COUNT(*) FROM books GROUP BY genre;
This could reveal the genres with the most books, and other interesting data.
SELECT author, AVG(sales) FROM books GROUP BY author HAVING AVG(sales) > 1000;
Here we would return each author with the average sales of all his books, always that he has an average sales of over 1000 per book.
Something else to take into consideration is that the data can be grouped by multiple columns, using a comma in between the different fields. For example, if you are working with a large dataset of books, the data may need to be grouped first by genre and then by author.
You can find more information about aggregate functions online. We have covered the most used ones, but there are also other ones used in more specific cases. For example, here's a list of all aggregate functions in T-SQL. APPROX_COUNT_DISTINCT AVG, CHECKSUM_AGG, COUNT, COUNT_BIG, GROUPING ,GROUPING_ID ,MAX ,MIN, STDEV, STDEVP, STRING_AGG, SUM, VAR, VARP
I hope this post was able to give you more confidence to start using different aggregate functions, or at least to have served as a quick refresher.
Thanks for reading, and have a great rest of your day! Keep coding.