SQLite | Group By 和 Order By 子句

文章目录

1. Group by and Order by1.1 Group Records1.2 Ordering Records1.3 Aggregate Functions1.4 The Having Statement1.5 Getting Distinct Records参考资料

1. Group by and Order by

我们在上一篇中介绍了 Where 子句，接下来我们将使用 Group by 和 Order by 子句，对数据进行聚合和排序。

使用Jupyter Notebook 运行 SQL 语句需安装 ipython-sql

%sql 以及 %%sql 为在 Notebook 中运行 SQL 语句，在 SQLite 命令行或 SQLite Stiduo 中不需要 %sql 或 %%sql

载入 SQL 以及连接 SQLite：

%load_ext sql%sql sqlite:///DataBase/weather_stations.db

'Connected: @DataBase/weather_stations.db'

本文将使用 weather_stations.db 数据库，其中包含了 STATION_DATA 表。

首先查看 STATION_DATA 表中的数据：

%sql select * from station_data limit 0,5; -- 筛选前五行

* sqlite:///DataBase/weather_stations.dbDone.

1.1 Group Records

首先从最简单的聚合方法开始：计数：

%%sqlselect count(*) as record_cound from station_data;

* sqlite:///DataBase/weather_stations.dbDone.

count(*)意味着计算记录的长度，你也可以和其他 SQL 操作符结合起来使用，比如where，我们可以这样计算 tornado 出现的次数：

%%sqlselect count(*) as record_count from station_datawhere tornado == 1;

* sqlite:///DataBase/weather_stations.dbDone.

我们找到了 3000 条包含 tornado 的记录，但如果我们想要按年计数呢？我们可以这样写：

%%sqlselect year, count(*) as record_count from station_datawhere tornado == 1group by yearlimit 0,3; -- 只展示前三条

* sqlite:///DataBase/weather_stations.dbDone.

我们现在可以看到每年的计数，让我们拆分下这个查询来看看怎么执行的：

select year, -- 1. 首先，我们选择了 year（select year）count(*) as record_count -- 2. 然后我们用 **count(\*)** 对筛选的记录进行了计数from station_datawhere tornado == 1 -- 3. 我们筛选了 tornado 为 true 的数据group by year -- 4. 最后，按年进行分类

我们也可以在多个 field 上进行聚合：

%%sqlselect year, month,count(*) as record_countfrom station_datawhere tornado == 1group by year, monthlimit 0,3;

* sqlite:///DataBase/weather_stations.dbDone.

此外，在使用group by时，我们可以也用序数位置（ordinal positions）：

%%sqlselect year, month,count(*) as record_countfrom station_datawhere tornado == 1group by 1, 2 -- ordinal positionslimit 0,5;

* sqlite:///DataBase/weather_stations.dbDone.

不是所有的平台都支持 ordinal positions，例如 Oracle 和 SQL Server，就只能写全称

1.2 Ordering Records

需要注意到，我们通过 group 得到的数据中 month 并不是按自然月份排序的，所以字哦好就是同时使用oreder by操作符来进行排序，如果你想要先按年份排序，再按月份排序，你只需要添加：

%%sqlselect year, month,count(*) as record_countfrom station_datawhere tornado == 1group by 1, 2 -- ordinal positionsorder by 1, 2 -- order by 同样支持 ordinal positionslimit 0,5;

* sqlite:///DataBase/weather_stations.dbDone.

order by 默认是按升序（ASC）排列的，然而你可能更对近期的数据感兴趣，你可以通过添加DESC来指定排序方式：

%%sqlselect year, month,count(*) as record_countfrom station_datawhere tornado == 1group by year, monthorder by year DESC, monthlimit 0,5;

* sqlite:///DataBase/weather_stations.dbDone.

1.3 Aggregate Functions

我们已经使用count(*)来对记录进行计数了，但还有其他的一些聚合函数（AggregateyFunctions），

如sum()、min()、max()和avg()。我们可以在特定的列上使用聚合函数来进行计算。

图1 SQLite 内置聚合函数

但首先让我们来看看count()的另一种使用方式，count()可以用于除了计数以外的其他用途。如果你不使用 * ，

而是指定某一列，那么它将会计算所有非缺失值（non-null）的个数。举个例子，我们可以计算 snow_depth 中非缺失值的个数：

%%sqlselect count(snow_depth) as recorded_snow_depth_countfrom station_data

* sqlite:///DataBase/weather_stations.dbDone.

让我们进一步看看聚合函数，如果你想要看看你从 2000 年开始每个月的平均温度，你可以先筛选 2000 年的记录，

然后按月份分组，最后计算平均温度：

%%sqlselect month, avg(temperature) as avg_tempfrom station_datawhere year >= 2000group by monthlimit 0,3;

* sqlite:///DataBase/weather_stations.dbDone.

sum()是另一个常见的聚合操作符，为了得到 2000 年至今每年的下雪深度，你可以这样查询：

%%sqlselect year, sum(snow_depth) as total_snowfrom station_datawhere year >= 2000group by yearlimit 0,3;

* sqlite:///DataBase/weather_stations.dbDone.

你可以在一次查询中多次使用聚合操作，我们将 2000 年以来的下雪总量、下雨总量和最大降雨量分别统计出来，并保留两位小数：

%%sqlselect year,round(sum(snow_depth), 2) as total_snow,round(sum(precipitation), 2) as total_precipitation,round(max(precipitation), 2) as max_precipitationfrom station_datawhere year >= 2000group by yearlimit 0,3;

* sqlite:///DataBase/weather_stations.dbDone.

1.4 The Having Statement

假设你想要基于一个聚合值来筛选记录，你的第一反应应该是使用where子句。确实，where子句可以

用来筛选记录，但是却无法用于聚合值上。举个例子，如果你想使用where子句筛选出总下雨量大于 30 的记录，

就会出现以下错误：

%%sqlselect year, sum(precipitation) as total_precipitationfrom station_datawhere total_precipitation > 30group by yearlimit 0,3;

* sqlite:///DataBase/weather_stations.db(sqlite3.OperationalError) misuse of aggregate: sum()[SQL: select year, sum(precipitation) as total_precipitationfrom station_datawhere total_precipitation > 30group by yearlimit 0,3;](Background on this error at: http://sqlalche.me/e/e3q8)

为什么不起作用呢？首先我们来看下聚合的原理，首先程序一行一行的扫描，找出那些在where 子句

上成立的数据，然后再进行聚合。然而在聚合前并没有total_precipitation这一列数据，因此出错。

当你想在聚合值上执行where这个方法时，只能使用having这个关键词：

%%sqlselect year,sum(precipitation) as total_precipitationfrom station_datagroup by yearhaving total_precipitation > 30limit 0,3

* sqlite:///DataBase/weather_stations.dbDone.

having相当于聚合版的where，但并不是所有平台都支持在 aliases 上使用having，

如 Oracle（group by 也不行），这意味着当你使用 having 时需要再输入一次聚合函数，像这样：

%%sqlselect year,sum(precipitation) as total_preicipitationfrom station_datagroup by yearhaving sum(precipitation) > 30limit 0,3

* sqlite:///DataBase/weather_stations.dbDone.

1.5 Getting Distinct Records

当我们使用 **select from** 时，记录中可能会包含重复值，如果你只想要返回**唯一值（distinct records）**，你可以使用 **select distinct from**，比如我们的 station_data，表中 station_number 一列包含了 28000 个值，但你通过 **select distinct from** 后会发现其中是 6368 个值不断重复出现组成的

%%sqlselect count(station_number) as duplicate_numfrom station_data;

* sqlite:///DataBase/weather_stations.dbDone.

%%sqlselect count(distinct station_number) as distinct_numfrom station_data;

* sqlite:///DataBase/weather_stations.dbDone.