Tecdat | Big data analysis in R of New York City 311 complaints: summary statistics, visualization, and time series analysis

Original link: /?p=9800

Source: the Tecdat (拓端数据部落) WeChat public account

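Introduction

This article does not claim that R is better or faster than Python for data analysis; I use both languages every day. It simply offers an opportunity to compare the two.

The data used here is updated daily, and my copy of the file is larger, at 4.63 GB.

The CSV file contains New York City's 311 complaints. It is the most popular dataset on the NYC Open Data portal.

Data workflow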

install.packages("devtools")
library("devtools")
install_github("ropensci/plotly")

library(plotly)

You need to create an account to connect to the plotly API. Alternatively, you can just use the default ggplot2 graphics.

set_credentials_file("DemoAccount", "lr1c37zw81") ## Replace contents with your API Key

Analysis in R with dplyr

This assumes sqlite3 is installed (and therefore accessible from the terminal).

$ sqlite3 data.db # Create your database
$.databases # Show databases to make sure it works
$.mode csv
$.import <filename> <tablename>
# Where filename is the name of the csv & tablename is the name of the new database table
$.quit

Load the data into memory.

library(readr)
library(data.table) # provides fread()
# data.table, selecting a subset of columns
time_data.table <- system.time(
  fread('/users/ryankelly/NYC_data.csv',
        select = c('Agency', 'Created Date', 'Closed Date', 'Complaint Type', 'Descriptor', 'City'),
        showProgress = TRUE)
)

kable(data.frame(rbind(time_data.table, time_data.table_full, time_readr)))

I will use data.table to read the data. Its fread function speeds up reading considerably.
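As a minimal sketch of reading only a subset of columns with fread, the example below writes a tiny made-up CSV to a temporary file and reads it back. The file contents are hypothetical, and stripping the spaces from the column names is my assumption about how headers such as "Created Date" become names like CreatedDate.

```r
library(data.table)

# Hypothetical toy file standing in for the real NYC_data.csv
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(
  `Unique Key`   = 1:3,
  Agency         = c("NYPD", "HPD", "DOT"),
  `Created Date` = c("01/01/2015 10:00:00 AM",
                     "01/01/2015 10:07:00 AM",
                     "01/01/2015 10:20:00 AM"),
  City           = c("BROOKLYN", "NEW YORK", "JAMAICA"),
  check.names    = FALSE
)
fwrite(toy, tmp)

# Read only the columns that are needed; "Unique Key" is not loaded
dt <- fread(tmp, select = c("Agency", "Created Date", "City"))

# Assumption: remove spaces from the headers so columns can be
# referred to without backticks (e.g. CreatedDate)
setnames(dt, gsub(" ", "", names(dt), fixed = TRUE))
print(names(dt))
```

Selecting columns at read time keeps memory use down, which matters for a file of several gigabytes.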

About dplyr

By default, a dplyr query against a database pulls only the first 10 rows.

library(dplyr) ## Will be used for pandas replacement
# Connect to the database
db <- src_sqlite('/users/ryankelly/data.db')
db
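To illustrate the lazy behavior, here is a small sketch that uses an in-memory SQLite database with a few made-up rows instead of the real data.db. It connects through DBI::dbConnect with the RSQLite driver, which is my substitution for src_sqlite, and it needs the dbplyr package installed. The table contents are hypothetical.

```r
library(dplyr)
library(DBI)

# Assumption: an in-memory SQLite database with made-up rows
# stands in for the article's data.db
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "data", data.frame(
  Agency = c("NYPD", "HPD", "DOT"),
  City   = c("BROOKLYN", "NEW YORK", "JAMAICA")
))

data <- tbl(con, "data")                  # lazy reference; no rows fetched yet
q <- data %>% filter(City == "BROOKLYN")  # still lazy; only SQL is built

show_query(q)         # print the SQL that dplyr generated
result <- collect(q)  # run the query and pull every matching row into R

dbDisconnect(con)
```

Calling collect() is what brings the full result into memory; until then, only a preview is fetched when the query is printed.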

The two best options for data processing (besides base R) are data.table and dplyr.

Previewing the data

library(knitr) # provides kable()
# Wrapped in a function for display purposes
head_ <- function(x, n = 5) kable(head(x, n))
head_(data)

Select a few columns
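A minimal sketch of column selection with dplyr follows. The toy data frame is hypothetical and stands in for the `data` table; the same verbs apply to a database-backed table.

```r
library(dplyr)

# Hypothetical toy rows standing in for the article's `data` table
data <- data.frame(
  Agency        = c("NYPD", "HPD", "DOT"),
  ComplaintType = c("Noise", "HEATING", "Street Condition"),
  Descriptor    = c("Loud Music", "HEAT", "Pothole"),
  City          = c("BROOKLYN", "NEW YORK", "JAMAICA")
)

# SQL counterpart: SELECT Agency, Descriptor FROM data
# data.table counterpart: dt[, .(Agency, Descriptor)]
q <- data %>% select(Agency, Descriptor)
head(q)
```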

Filter rows with WHERE
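In dplyr, the counterpart of a WHERE clause is filter(). The sketch below uses a hypothetical toy data frame in place of the `data` table.

```r
library(dplyr)

# Hypothetical toy rows standing in for the article's `data` table
data <- data.frame(
  Agency        = c("NYPD", "HPD", "DOT", "NYPD"),
  ComplaintType = c("Noise", "HEATING", "Street Condition", "Illegal Parking"),
  Descriptor    = c("Loud Music", "HEAT", "Pothole", "Blocked Hydrant")
)

# SQL counterpart:
#   SELECT ComplaintType, Descriptor, Agency FROM data WHERE Agency = 'NYPD'
q <- data %>%
  filter(Agency == "NYPD") %>%
  select(ComplaintType, Descriptor, Agency)
head(q)
```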

Filter on multiple values in a column with WHERE and IN
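The IN operator maps onto R's %in%. A short sketch, again on hypothetical toy rows:

```r
library(dplyr)

# Hypothetical toy rows standing in for the article's `data` table
data <- data.frame(
  Agency        = c("NYPD", "HPD", "DOB", "NYPD"),
  ComplaintType = c("Noise", "HEATING", "Elevator", "Illegal Parking")
)

# SQL counterpart:
#   SELECT ComplaintType, Agency FROM data WHERE Agency IN ('NYPD', 'DOB')
q <- data %>%
  filter(Agency %in% c("NYPD", "DOB")) %>%
  select(ComplaintType, Agency)
head(q)
```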

Find the unique values in a column with DISTINCT

##       City
## 1 BROOKLYN
## 2 ELMHURST
## 3  JAMAICA
## 4 NEW YORK
## 5
## 6  BAYSIDE
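A listing like the one above can be produced with distinct(). This is a minimal sketch on hypothetical toy rows, not the query that generated that output.

```r
library(dplyr)

# Hypothetical toy rows standing in for the article's `data` table
data <- data.frame(
  City = c("BROOKLYN", "ELMHURST", "BROOKLYN", "JAMAICA", "ELMHURST")
)

# SQL counterpart: SELECT DISTINCT City FROM data
# data.table counterpart: dt[, unique(City)]
q <- data %>% select(City) %>% distinct()
head(q)
```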

Count values with COUNT(*) and GROUP BY

# dt[, .(plaints = .N), Agency]
# setkey(dt, plaints) # setkey indexes the data
q <- data %>% select(Agency) %>% group_by(Agency) %>% summarise(plaints = n())
head_(q)

Sort the results with ORDER BY and -
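Sorting in descending order can be sketched as follows. The toy rows and the column name n_complaints are hypothetical. arrange(desc(x)) plays the role of ORDER BY x DESC, and for a numeric column arrange(-x) is an equivalent shorthand.

```r
library(dplyr)

# Hypothetical toy rows standing in for the article's `data` table
data <- data.frame(
  Agency = c("NYPD", "HPD", "NYPD", "DOT", "NYPD", "HPD")
)

# SQL counterpart:
#   SELECT Agency, COUNT(*) AS n_complaints FROM data
#   GROUP BY Agency ORDER BY n_complaints DESC
q <- data %>%
  group_by(Agency) %>%
  summarise(n_complaints = n()) %>%
  arrange(desc(n_complaints))  # arrange(-n_complaints) gives the same order
head(q)
```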

How many cities are in the database?

# dt[, unique(City)]
q <- data %>% select(City) %>% distinct() %>% summarise(Number.of.Cities = n())
head(q)

##   Number.of.Cities
## 1             1818

Let's plot the 10 cities with the most complaints

Normalize the case of CITY with UPPER.
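The case normalization and top-10 ranking can be sketched together. The rows below are made up, and n_complaints is a hypothetical column name. On a data frame toupper() runs in R; on a database-backed table dplyr is expected to translate it to SQL UPPER.

```r
library(dplyr)

# Hypothetical toy rows: the same city spelled with different casing
data <- data.frame(
  City = c("Brooklyn", "BROOKLYN", "brooklyn", "Jamaica", "JAMAICA", "Bayside")
)

q <- data %>%
  mutate(CITY = toupper(City)) %>%       # merge case variants into one label
  count(CITY, name = "n_complaints") %>% # complaints per normalized city
  arrange(desc(n_complaints)) %>%
  head(10)                               # keep at most the top 10
q
```

Without the normalization step, "Brooklyn" and "BROOKLYN" would be counted as separate cities.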

Complaint types (by city)

# Plot result
plt <- ggplot(q_f, aes(ComplaintType, plaints, fill = CITY)) +
  geom_bar(stat = 'identity') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
plt

Part 2: Time series operations

The data as provided does not match SQLite's standard date format.

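There are two ways to deal with this: create a new column in the SQL database and reinsert the data using a formatted date statement, or create a new table and insert the formatted dates under the original column name.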
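Whichever option is chosen, the date strings themselves have to be rewritten. The sketch below shows only that conversion step in base R. It assumes the raw values look like "MM/DD/YYYY hh:mm:ss AM/PM", which is my assumption about the CSV, and to_sqlite_timestamp is a hypothetical helper name.

```r
# Hypothetical helper: convert "MM/DD/YYYY hh:mm:ss AM/PM" strings
# (the assumed raw format) into SQLite-friendly "YYYY-MM-DD hh:mm:ss".
# Note: %p (AM/PM) parsing relies on an English or C time locale.
to_sqlite_timestamp <- function(x) {
  parsed <- as.POSIXct(x, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
  format(parsed, "%Y-%m-%d %H:%M:%S")
}

# Made-up example values
raw <- c("01/02/2015 11:47:00 PM", "01/03/2015 12:05:30 AM")
converted <- to_sqlite_timestamp(raw)
print(converted)
```

Strings in this layout sort chronologically as plain text, which is what makes the string comparisons in the following queries work.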

Filter SQLite rows using timestamp strings: YYYY-MM-DD hh:mm:ss

# The year prefix of the timestamps below is missing; full values follow YYYY-MM-DD hh:mm:ss
# dt[CreatedDate < '-11-26 23:47:00' & CreatedDate > '-09-16 23:45:00',
#    .(ComplaintType, CreatedDate, City)]
q <- data %>%
  filter(CreatedDate < "-11-26 23:47:00", CreatedDate > "-09-16 23:45:00") %>%
  select(ComplaintType, CreatedDate, City)
head_(q)

Pull the hour out of a timestamp with strftime

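# dt[, hour := strftime('%H', CreatedDate), .(ComplaintType, CreatedDate, City)]
# On the SQLite-backed table, strftime() is meant to run as SQLite's strftime(format, timestring)
q <- data %>%
  mutate(hour = strftime('%H', CreatedDate)) %>%
  select(ComplaintType, CreatedDate, City, hour)
head_(q)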

Aggregating time series

First, create a new column holding the timestamp rounded down to the previous 15-minute interval.

# Using lubridate::new_period()
# dt[, interval := CreatedDate - new_period(900, 'seconds')][, .(CreatedDate, interval)]
q <- data %>%
  mutate(interval = sql("datetime((strftime('%s', CreatedDate) / 900) * 900, 'unixepoch')")) %>%
  select(CreatedDate, interval)
head_(q, 10)
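The arithmetic behind the SQL expression is: convert the timestamp to seconds since the epoch, divide by 900 (15 minutes) and drop the remainder, then multiply by 900 again. The sketch below reproduces that in plain R on made-up timestamps and then counts complaints per interval. The column names secs and n_complaints are hypothetical.

```r
library(dplyr)

# Made-up timestamps already in "YYYY-MM-DD hh:mm:ss" form
data <- data.frame(
  CreatedDate = c("2015-01-02 10:07:31", "2015-01-02 10:14:59",
                  "2015-01-02 10:15:00", "2015-01-02 10:44:10")
)

q <- data %>%
  mutate(
    # seconds since 1970-01-01 00:00:00 UTC
    secs = as.numeric(as.POSIXct(CreatedDate, tz = "UTC")),
    # floor to a multiple of 900 seconds, then format back to text
    interval = format(
      as.POSIXct(floor(secs / 900) * 900, origin = "1970-01-01", tz = "UTC"),
      "%Y-%m-%d %H:%M:%S"
    )
  ) %>%
  count(interval, name = "n_complaints")  # complaints per 15-minute bucket
q
```

Here 10:07:31 and 10:14:59 both fall into the 10:00:00 bucket, while 10:15:00 starts a new one.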

Plotting the results