使用R从eHYD读取数据

亲爱的大家,

在精简有关极值分析的帖子的过程中,我尝试重新导入用于阐述的数据集。通常,肯定会使用自定义函数来执行诸如数据导入之类的任务,而且我认为在这种演示情况下这也很有用,原因有两个。首先,此过程在多个连续的帖子中使用。重复执行相同的操作可能是使用功能的最明显指示。其次,最初导入数据的方式有些笨拙,并且更新的版本也更美观。

在原始版本中,所有数据集都是手动导入的,如下所示:

# load package RCurl
library(RCurl)

# get 数据 from H and build 数据 frame
link <- getURL("http://ehyd.gv.at/eHYD/MessstellenExtraData/nlv?id=107540&file=2")
df <- read.csv(text = link, sep=";", dec=",", skip=21, header=FALSE)
df[df == "Lücke"] = NA
colnames(df) <- c("日期","沉淀")
df$Date <- as.Date(df$Date, format = "%d.%m.%Y")
df$Precipitation <- as.numeric(gsub(",", ".", as.character(df$Precipitation)))
head(df)

This resulted in a simple 数据 frame called df with two columns, “Date” and “Precipitation”。即使这样导入我们的数据集绝对没有错,但仍有改进的空间,尤其是就编码样式而言。使用以下简单功能,可以更方便地完成从奥地利水文局网站上读取数据的过程:

# load package xts
library(xts)

read.ehyd <- function(ehyd_url) {
    # separate the header, open the connection with correct encoding
    con <- url(ehyd_url, encoding = "latin1")
    header <- readLines(con, n=50)
    lines.header <- grep("Werte:", header, fixed = T)
    # read 数据, define time and values
    infile <- read.csv2(con, header = F, skip = lines.header,
        col.names = c("time", "value"),
        colClasses = c("character", "numeric"),
        na.strings = "Lücke",
        strip.white = TRUE, as.is = TRUE, fileEncoding = "latin1")
    infile$time <- as.POSIXct(infile$time, format = "%d.%m.%Y %H:%M:%S")
    # return time series object of class xts
    return(xts(infile$value, order.by = infile$time))
}

In doing so we can avoid using the package RCurl. The resulting object is a nice xts time series, and thus a closer representation of the actual 数据 structure, which is in fact a hydrological time series.

此功能足以在我的极值分析和图表的教程中使用。但是,我们仍然可以通过几种方式来增强此功能。

For instance, the argument ehyd_url of the function could be disassembled further. The URL includes the ID of the station (107540), an indicator that the requested file belongs to the 区 of ‘降水,气温和蒸发’ (in German: ‘Niederschlag,Lufttemperatur和Verdunstung’, thus ‘NLV’)以及水文参数指标(2表示降水)。因此,可以进一步修改该功能,以直接接受URL或从有关文件属性的某些输入信息中构建URL。在一个简单的版本中,我们可以开始构建如下功能:

suppressPackageStartupMessages({
  library(xts)
})

read.ehyd <- function(hzbnr, parameter = c("niederschlag", "neuschnee",
                                           "schneehöhe", "wasserstand", 
                                           "abfluss"), ehyd_url) {
  parameter <- match.arg(parameter)
  if(missing(ehyd_url)){
    区 <- ifelse(parameter %in% c("wasserstand", "abfluss"), "owf", "nlv")
    param_names <- c("niederschlag","neuschnee", "schneehöhe", 
                     "wasserstand", "abfluss")
    param_ids <- c(2, 3, 4, 7, 4)
    file_nr <- param_ids[which(param_names == parameter)]
    ehyd_url <- paste0("http://ehyd.gv.at/eHYD/MessstellenExtraData/",
                       区, "?id=", hzbnr, "&file=", file_nr)
  }
  ...
}

with hzbnr indicating the station ID, parameter being the environmental variable we are interested in, “area” being either ‘nlv’ or ‘owf’ (depending on the input parameter), and file_nr being the indicator for correct file depending on the desired environmental variable.

您还可以弄乱一些正则表达式,以从相应文件的标题中获取测站名称和水文参数的名称(德语)。

在一个简单的版本中(’包括通过提供水文参数和如上所述的站点ID来构建URL,这可能看起来像这样:

read_ehyd <- function(ehyd_url) {
  # separate the header, open the connection with correct encoding
  con <- url(ehyd_url, encoding = "latin1")
  header <- readLines(con, n=50)
  lines.header <- grep("Werte:", header, fixed = T)
  # read 数据, define time and values
  infile <- read.csv2(con, header = F, skip = lines.header,
                      col.names = c("time", "value"),
                      colClasses = c("character", "numeric"),
                      na.strings = "Lücke",
                      strip.white = TRUE, as.is = TRUE, fileEncoding = "latin1")
  infile$time <- as.POSIXct(infile$time, format = "%d.%m.%Y %H:%M:%S")
  # output message
  station <- gsub("(^.*;)([a-zA-Z]+)", "\\2", header[1])
  line_param <- grep("Exportzeitreihe:", header, fixed = T)
  if(endsWith(ehyd_url, "2")){
    param <- gsub("(^.*;)([a-zA-Z]+)(,.+$)", "\\2", header[line_param])
  } else {
    param <- gsub("Exportzeitreihe:           ;", "", header[line_param])
    param <- strsplit(param, split = ",")[[1]][[2]]
  }
  message("数据 set: ", param, " in ", station)
  # return time series object of class xts
  return(xts(infile$value, order.by = infile$time))
}

A nicely implemented version to solve the task at hand is the function read.hzb() which can be found at
//github.com/mundl/readhyd

关于作者

马蒂亚斯在维也纳自然资源与生命科学大学学习了环境信息管理,并获得了环境统计博士学位。他的论文的重点是罕见(极端)事件的统计建模,作为对关键基础设施进行漏洞评估的基础。他目前在奥地利国家气象和地球物理服务局(ZAMG)和BOKU大学山区风险工程研究所工作。他目前专注于(统计)不良天气事件和自然灾害以及减少灾害风险的评估。他的主要兴趣是环境现象的统计建模以及用于数据科学,地理信息和遥感的开源工具。

发表回复

*