Efficiently importing large csv files into R

Hi everyone,

After a short summer break, we are ready to start afresh after a few slow weeks.

If you are working with large data sets stored in .csv format, the files may contain hundreds of MB or even several GB of data. Importing these csv files with the common read.csv or read.table command takes a considerable amount of time. You can tweak read.table to run a bit faster by knowing your data set and specifying the parameters accordingly, or simply use scan instead, but the fastest and most convenient method I have come across so far is fread from the package data.table. This makes data.table a really helpful package for improving efficiency in R if you work with large csv files.
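
As a rough illustration of what "specifying the parameters accordingly" can look like, here is a minimal sketch using base R only. The file, the column names (id, value, label) and the row count are made up for the example; in practice you would use the types and size of your own data set.

```r
# Write a small example file to a temporary location (illustrative data only).
tmp <- tempfile(fileext = ".csv")
example <- data.frame(
    id    = 1:1000,
    value = rnorm(1000),
    label = sample(c("one", "two", "three"), 1000, replace = TRUE)
)
write.csv(example, tmp, row.names = FALSE)

# Declaring the column types spares read.table from guessing them,
# nrows gives it an upper bound on the number of rows to read,
# and comment.char = "" switches off the scan for comment characters.
df <- read.table(
    tmp,
    header       = TRUE,
    sep          = ",",
    quote        = "\"",
    colClasses   = c("integer", "numeric", "character"),
    nrows        = 1000,
    comment.char = ""
)
str(df)
```

How much this helps depends on the file; on a data set this small the difference is not measurable, so treat it as a pattern to try on your own data rather than a benchmark.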

Other packages you might want to check out in this context are readr, sqldf, bigmemory and ff.

Here is a short comparison of how the running time improves when using fread instead of read.table on a randomly generated data set. The data set has more than 8 million rows and 8 columns (seven generated columns plus the row names that write.table adds by default) and takes up around 570 MB on disk:

library(data.table)

n <- as.numeric(Sys.Date())*500
sampledata <- data.table(
    r1 = sample(1:10000, n, replace=TRUE),
    r2 = rnorm(n),
    r3 = sample(1:5000, n, replace=TRUE),
    r4 = sample(c("one","two","three","four","five"), n, replace=TRUE),
    r5 = rpois(n, lambda = 2000),
    r6 = sample(state.name, n, replace=TRUE),
    r7 = runif(n, 0, 42)
)

write.table(sampledata,"sample.csv", sep=",", quote=FALSE)

system.time(df1 <- read.table("sample.csv", sep=","))
system.time(df2 <- fread("sample.csv"))

The results are quite obvious: reading all the data with the basic read.table command takes more than 100 seconds, while fread completes the same task in less than 20 seconds:

system.time(df1 <- read.table("sample.csv", sep=","))
       user      system     elapsed 
     105.23        2.09      108.48

system.time(df2 <- fread("sample.csv"))
Read 8336000 rows and 8 (of 8) columns from 0.566 GB file in 00:00:18
       user      system     elapsed 
      16.94        0.20       17.40
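
If you only need part of a file, fread can also skip columns while reading. The following is a small sketch, assuming the data.table package is installed; the tiny stand-in file and its column names are invented for illustration and are not the 570 MB file from the comparison above.

```r
library(data.table)

# Small stand-in for a large csv file (illustrative data only).
tmp <- tempfile(fileext = ".csv")
small <- data.table(
    r1 = sample(1:10000, 1000, replace = TRUE),
    r2 = rnorm(1000),
    r4 = sample(c("one", "two", "three"), 1000, replace = TRUE)
)
fwrite(small, tmp)

# Read only two of the three columns and declare the type of r1 up front.
dt <- fread(tmp, select = c("r1", "r4"), colClasses = c(r1 = "integer"))
str(dt)
```

Whether dropping columns saves noticeable time depends on the file and on how many columns you leave out.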

About the author

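Matthias studied Environmental Information Management at the University of Natural Resources and Life Sciences, Vienna, and holds a PhD in environmental statistics. His thesis focused on the statistical modelling of rare (extreme) events as a basis for vulnerability assessment of critical infrastructure. He currently works at the Austrian national weather and geophysical service (ZAMG) and at the Institute of Mountain Risk Engineering at BOKU University. His current focus is the (statistical) assessment of adverse weather events and natural hazards, and disaster risk reduction. His main interests are the statistical modelling of environmental phenomena and open-source tools for data science, geoinformation and remote sensing.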
