Python vs. awk: file-processing speed comparison
We have the following three files:
wc -l breakfast_all cheap_all receptions_all
3345271 breakfast_all
955890 cheap_all
505504 receptions_all
4806665 total
head -3 cheap_all
a true
b true
c true
All three files share the same structure, with the uid in the first column. The goal is to count how many distinct uids appear across the three files in total. I wrote both a Python script and an awk script to compare their text-processing speed.
The Python code:
#!/usr/bin/env python
#coding:utf-8
import time

def t1():
    # Deduplicate uids with a dict
    dic = {}
    filelist = ["breakfast_all", "receptions_all", "cheap_all"]
    start = time.clock()
    for each in filelist:
        f = open(each, "r")
        for line in f.readlines():
            key = line.strip().split()[0]
            if key not in dic:
                dic[key] = 1
        f.close()
    end = time.clock()
    print len(dic)
    print "cost time is: %f" % (end - start)

def t2():
    # Deduplicate uids with a set
    uid_set = set()
    filelist = ["breakfast_all", "receptions_all", "cheap_all"]
    start = time.clock()
    for each in filelist:
        f = open(each, "r")
        for line in f.readlines():
            key = line.strip().split()[0]
            uid_set.add(key)
        f.close()
    end = time.clock()
    print len(uid_set)
    print "cost time is: %f" % (end - start)

t1()
t2()

Processing with awk:
#!/bin/bash
function handle()
{
    # date +%s%N prints seconds plus nanoseconds (19 digits);
    # the first 16 digits give a microsecond-resolution timestamp.
    start=$(date +%s%N)
    start_ms=${start:0:16}
    # Single quotes keep $1 for awk; with double quotes the shell would
    # expand $1 itself and break the script.
    # Note: length(a) on an array is a gawk extension.
    awk '{a[$1]++} END{print length(a)}' breakfast_all receptions_all cheap_all
    end=$(date +%s%N)
    end_ms=${end:0:16}
    echo "cost time is:"
    echo "scale=6;($end_ms - $start_ms)/1000000" | bc
}
handle
Running the Python script:
./test.py
3685715
cost time is: 4.890000
3685715
cost time is: 4.480000
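As a side note, the script above is Python 2; under Python 3, `time.clock()` no longer exists (removed in 3.8). A minimal Python 3 sketch of the same set-based approach, using `time.perf_counter()` and streaming each file line by line instead of `readlines()` (the function name `count_uids` is my own, not from the original script):

```python
#!/usr/bin/env python3
import time

def count_uids(paths):
    """Count distinct first-column uids across the given files."""
    uids = set()
    start = time.perf_counter()
    for path in paths:
        with open(path) as f:
            for line in f:           # stream lines; avoids loading the whole file
                fields = line.split()
                if fields:           # skip blank lines
                    uids.add(fields[0])
    elapsed = time.perf_counter() - start
    print(len(uids))
    print("cost time is: %f" % elapsed)
    return len(uids)
```

For the data above it would be called as `count_uids(["breakfast_all", "receptions_all", "cheap_all"])`.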
Running the shell script:
./zzz.sh
3685715
cost time is:
4.865822
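The distinct-uid count can also be cross-checked with a sort-based pipeline, which needs no associative array (though `sort` is typically much slower on inputs this size). A sketch; the demo below uses tiny stand-in files in a temp dir, so substitute the real breakfast_all / receptions_all / cheap_all paths for the actual data:

```shell
# Stand-in files for demonstration only.
cd "$(mktemp -d)"
printf 'a true\nb true\n' > breakfast_all
printf 'b true\nc true\n' > receptions_all
printf 'c true\nd true\n' > cheap_all

# Print the first column, deduplicate with sort -u, count the survivors.
awk '{print $1}' breakfast_all receptions_all cheap_all | sort -u | wc -l
```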
As the numbers show, Python's set is marginally faster than its dict here, and overall awk's processing speed is roughly on par with Python's.
