
How to Efficiently Get the Number of Lines in a File

Posted: 2021-07-01 10:21:17

The goal is to get the line count of a large file (hundreds of thousands of lines) in Python.

The simple approach:

    def file_len(fname):
        # Walk the file line by line; enumerate() tracks the last index seen.
        i = -1  # an empty file then yields 0
        with open(fname) as f:
            for i, l in enumerate(f):
                pass
        return i + 1
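The same loop can also be written as a generator expression, which returns 0 for an empty file without any special handling. This is only a minimal sketch, not the original author's code: the name `file_len_sum` is hypothetical, and this variant was not part of the timings reported in this article.

```python
def file_len_sum(fname):
    """Count lines by summing 1 for every line the file iterator yields."""
    # Binary mode skips text decoding; iteration still splits on newline bytes.
    with open(fname, "rb") as f:
        return sum(1 for _ in f)
```

A final line without a trailing newline is still counted, because the file iterator yields it like any other line.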

A more efficient approach (buffered reading):

First, the timing results (average seconds per call over five runs):

    mapcount    : 0.471799945831
    simplecount : 0.634400033951
    bufcount    : 0.468800067902
    opcount     : 0.602999973297

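So on Windows with Python 2.6, the buffered-read strategy appears to be the fastest, although its margin over mapcount (about 0.003 seconds) is small.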

Here is the code:

    from __future__ import with_statement  # no-op on Python 2.6 and later
    import time
    import mmap
    from collections import defaultdict


    def mapcount(filename):
        # Memory-map the file and count successful readline() calls.
        # The default mapping is writable, hence the "r+" open mode.
        f = open(filename, "r+")
        buf = mmap.mmap(f.fileno(), 0)
        lines = 0
        readline = buf.readline  # loop optimization
        while readline():
            lines += 1
        buf.close()
        f.close()
        return lines


    def simplecount(filename):
        # Iterate over the file object line by line.
        lines = 0
        with open(filename) as f:
            for line in f:
                lines += 1
        return lines


    def bufcount(filename):
        # Read 1 MiB chunks and count the newline characters in each one.
        # A final line without a trailing newline is not counted.
        lines = 0
        buf_size = 1024 * 1024
        with open(filename) as f:
            read_f = f.read  # loop optimization
            buf = read_f(buf_size)
            while buf:
                lines += buf.count('\n')
                buf = read_f(buf_size)
        return lines


    def opcount(fname):
        # Let enumerate() track the index of the last line.
        i = -1  # an empty file then yields 0
        with open(fname) as f:
            for i, l in enumerate(f):
                pass
        return i + 1


    # Time each function five times on the same file and average the results.
    counts = defaultdict(list)
    for i in range(5):
        for func in [mapcount, simplecount, bufcount, opcount]:
            start_time = time.time()
            # 1209138 is the expected line count of the test file.
            assert func("big_file.txt") == 1209138
            counts[func].append(time.time() - start_time)

    for key, vals in counts.items():
        # Mean wall-clock time in seconds for each function.
        print("%s : %s" % (key.__name__, sum(vals) / float(len(vals))))
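For readers on Python 3, the buffered strategy used by bufcount above could be sketched roughly as follows. This is an illustration under stated assumptions, not the benchmarked code: the name `bufcount_bytes` and its `buf_size` parameter are hypothetical, the file is read in binary mode to avoid decoding overhead, and, unlike bufcount, a final line without a trailing newline is counted. No timings are claimed for it.

```python
def bufcount_bytes(filename, buf_size=1024 * 1024):
    """Count lines by reading fixed-size binary chunks and counting newlines."""
    lines = 0
    last = b""  # last byte seen so far; stays empty for an empty file
    with open(filename, "rb") as f:
        while True:
            buf = f.read(buf_size)
            if not buf:
                break
            lines += buf.count(b"\n")
            last = buf[-1:]
    # Assumption: a non-empty file whose last line lacks a trailing newline
    # should still have that line counted, as line iteration would do.
    if last and last != b"\n":
        lines += 1
    return lines
```

Counting `b"\n"` also handles Windows `\r\n` line endings, since each such line still ends in exactly one newline byte.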
