用python 读取txt文件并保存为array
Reading Text Tables with Python
Reading tables is a pretty common thing to do and there are a number of ways to read tables besides writing a read function yourself. That’s not to say these are magic bullets. Every table is different and can have its own eccentricities. If you find yourself reading the same type of quirky file over and over again it could be worth your effort to write your own reader that does things just the way you like. That said, here are some other options.
numpy.loadtxt
numpy.loadtxt is a very simple reader. There are ways to make it do some advanced things like handle missing data or read non-numeric columns but they are all a bit tedious so this function is best used with well behaved tables.
numpy.loadtxt has a couple of useful keywords. Use the skiprows
keyword to skip header lines. By default numpy.loadtxt will split columns on white space but you can specify other separators using the
delimiter
keyword. If you want select only certain columns from the table use the
usecols
keyword. That can be useful if you want to skip a text column.
Normally the data is returned as one large 2D array but setting unpack=True
will return the the columns as individual arrays.
Examples
Reading a well formatted, white space delimited table into a single array:
>>> np.loadtxt("data_table.txt", skiprows=1) array([[ 0.2536, 0.1008, 0.3857], [ 0.4839, 0.4536, 0.3561], [ 0.1292, 0.6875, 0.5929], [ 0.1781, 0.3049, 0.8928], [ 0.6253, 0.3486, 0.8791]])
Reading a well formatted, white space delimited table into a three arrays:
>>> a,b,c = np.loadtxt("data_table.txt", skiprows=1, unpack=True) >>> a array([ 0.2536, 0.4839, 0.1292, 0.1781, 0.6253]) >>> b array([ 0.1008, 0.4536, 0.6875, 0.3049, 0.3486]) >>> c array([ 0.3857, 0.3561, 0.5929, 0.8928, 0.8791])
Reading a table with NAN
values:
>>> np.loadtxt("data_table2.txt", skiprows=1) array([[ 0.4839, 0.4536, 0.3561], [ 0.1292, 0.6875, nan], [ 0.1781, 0.3049, 0.8928], [ nan, 0.5801, 0.2038], [ 0.5993, 0.4357, 0.741 ]])
Reading a nicely formatted CSV file. Skip the first column since it contains strings:
>>> np.loadtxt("exoplanetData_clean.csv", skiprows=1, delimiter=",", usecols=(1,2,3)) array([[ 0.2 , 0.33 , 29.329 ], [ 9.1 , 1.29 , 60.3251], [ 17. , 0.96 , 143.213 ], [ 6.8 , 0.38 , 20.8613], [ 4.7 , 0.38 , 42.6318]])
numpy.genfromtxt
numpy.genfromtxt is a more robust version of
loadtxt that can better handle missing data. It supports most of the same keywords as loadtxt except it has
skip_header
instead of skiprows
.
Examples
numpy.genfromtxt will automatically convert character strings in numeric columns to
nan
:
>>> np.genfromtxt("data_table3.txt", skip_header=1) array([[ 0.4839, 0.4536, 0.3561], [ 0.1292, 0.6875, nan], [ 0.1781, 0.3049, 0.8928], [ nan, 0.5801, 0.2038], [ 0.5993, 0.4357, 0.741 ]])
But you can also specify the missing value string and substitution values:
>>> np.genfromtxt("data_table3.txt", skip_header=1, missing_values=("MISSING","MISSING","MISSING"), filling_values=(-999,-999,-999)) array([[ 4.83900000e-01, 4.53600000e-01, 3.56100000e-01], [ 1.29200000e-01, 6.87500000e-01, -9.99000000e+02], [ 1.78100000e-01, 3.04900000e-01, 8.92800000e-01], [ -9.99000000e+02, 5.80100000e-01, 2.03800000e-01], [ 5.99300000e-01, 4.35700000e-01, 7.41000000e-01]])
When you set dtype=None
numpy.genfromtxt will attempt to determine the type of each column, even detecting string columns. If you set
names=True
numpy.genfromtxt will take the column names from the header line:
>>> np.genfromtxt("exoplanetData_clean.csv", names=True, delimiter=",", dtype=None) array([("Kepler-30 b", 0.2, 0.33, 29.329), ("Kepler-30 c", 9.1, 1.29, 60.3251), ("Kepler-30 d", 17.0, 0.96, 143.213), ("Kepler-31 b", 6.8, 0.38, 20.8613), ("Kepler-31 c", 4.7, 0.38, 42.6318)], dtype=[("Planet_Name", "|S11"), ("Pl_Mass", "<f8"), ("Pl_Radius", "<f8"), ("Pl_Period", "<f8")]) >>> a = np.genfromtxt("exoplanetData_clean.csv", names=True, delimiter=",", dtype=None) >>> a["Pl_Mass"] array([ 0.2, 9.1, 17. , 6.8, 4.7])
Setting usemask=True
will return a
masked array where missing data is masked:
>>> np.genfromtxt("exoplanetData.csv", delimiter=",", skip_header=1, dtype=None, usemask=True) masked_array(data = [("Kepler-32 b", 4.1, 0.37, 5.90124) ("Kepler-32 c", 0.5, 0.33, 8.7522) ("Kepler-33 b", --, 0.16, 5.66793) ("Kepler-33 c", --, 0.29, 13.17562)], mask = [(False, False, False, False) (False, False, False, False) (False, True, False, False) (False, True, False, False)], fill_value = ("N/A", 1e+20, 1e+20, 1e+20), dtype = [("f0", "|S11"), ("f1", "<f8"), ("f2", "<f8"), ("f3", "<f8")])
asciitable
asciitable is a third-party Python tool for reading text files. Installation instructions are on the asciitable website. asciitable can read and write text tables in a variety of formats.
One nice thing about asciitable is that it will try to guess the format of your table so you can type less when reading in most tables. For non-standard tables you can define specialized reader classes that will be used to interpret the table.
Examples
Reading files with no missing data:
>>> asciitable.read("data_table.txt") rec.array([(0.2536, 0.1008, 0.3857), (0.4839, 0.4536, 0.3561), (0.1292, 0.6875, 0.5929), (0.1781, 0.3049, 0.8928), (0.6253, 0.3486, 0.8791)], dtype=[("Value1", "<f8"), ("Value2", "<f8"), ("Value3", "<f8")]) >>> asciitable.read("exoplanetData_clean.csv") rec.array([("Kepler-30 b", 0.2, 0.33, 29.329), ("Kepler-30 c", 9.1, 1.29, 60.3251), ("Kepler-30 d", 17.0, 0.96, 143.213), ("Kepler-31 b", 6.8, 0.38, 20.8613), ("Kepler-31 c", 4.7, 0.38, 42.6318)], dtype=[("Planet Name", "|S11"), ("Pl. Mass", "<f8"), ("Pl. Radius", "<f8"), ("Pl. Period", "<f8")])
Notice that it identified the string column without having to specify any special keywords.
When you specify the value of missing data using the fill_values
keyword asciitable will by default return a
masked array:
>>> asciitable.read("data_table3.txt", fill_values=("MISSING",-999)) masked_rec.array(data = [(0.4839, 0.4536, 0.3561) (0.1292, 0.6875, --) (0.1781, 0.3049, 0.8928) (--, 0.5801, 0.2038) (0.5993, 0.4357, 0.741)], mask = [(False, False, False) (False, False, True) (False, False, False) (True, False, False) (False, False, False)], fill_value = (1e+20, 1e+20, 1e+20), dtype = [("Value1", "<f8"), ("Value2", "<f8"), ("Value3", "<f8")]) >>> asciitable.read("exoplanetData.csv", fill_values=("", "nan")) masked_rec.array(data = [("Kepler-32 b", 4.1, 0.37, 5.90124) ("Kepler-32 c", 0.5, 0.33, 8.7522) ("Kepler-33 b", --, 0.16, 5.66793) ("Kepler-33 c", --, 0.29, 13.17562)], mask = [(False, False, False, False) (False, False, False, False) (False, True, False, False) (False, True, False, False)], fill_value = ("N/A", 1e+20, 1e+20, 1e+20), dtype = [("Planet Name", "|S11"), ("Pl. Mass", "<f8"), ("Pl. Radius", "<f8"), ("Pl. Period", "<f8")])
ATpy
If you want to get really fancy there’s atpy. atpy can read a huge number of table formats assuming you have the correct reader installed. For example, reading fits tables requires pyfits and reading basic text tables requires asciitable. What atpy gives you is a powerful Table class for working with your data.
Read more about atpy and asciitable at this AstroBetter blog post: http://www.astrobetter.com/atpy-and-asciitable/. atpy and asciitable can also be used to write table data files.
Examples
>>> t = atpy.Table("data_table.txt", type="ascii") >>> t.data array([(0.2536, 0.1008, 0.3857), (0.4839, 0.4536, 0.3561), (0.1292, 0.6875, 0.5929), (0.1781, 0.3049, 0.8928), (0.6253, 0.3486, 0.8791)], dtype=[("Value1", "<f8"), ("Value2", "<f8"), ("Value3", "<f8")]) >>> t.columns { Value1 : type=float64, Value2 : type=float64, Value3 : type=float64 } >>> t.Value1 array([ 0.2536, 0.4839, 0.1292, 0.1781, 0.6253]) >>> t.row(0) (0.2536, 0.1008, 0.3857)
The Table
class even has a
cool where() method:
>>> t = atpy.Table("exoplanetData_clean.csv", type="ascii") >>> t.data array([("Kepler-30 b", 0.2, 0.33, 29.329), ("Kepler-30 c", 9.1, 1.29, 60.3251), ("Kepler-30 d", 17.0, 0.96, 143.213), ("Kepler-31 b", 6.8, 0.38, 20.8613), ("Kepler-31 c", 4.7, 0.38, 42.6318)], dtype=[("Planet Name", "|S11"), ("Pl. Mass", "<f8"), ("Pl. Radius", "<f8"), ("Pl. Period", "<f8")]) >>> t_new = t.where((t["Pl. Mass"] > 5) & (t["Pl. Period"] < 100)) >>> t_new.data array([("Kepler-30 c", 9.1, 1.29, 60.3251), ("Kepler-31 b", 6.8, 0.38, 20.8613)], dtype=[("Planet Name", "|S11"), ("Pl. Mass", "<f8"), ("Pl. Radius", "<f8"), ("Pl. Period", "<f8")])
When reading text tables with atpy you can pass the same arguments you would have passed to asciitable, as these are all just passed onto
asciitable.read()
under the hood:
>>> t = atpy.Table("exoplanetData.csv", type="ascii", fill_values=("", "nan")) >>> t.data array([("Kepler-32 b", 4.1, 0.37, 5.90124), ("Kepler-32 c", 0.5, 0.33, 8.7522), ("Kepler-33 b", nan, 0.16, 5.66793), ("Kepler-33 c", nan, 0.29, 13.17562)], dtype=[("Planet Name", "|S11"), ("Pl. Mass", "<f8"), ("Pl. Radius", "<f8"), ("Pl. Period", "<f8")])
Notice that the data are not a masked array. Be aware that comparisons to
nan
always return False
so if you do a where
on a column containing
nan
values those rows will never be part of your result.
Data Files
Here are the data files used in the examples above.
data_table.txt:
Value1 Value2 Value3 0.2536 0.1008 0.3857 0.4839 0.4536 0.3561 0.1292 0.6875 0.5929 0.1781 0.3049 0.8928 0.6253 0.3486 0.8791
data_table2.txt:
Value1 Value2 Value3 0.4839 0.4536 0.3561 0.1292 0.6875 NAN 0.1781 0.3049 0.8928 NAN 0.5801 0.2038 0.5993 0.4357 0.7410
data_table3.txt:
Value1 Value2 Value3 0.4839 0.4536 0.3561 0.1292 0.6875 MISSING 0.1781 0.3049 0.8928 MISSING 0.5801 0.2038 0.5993 0.4357 0.7410
exoplanetData_clean.csv:
Planet Name,Pl. Mass,Pl. Radius,Pl. Period Kepler-30 b,0.2,0.33,29.329 Kepler-30 c,9.1,1.29,60.3251 Kepler-30 d,17,0.96,143.213 Kepler-31 b,6.8,0.38,20.8613 Kepler-31 c,4.7,0.38,42.6318
exoplanetData.csv (contains missing data):
Planet Name,Pl. Mass,Pl. Radius,Pl. Period Kepler-32 b,4.1,0.37,5.90124 Kepler-32 c,0.5,0.33,8.7522 Kepler-33 b,,0.16,5.66793 Kepler-33 c,,0.29,13.17562
- 上一篇: Python读写txt文本文件的操作方法全解析
- 下一篇: 一句话彻底明白java乱码问题