@X-blossom 2021-03-03T07:21:19.000000Z

Reading Large TDMS Files with Python

SIPM

TDMS file processing

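Introduction:

The TAO project needs to measure the properties of SiPMs in the dark, including the dark count rate, crosstalk, and afterpulsing. Unlike PDE measurements (~ $\mu s$ level), measuring the afterpulse probability relies on a large time window (~ s level). The experiment therefore takes data with a DT5751, with the waveform time window set to ~ $10^{5}$ ns (100009 ns) and 200000 events.
With the DT data storage format set to ASCII (that is, saving .txt files), the write event rate was about 50 Hz, so data taking took far too long and the data volume was far too large. After some testing, I chose to store events in binary format (that is, saving .tdms files): the event rate is about 180 Hz, data taking takes about 18 min, and a single file is about 38 GB.
So the first task after data taking is to "decode" the TDMS data. The process went as follows.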
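
As a quick sanity check on the numbers above, the acquisition time follows directly from the event count and the event rate. The sketch below is only arithmetic on the figures quoted in this section; the function name `acquisition_minutes` is made up for illustration.

```python
def acquisition_minutes(n_events, rate_hz):
    """Time in minutes needed to record n_events at a steady rate in Hz."""
    return n_events / rate_hz / 60.0


N_EVENTS = 200_000

ascii_minutes = acquisition_minutes(N_EVENTS, 50)   # ASCII (.txt) output
tdms_minutes = acquisition_minutes(N_EVENTS, 180)   # binary (.tdms) output

print(f"ASCII: {ascii_minutes:.1f} min, TDMS: {tdms_minutes:.1f} min")
```

The binary figure comes out at about 18.5 min, in line with the roughly 18 min observed, while ASCII output would need more than an hour for the same number of events.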

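Method 1:

A web search showed that TDMS data can be read directly by LabVIEW, and a quick try confirmed that it works. LabVIEW shows the TDMS structure, such as groups and channels (the data that actually matters for the experiment is the "ch0" channel under the "DGTZ" group). If the data set is small (fewer than $1 \times 10^{7}$ values), it can be converted (copied) directly to Excel or TXT, but for 38 GB of data this is clearly not practical.
Summary: LabVIEW can convert small data sets, but it cannot convert data at the GB scale, and it is slow.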

Method 2:

Convert with Python:
First, import the nptdms package:

from nptdms import TdmsFile
from nptdms import tdms

Then locate the data we need:

        tdms_file = TdmsFile(filenameS)         
        groups_data = tdms_file.groups()
        group = tdms_file["DGTZ"]  #name of group
        channel = group["ch0"] #name of channel
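
To see what a file contains before picking a group and channel, the hierarchy can be walked with `groups()` and `channels()`. The following is a minimal sketch, assuming an npTDMS version where groups and channels expose a `name` attribute; `summarize` and `list_channels` are helper names introduced here for illustration and are not part of the library.

```python
def summarize(tdms_file):
    """Return (group name, channel name, number of values) for every channel."""
    rows = []
    for group in tdms_file.groups():
        for channel in group.channels():
            rows.append((group.name, channel.name, len(channel)))
    return rows


def list_channels(path):
    """Load a TDMS file and print its group/channel layout."""
    # Imported here so that summarize() can be tried out without npTDMS installed.
    from nptdms import TdmsFile

    tdms_file = TdmsFile(path)  # loads all data, so only suitable for small files
    for group_name, channel_name, n_values in summarize(tdms_file):
        print(f"{group_name}/{channel_name}: {n_values} values")
```

For the data described here, calling `list_channels` on the small test file would be expected to show a `DGTZ/ch0` entry.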

I ran a local test on a 42 MB file ($20000 \times 1001 \approx 2 \times 10^{7}$ values): it took 190 s and produced a TXT file of about 130 MB.

After uploading to the server, the nptdms package has to be installed again under my personal directory:

        cd ~
        pip3 install --target=/junofs/... nptdms
        vi .bashrc
        #add export PYTHONPATH=$PYTHONPATH:/junofs/...
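
After editing `.bashrc` (and opening a new shell or running `source ~/.bashrc`), it is worth confirming that Python actually picks the package up from the `--target` directory. A small check using only the standard library; `module_location` is a helper name introduced here.

```python
import importlib.util


def module_location(name):
    """Return the file a module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Prints a path inside the --target directory if PYTHONPATH is set up correctly,
# or None if Python cannot find the package at all.
print(module_location("nptdms"))
```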

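OK, with everything in place, running the same program on the server on the same 42 MB file raised an error. After some self-doubt, I found that I had forgotten to upload the index file with the .tdms_index extension; after uploading it, everything worked.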
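
A check like the one below could catch a forgotten index file before a job is submitted. It is a sketch that assumes the index file follows the usual `<name>.tdms_index` naming next to each `<name>.tdms` file; `missing_index_files` is a hypothetical helper, not something provided by nptdms.

```python
from pathlib import Path


def missing_index_files(directory):
    """Return names of .tdms files in directory without a matching .tdms_index file."""
    missing = []
    for tdms_path in sorted(Path(directory).glob("*.tdms")):
        # "run1.tdms" is expected to sit next to "run1.tdms_index"
        index_path = tdms_path.with_name(tdms_path.name + "_index")
        if not index_path.exists():
            missing.append(tdms_path.name)
    return missing
```

An empty list means every data file in the directory has its index file alongside it.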
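The next step was to submit jobs in batch, but plans never keep up with bugs.
After submission, a job would run for two or three minutes and then go into the held state; my guess was that the data volume was so large that memory blew up. That left only two options: slice the 38 GB file, or find a way not to load everything into memory.
When in doubt, read the source. In tdms.py of the nptdms package I found this relevant comment: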

""" Reads and stores data from a TDMS file.
    There are two main ways to create a new TdmsFile object.
    TdmsFile.read will read all data into memory::
        tdms_file = TdmsFile.read(tdms_file_path)
    or you can use TdmsFile.open to read file metadata but not immediately read all data,
    for cases where a file is too large to easily fit in memory or you don't need to
    read data for all channels::
        with TdmsFile.open(tdms_file_path) as tdms_file:
            # Use tdms_file
            ...

Woohoo, liftoff~
So I changed the reading code to the following:

#       tdms_file = TdmsFile(filenameS)         
        tdms_file = TdmsFile.open(filenameS,raw_timestamps=False, memmap_dir=None)
        groups_data = tdms_file.groups()
        group = tdms_file["DGTZ"]  #name of group
        channel = group["ch0"] #name of channel

where TdmsFile.open is defined as follows:

    def open(file, raw_timestamps=False, memmap_dir=None):
        """ Creates a new TdmsFile object and reads metadata, leaving the file open
            to allow reading channel data
        :param file: Either the path to the tdms file to read
            as a string or pathlib.Path, or an already opened file.
        :param raw_timestamps: By default TDMS timestamps are read as numpy datetime64
            but this loses some precision.
            Setting this to true will read timestamps as a custom TdmsTimestamp type.
        :param memmap_dir: The directory to store memory mapped data files in,
            or None to read data into memory. The data files are created
            as temporary files and are deleted when the channel data is no
            longer used. tempfile.gettempdir() can be used to get the default
            temporary file directory.
        """
        return TdmsFile(
            file, raw_timestamps=raw_timestamps, memmap_dir=memmap_dir, read_metadata_only=True, keep_open=True)

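Testing showed this works, so the job could be submitted. I was not sure the conversion would finish within a day, though, so I submitted a medium-length job: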

 hep_sub -wt mid job_${InputDir}.sh -g juno

After that, it is just a matter of waiting patiently.

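Update:

Continuing from above: eight hours of conversion produced a 15 GB TXT file, so roughly 2 GB per hour. The full 38 GB binary file converts to about 120 GB of TXT, which would take 60 hours at this speed. Far too long! After some thought, the cause was probably the frequent I/O calls. The earlier code wrote every single value separately, with an increment and a check each time. The writing code was: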

        length=len(channel)
        for i in range (length):
                f.write(str(round(channel[i],3))+" ")#ie. 52.1111111-> 52.111
                if (i+1)%100009==0:     #100009 is the number of points in each waveform
                        f.write("\n")
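
The 120 GB and 60 hour figures come from scaling up the small test file. A minimal sketch of that arithmetic, using only numbers quoted in this post (42 MB of TDMS gave about 130 MB of TXT; 15 GB of TXT took eight hours):

```python
TDMS_TEST_MB = 42    # small test file
TXT_TEST_MB = 130    # TXT produced from it
TDMS_FULL_GB = 38    # full data file

expansion = TXT_TEST_MB / TDMS_TEST_MB    # TXT size per unit of TDMS
txt_full_gb = TDMS_FULL_GB * expansion    # expected TXT size for the full file
rate_gb_per_hour = 15 / 8                 # observed on the server
hours = txt_full_gb / rate_gb_per_hour

print(f"expansion x{expansion:.1f}, TXT ~{txt_full_gb:.0f} GB, ~{hours:.0f} h")
```

This gives roughly 118 GB and 63 hours, consistent with the rounded 120 GB and 60 hours quoted above.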

Some reading showed that slicing can be used instead, with the help of numpy:

        data = np.array(channel[:]) #read the whole channel into memory
        length=len(channel)
        print(length)
        data_piece=data.reshape(-1,100009) #one row per waveform (100009 points)
        event=length//100009 #number of waveforms
        f= open("/junofs/users/xiezq/2021_Sipm/dark/SenSL/data/"+sys.argv[1]+'.txt','w') #file to save waveform
        tt = time.time()
        for i in range(event):
                f.write(" ".join('%.6s'%v for v in data_piece[i])+'\n') #'%.6s' keeps 6 characters, e.g. "52.111"
        f.close()

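This way each data_piece[i] is an array of 100009 values and can be written to the file in a single call, which speeds things up.
A conversion test on the 40 MB TDMS file took about 20 s, a speedup of roughly 10 times.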
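
One caveat about the `'%.6s'` format used above: it converts the value to a string and keeps the first six characters, which is not the same as rounding to three decimals. The short sketch below, using plain Python floats chosen for illustration, shows where the two differ; `truncate6` and `fixed3` are names introduced here.

```python
def truncate6(value):
    """Format as in the conversion script: first 6 characters of str(value)."""
    return '%.6s' % value


def fixed3(value):
    """Alternative: always three digits after the decimal point."""
    return '%.3f' % value


for value in (52.1111111, 152.1111111, -5.123456, 1.5e-07):
    print(repr(value), truncate6(value), fixed3(value))
```

For `152.1111111` the truncated form keeps only two decimals (`152.11`), and for `1.5e-07` it cuts through the exponent (`1.5e-0`). If values like these can occur in the data, `'%.3f'` may be the safer choice, at the cost of slightly longer lines.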

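Converting the 38 GB file got held again; my guess is that the array is too large for slicing and similar operations, and memory cannot cope. Sigh!

No choice but to go back to the source code: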

    with TdmsFile.open(tdms_file_path) as tdms_file:
        channel = tdms_file[group_name][channel_name]
        for chunk in channel.data_chunks():
            channel_chunk_data = chunk[:]

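Here each chunk[:] is one row of data, that is, 100009 values. According to the documentation, the chunks are read one after another, so this no longer blows up memory.
Putting it all together, the final code is: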

import sys
import time
import numpy as np
from nptdms import TdmsFile
#import numba
#from numba import jit
#@jit(nopython=True)
def emp():
        base = "/junofs/users/xiezq/2021_Sipm/dark/SenSL/data/"+sys.argv[1]
        filenameS = base+".tdms" #name of tdms file
        tt = time.time()
        #TdmsFile.open reads only the metadata; the data is read chunk by chunk below
        with TdmsFile.open(filenameS) as tdms_file, open(base+'.txt','w') as f: #f: file to save waveform
                group = tdms_file["DGTZ"]  #name of group
                channel = group["ch0"] #name of channel
                for chunk in channel.data_chunks():
                        data = np.array(chunk[:]) #one chunk, i.e. one waveform in this data set
                        #keep the first 6 characters of each value, e.g. 52.1111111 -> 52.111
                        f.write(" ".join('%.6s'%v for v in data)+'\n')
        print('Time used: {} sec'.format(time.time()-tt))
emp()

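I also tried to speed things up with a JIT (numba), but some of the types could not be recognized, so I dropped it.
In testing, about 20 GB of TXT is generated per hour, so the conversion should finish in about 6 hours, which is acceptable.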
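
The core of the final script can also be written as a small function that accepts any iterable of chunks, which makes it easy to try out without a TDMS file. This is a sketch under the assumption, observed above for this data set, that each chunk holds whole waveforms; `write_chunks` and its parameters are names introduced here, not part of npTDMS.

```python
import io


def write_chunks(chunks, out, samples_per_row=None):
    """Write each chunk as space-separated text, one row per line.

    chunks: any iterable of sequences, for example the arrays obtained
        from chunk[:] while looping over channel.data_chunks().
    samples_per_row: if given, split every chunk into rows of this length;
        otherwise one chunk becomes one row.
    Returns the number of rows written.
    """
    rows = 0
    for chunk in chunks:
        values = list(chunk)
        if not values:
            continue  # nothing to write for an empty chunk
        step = samples_per_row or len(values)
        for start in range(0, len(values), step):
            row = values[start:start + step]
            out.write(" ".join('%.6s' % v for v in row) + "\n")
            rows += 1
    return rows


# Stand-in data: two chunks, each holding two "waveforms" of three samples.
demo = io.StringIO()
n_rows = write_chunks([[1.0, 2.5, 3.25, 4.0, 5.5, 6.75]] * 2, demo, samples_per_row=3)
print(n_rows)
print(demo.getvalue(), end="")
```

With npTDMS, the call inside the `with TdmsFile.open(...)` block would look like `write_chunks((chunk[:] for chunk in channel.data_chunks()), f)`. Passing a generator keeps only one chunk in memory at a time, which is the property that avoided the held jobs.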

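Conclusion:

Converting with Python works; for large data files, watch the memory. Frequent I/O reads and writes slow things down, and slicing helps, but reading the source code first is better and saves a lot of wasted effort.
There are also a few C++ examples online, for example on GitHub, but the current conversion rate of 20 GB/hour is acceptable, so I will not bother with them.
If you happen to come across this, I hope it helps.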