Training script internals

The main neural network training script, steps/nnet/train.sh, is invoked as follows:

steps/nnet/train.sh <data-train> <data-dev> <lang-dir> <ali-train> <ali-dev> <exp-dir>

The NN input features are taken from the data directories <data-train> and <data-dev>, while the training targets are taken from the directories <ali-train> and <ali-dev>. The directory <lang-dir> is used only in the special case of the LDA feature transform, and to generate phone-frame statistics from the alignments, which is not critical for the training. The outputs (i.e., the trained networks and the log files) are stored in <exp-dir>.
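For illustration, a typical call in a WSJ-style recipe might look as follows (the directory names here are hypothetical):

steps/nnet/train.sh data/train_si284 data/test_dev93 data/lang \
  exp/tri4b_ali_si284 exp/tri4b_ali_dev93 exp/dnn5b_pretrain-dbn_dnn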

Internally, the script prepares the feature and target pipelines, generates a neural-network prototype and its initialization, creates the feature transform, and calls the scheduler script steps/nnet/train_scheduler.sh, which runs the training epochs and controls the learning rate.

Looking inside steps/nnet/train.sh, we see:

  1. CUDA is required; the script exits if no GPU is detected or if CUDA was not compiled in. (You can still run on a CPU by passing '--skip-cuda-check true', but it is 10-20x slower.)
  2. The alignment pipelines are prepared. The training tool expects the targets in posterior format, hence ali-to-post.cc is used:

labels_tr="ark:ali-to-pdf $alidir/final.mdl \"ark:gunzip -c $alidir/ali.*.gz |\" ark:- | ali-to-post ark:- ark:- |"
labels_cv="ark:ali-to-pdf $alidir/final.mdl \"ark:gunzip -c $alidir_cv/ali.*.gz |\" ark:- | ali-to-post ark:- ark:- |"
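To see what the targets look like, the first part of such a pipeline can be run by hand to print a few pdf labels in text form (a hypothetical check; the paths reuse those from the training log shown further below):

ali-to-pdf exp/tri4b_ali_si284/final.mdl \
  "ark:gunzip -c exp/tri4b_ali_si284/ali.1.gz |" ark,t:- | head -n 1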

  3. The shuffled features are re-saved to /tmp/???/...; this can be disabled with '--copy-feats false', and the directory can be changed with '--copy-feats-tmproot <dir>'.
    • The features are re-saved to local disk with a shuffled utterance list, which significantly reduces disk seeking during training, since the features are then read sequentially rather than by many random accesses.
  4. The feature pipelines are prepared:

# begins with copy-feats:
feats_tr="ark:copy-feats scp:$dir/train.scp ark:- |"
feats_cv="ark:copy-feats scp:$dir/cv.scp ark:- |"
# optionally apply-cmvn is appended:
feats_tr="$feats_tr apply-cmvn --print-args=false --norm-vars=$norm_vars --utt2spk=ark:$data/utt2spk scp:$data/cmvn.scp ark:- ark:- |"
feats_cv="$feats_cv apply-cmvn --print-args=false --norm-vars=$norm_vars --utt2spk=ark:$data_cv/utt2spk scp:$data_cv/cmvn.scp ark:- ark:- |"
# optionally add-deltas is appended:
feats_tr="$feats_tr add-deltas --delta-order=$delta_order ark:- ark:- |"
feats_cv="$feats_cv add-deltas --delta-order=$delta_order ark:- ark:- |"
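A quick way to verify that an assembled pipeline works is to print its output dimension with feat-to-dim; below is a shortened, hypothetical variant of the pipeline above:

feats="ark:copy-feats scp:$dir/train.scp ark:- | add-deltas --delta-order=2 ark:- ark:- |"
feat-to-dim "$feats" -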

  5. The feature transform is prepared:
    • The feature transform is a fixed function applied in the DNN frontend and computed on the GPU. Typically it performs dimensionality expansion; this makes it possible to keep low-dimensional features on disk and high-dimensional features in the DNN frontend, which saves both disk space and read throughput (see the check sketched after this item).
    • Most of the nnet binaries have the option '--feature-transform'.
    • It is generated according to the option '--feat-type', whose values are (plain|traps|transf|lda).
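The dimensionality expansion can be observed by applying the feature transform with nnet-forward and measuring the output dimension; a hypothetical check, reusing the $feats pipeline sketched above:

nnet-forward exp/dnn5b_pretrain-dbn_dnn/final.feature_transform "$feats" ark:- | feat-to-dim ark:- -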
  6. The network prototype is generated by utils/nnet/make_nnet_proto.py:
    • Each component is on a separate line, where the dimensions and the initialization hyper-parameters are specified;
    • For AffineTransform, the bias is initialized from a uniform distribution given by <BiasMean> and <BiasRange>, while the weights are initialized from a normal distribution scaled by <ParamStddev>;
    • Note: if you would like to experiment with an externally prepared NN prototype, use the option '--mlp-proto';

$ cat exp/dnn5b_pretrain-dbn_dnn/nnet.proto
<NnetProto>
<AffineTransform> <InputDim> 2048 <OutputDim> 3370 <BiasMean> 0.000000 <BiasRange> 0.000000 <ParamStddev> 0.067246
<Softmax> <InputDim> 3370 <OutputDim> 3370
</NnetProto>
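A prototype like the one above could be generated by a call along the following lines, assuming the usual positional arguments <feat-dim> <num-leaves> <num-hid-layers> <num-hid-neurons> (here the hidden layers come from the DBN, hence 0):

utils/nnet/make_nnet_proto.py 2048 3370 0 1024 >nnet.proto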

  7. The network is initialized by nnet-initialize.cc; in the next step, the DBN is prepended to it using nnet-concat.cc.
  8. Finally, the training is run by calling the scheduler script steps/nnet/train_scheduler.sh.

Note: both the neural networks and the feature transforms can be viewed with nnet-info.cc, or shown in ASCII with nnet-copy.cc.
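For example (the file names are illustrative):

nnet-initialize nnet.proto nnet.init               # random initialization from the prototype
nnet-concat final.dbn nnet.init nnet_dbn_dnn.init  # prepend the pre-trained DBN
nnet-info nnet_dbn_dnn.init                        # per-component summary of the network
nnet-copy --binary=false nnet_dbn_dnn.init -       # dump the network in ASCII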

Looking inside steps/nnet/train_scheduler.sh, we see:

It begins with an initial cross-validation run, followed by the main for-loop over $iter, which runs the epochs and controls the learning rate. Typically, train_scheduler.sh is called from train.sh.

  • The default learning-rate scheduling is based on the relative improvement of the objective function (a condensed sketch follows this list):
    • Initially, the learning rate is kept constant as long as the improvement is larger than 'start_halving_impr=0.01';
    • Then the learning rate is multiplied by 'halving_factor=0.5' in each following epoch;
    • Finally, the training is terminated once the improvement is smaller than 'end_halving_impr=0.001'.
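Here is a condensed sketch of this logic in shell, with variable names mirroring train_scheduler.sh (the real script also rejects a model whenever the cross-validation loss gets worse, which is omitted here):

# $loss_prev holds the loss of the initial cross-validation run,
learn_rate=0.008
max_iters=20
start_halving_impr=0.01
end_halving_impr=0.001
halving_factor=0.5
halving=0
for iter in $(seq $max_iters); do
  # ... train one epoch with nnet-train-frmshuff, measure $loss on the CV set ...
  rel_impr=$(bc <<< "scale=10; ($loss_prev - $loss) / $loss_prev")
  # once halving has started, shrink the learning rate after every epoch,
  [ $halving == 1 ] && learn_rate=$(bc <<< "scale=10; $learn_rate * $halving_factor")
  # start halving when the relative improvement gets small,
  [ 1 == $(bc <<< "$rel_impr < $start_halving_impr") ] && halving=1
  # stop entirely once the improvement is negligible.
  [ 1 == $(bc <<< "$rel_impr < $end_halving_impr") ] && break
  loss_prev=$loss
done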

The neural networks are stored in $dir/nnet, and the logs are stored in $dir/log:

  1. The network names contain the epoch number, the learning rate, and the objective-function values on the training and cross-validation sets.
    • We can see that the learning rate started halving from epoch 5, which is a common case.

$ ls exp/dnn5b_pretrain-dbn_dnn/nnet
nnet_6.dbn_dnn_iter01_learnrate0.008_tr1.1919_cv1.5895
nnet_6.dbn_dnn_iter02_learnrate0.008_tr0.9566_cv1.5289
nnet_6.dbn_dnn_iter03_learnrate0.008_tr0.8819_cv1.4983
nnet_6.dbn_dnn_iter04_learnrate0.008_tr0.8347_cv1.5097_rejected
nnet_6.dbn_dnn_iter05_learnrate0.004_tr0.8255_cv1.3760
nnet_6.dbn_dnn_iter06_learnrate0.002_tr0.7920_cv1.2981
nnet_6.dbn_dnn_iter07_learnrate0.001_tr0.7803_cv1.2412
...
nnet_6.dbn_dnn_iter19_learnrate2.44141e-07_tr0.7770_cv1.1448
nnet_6.dbn_dnn_iter20_learnrate1.2207e-07_tr0.7769_cv1.1446
nnet_6.dbn_dnn_iter20_learnrate1.2207e-07_tr0.7769_cv1.1446_final

  2. There are separate logs for the training and the cross-validation set. Each log file shows the command line:

$ cat exp/dnn5b_pretrain-dbn_dnn/log/iter01.tr.log
nnet-train-frmshuff --learn-rate=0.008 --momentum=0 --l1-penalty=0 --l2-penalty=0 --minibatch-size=256 --randomizer-size=32768 --randomize=true --verbose=1 --binary=true --feature-transform=exp/dnn5b_pretrain-dbn_dnn/final.feature_transform --randomizer-seed=777 'ark:copy-feats scp:exp/dnn5b_pretrain-dbn_dnn/train.scp ark:- |' 'ark:ali-to-pdf exp/tri4b_ali_si284/final.mdl "ark:gunzip -c exp/tri4b_ali_si284/ali.*.gz |" ark:- | ali-to-post ark:- ark:- |' exp/dnn5b_pretrain-dbn_dnn/nnet_6.dbn_dnn.init exp/dnn5b_pretrain-dbn_dnn/nnet/nnet_6.dbn_dnn_iter01

followed by the information about the GPU that was used:

LOG (nnet-train-frmshuff:IsComputeExclusive():cu-device.cc:214) CUDA setup operating under Compute Exclusive Process Mode.
LOG (nnet-train-frmshuff:FinalizeActiveGpu():cu-device.cc:174) The active GPU is [1]: GeForce GTX 780 Ti  free:2974M, used:97M, total:3071M, free/total:0.968278 version 3.5
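The 'Compute Exclusive Process Mode' seen above is a GPU driver setting rather than a Kaldi option; on most systems it can be enabled with nvidia-smi (requires root):

sudo nvidia-smi -c EXCLUSIVE_PROCESS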

Next come the internal statistics of the neural-network training, which are prepared by the functions Nnet::InfoPropagate, Nnet::InfoBackPropagate and Nnet::InfoGradient. They are printed at the beginning of the epoch, and a second time at its end. Note that the per-component statistics become particularly handy when implementing new features of the network training, as we can compare the values against references or expected values:

VLOG[1] (nnet-train-frmshuff:main():nnet-train-frmshuff.cc:236) ### After 0 frames,
VLOG[1] (nnet-train-frmshuff:main():nnet-train-frmshuff.cc:237) ### Forward propagation buffer content :
[1] output of <Input> ( min -6.1832, max 7.46296, mean 0.00260791, variance 0.964268, skewness -0.0622335, kurtosis 2.18525 )
[2] output of <AffineTransform> ( min -18.087, max 11.6435, mean -3.37778, variance 3.2801, skewness -3.40761, kurtosis 11.813 )
[3] output of <Sigmoid> ( min 1.39614e-08, max 0.999991, mean 0.085897, variance 0.0249875, skewness 4.65894, kurtosis 20.5913 )
[4] output of <AffineTransform> ( min -17.3738, max 14.4763, mean -2.69318, variance 2.08086, skewness -3.53642, kurtosis 13.9192 )
[5] output of <Sigmoid> ( min 2.84888e-08, max 0.999999, mean 0.108987, variance 0.0215204, skewness 4.78276, kurtosis 21.6807 )
[6] output of <AffineTransform> ( min -16.3061, max 10.9503, mean -3.65226, variance 2.49196, skewness -3.26134, kurtosis 12.1138 )
[7] output of <Sigmoid> ( min 8.28647e-08, max 0.999982, mean 0.0657602, variance 0.0212138, skewness 5.18622, kurtosis 26.2368 )
[8] output of <AffineTransform> ( min -19.9429, max 12.5567, mean -3.64982, variance 2.49913, skewness -3.2291, kurtosis 12.3174 )
[9] output of <Sigmoid> ( min 2.1823e-09, max 0.999996, mean 0.0671024, variance 0.0216422, skewness 5.07312, kurtosis 24.9565 )
[10] output of <AffineTransform> ( min -16.79, max 11.2748, mean -4.03986, variance 2.15785, skewness -3.13305, kurtosis 13.9256 )
[11] output of <Sigmoid> ( min 5.10745e-08, max 0.999987, mean 0.0492051, variance 0.0194567, skewness 5.73048, kurtosis 32.0733 )
[12] output of <AffineTransform> ( min -24.0731, max 13.8856, mean -4.00245, variance 2.16964, skewness -3.14425, kurtosis 16.7714 )
[13] output of <Sigmoid> ( min 3.50889e-11, max 0.999999, mean 0.0501351, variance 0.0200421, skewness 5.67209, kurtosis 31.1902 )
[14] output of <AffineTransform> ( min -2.53919, max 2.62531, mean -0.00363421, variance 0.209117, skewness -0.0302545, kurtosis 0.63143 )
[15] output of <Softmax> ( min 2.01032e-05, max 0.00347782, mean 0.000296736, variance 2.08593e-08, skewness 6.14324, kurtosis 35.6034 )
VLOG[1] (nnet-train-frmshuff:main():nnet-train-frmshuff.cc:239) ### Backward propagation buffer content :
[1] diff-output of <AffineTransform> ( min -0.0256142, max 0.0447016, mean 1.60589e-05, variance 7.34959e-07, skewness 1.50607, kurtosis 97.2922 )
[2] diff-output of <Sigmoid> ( min -0.10395, max 0.20643, mean -2.03144e-05, variance 5.40825e-05, skewness 0.226897, kurtosis 10.865 )
[3] diff-output of <AffineTransform> ( min -0.0246385, max 0.033782, mean 1.49055e-05, variance 7.2849e-07, skewness 0.71967, kurtosis 47.0307 )
[4] diff-output of <Sigmoid> ( min -0.137561, max 0.177565, mean -4.91158e-05, variance 4.85621e-05, skewness 0.020871, kurtosis 7.7897 )
[5] diff-output of <AffineTransform> ( min -0.0311345, max 0.0366407, mean 1.38255e-05, variance 7.76937e-07, skewness 0.886642, kurtosis 70.409 )
[6] diff-output of <Sigmoid> ( min -0.154734, max 0.166145, mean -3.83602e-05, variance 5.84839e-05, skewness 0.127536, kurtosis 8.54924 )
[7] diff-output of <AffineTransform> ( min -0.0236995, max 0.0353677, mean 1.29041e-05, variance 9.17979e-07, skewness 0.710979, kurtosis 48.1876 )
[8] diff-output of <Sigmoid> ( min -0.103117, max 0.146624, mean -3.74798e-05, variance 6.17777e-05, skewness 0.0458594, kurtosis 8.37983 )
[9] diff-output of <AffineTransform> ( min -0.0249271, max 0.0315759, mean 1.0794e-05, variance 1.2015e-06, skewness 0.703888, kurtosis 53.6606 )
[10] diff-output of <Sigmoid> ( min -0.147389, max 0.131032, mean -0.00014309, variance 0.000149306, skewness 0.0190403, kurtosis 5.48604 )
[11] diff-output of <AffineTransform> ( min -0.057817, max 0.0662253, mean 2.12237e-05, variance 1.21929e-05, skewness 0.332498, kurtosis 35.9619 )
[12] diff-output of <Sigmoid> ( min -0.311655, max 0.331862, mean 0.00031612, variance 0.00449583, skewness 0.00369107, kurtosis -0.0220473 )
[13] diff-output of <AffineTransform> ( min -0.999905, max 0.00347782, mean -1.33212e-12, variance 0.00029666, skewness -58.0197, kurtosis 3364.53 )
VLOG[1] (nnet-train-frmshuff:main():nnet-train-frmshuff.cc:240) ### Gradient stats :
Component 1 : <AffineTransform>,
  linearity_grad ( min -0.204042, max 0.190719, mean 0.000166458, variance 0.000231224, skewness 0.00769091, kurtosis 5.07687 )
  bias_grad ( min -0.101453, max 0.0885828, mean 0.00411107, variance 0.000271452, skewness 0.728702, kurtosis 3.7276 )
Component 2 : <Sigmoid>,
Component 3 : <AffineTransform>,
  linearity_grad ( min -0.108358, max 0.0843307, mean 0.000361943, variance 8.64557e-06, skewness 1.0407, kurtosis 21.355 )
  bias_grad ( min -0.0658942, max 0.0973828, mean 0.0038158, variance 0.000288088, skewness 0.68505, kurtosis 1.74937 )
Component 4 : <Sigmoid>,
Component 5 : <AffineTransform>,
  linearity_grad ( min -0.186918, max 0.141044, mean 0.000419367, variance 9.76016e-06, skewness 0.718714, kurtosis 40.6093 )
  bias_grad ( min -0.167046, max 0.136064, mean 0.00353932, variance 0.000322016, skewness 0.464214, kurtosis 8.90469 )
Component 6 : <Sigmoid>,
Component 7 : <AffineTransform>,
  linearity_grad ( min -0.134063, max 0.149993, mean 0.000249893, variance 9.18434e-06, skewness 1.61637, kurtosis 60.0989 )
  bias_grad ( min -0.165298, max 0.131958, mean 0.00330344, variance 0.000438555, skewness 0.739655, kurtosis 6.9461 )
Component 8 : <Sigmoid>,
Component 9 : <AffineTransform>,
  linearity_grad ( min -0.264095, max 0.27436, mean 0.000214027, variance 1.25338e-05, skewness 0.961544, kurtosis 184.881 )
  bias_grad ( min -0.28208, max 0.273459, mean 0.00276327, variance 0.00060129, skewness 0.149445, kurtosis 21.2175 )
Component 10 : <Sigmoid>,
Component 11 : <AffineTransform>,
  linearity_grad ( min -0.877651, max 0.811671, mean 0.000313385, variance 0.000122102, skewness -1.06983, kurtosis 395.3 )
  bias_grad ( min -1.01687, max 0.640236, mean 0.00543326, variance 0.00977744, skewness -0.473956, kurtosis 14.3907 )
Component 12 : <Sigmoid>,
Component 13 : <AffineTransform>,
  linearity_grad ( min -22.7678, max 0.0922921, mean -5.66685e-11, variance 0.00451415, skewness -151.169, kurtosis 41592.4 )
  bias_grad ( min -22.8996, max 0.170164, mean -8.6555e-10, variance 0.421778, skewness -27.1075, kurtosis 884.01 )
Component 14 : <Softmax>,

There is also a summary log with the objective-function value on the whole dataset, its progress vector (per-chunk objective-function values accumulated over the epoch), and the frame accuracy:

LOG (nnet-train-frmshuff:main():nnet-train-frmshuff.cc:273) Done 34432 files, 21 with no tgt_mats, 0 with other errors. [TRAINING, RANDOMIZED, 50.8057 min, fps8961.77]
LOG (nnet-train-frmshuff:main():nnet-train-frmshuff.cc:282) AvgLoss: 1.19191 (Xent), [AvgXent: 1.19191, AvgTargetEnt: 0]
progress: [3.09478 1.92798 1.702 1.58763 1.49913 1.45936 1.40532 1.39672 1.355 1.34153 1.32753 1.30449 1.2725 1.2789 1.26154 1.25145 1.21521 1.24302 1.21865 1.2491 1.21729 1.19987 1.18887 1.16436 1.14782 1.16153 1.1881 1.1606 1.16369 1.16015 1.14077 1.11835 1.15213 1.11746 1.10557 1.1493 1.09608 1.10037 1.0974 1.09289 1.11857 1.09143 1.0766 1.08736 1.10586 1.08362 1.0885 1.07366 1.08279 1.03923 1.06073 1.10483 1.0773 1.0621 1.06251 1.07252 1.06945 1.06684 1.08892 1.07159 1.06216 1.05492 1.06508 1.08979 1.05842 1.04331 1.05885 1.05186 1.04255 1.06586 1.02833 1.06131 1.01124 1.03413 0.997029 ]
FRAME_ACCURACY >> 65.6546% <<

The log file ends with a profile of the CUDA calls; AddMatMat is the matrix multiplication, and it takes most of the time:

[cudevice profile]
Destroy  23.0389s
AddVec  24.0874s
CuMatrixBase::CopyFromMat(from other CuMatrixBase)  29.5765s
AddVecToRows  29.7164s
CuVector::SetZero  37.7405s
DiffSigmoid  37.7669s
CuMatrix::Resize  41.8662s
FindRowMaxId  42.1923s
Sigmoid  48.6683s
CuVector::Resize  56.4445s
AddRowSumMat  75.0928s
CuMatrix::SetZero  86.5347s
CuMatrixBase::CopyFromMat(from CPU)  166.27s
AddMat  174.307s
AddMatMat  1922.11s

Running steps/nnet/train_scheduler.sh directly:

  • Besides being called from train.sh, the script train_scheduler.sh can be called directly; this allows overriding the default NN-input and NN-target streams, which can be handy.
  • However, the script assumes everything is already set up correctly, so it is suitable only for more advanced users.
  • It is highly recommended to first look at how train_scheduler.sh is usually called before calling it directly; a hypothetical direct invocation is sketched below.
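A hypothetical direct call, assuming the argument order <mlp-init> <feats-tr> <feats-cv> <labels-tr> <labels-cv> <exp-dir> from the script's usage message, with $feats_tr/$feats_cv and $labels_tr/$labels_cv set up as shown earlier:

steps/nnet/train_scheduler.sh --learn-rate 0.008 \
  nnet.init "$feats_tr" "$feats_cv" "$labels_tr" "$labels_cv" exp/dnn5b_pretrain-dbn_dnn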
