Generative Adversarial Networks (GANs) are a type of generative model. The objective of a generative model is to learn the distribution of the training data, so that new data can be generated by sampling from the same distribution. Other popular types of generative models are Variational Autoencoders (VAEs) and normalizing-flow-based models (e.g. NICE, Glow). The Deep Convolutional Generative Adversarial Network (DCGAN), as the name suggests, is a GAN. Its distinguishing feature is that it uses convolutions in the discriminator and transposed convolutions in the generator.
GANs have a distinctive two-phase training procedure: in the first phase we train the discriminator, and in the second phase we train the generator. While training DCGAN on MNIST, a single training step (forward and backward propagation) invokes about 500 CUDA kernels on the GPU. The best way to understand the GAN training procedure and its GPU performance is to understand every CUDA kernel, i.e. which layer of the network invoked the kernel, with what arguments (tensor shapes and datatypes) and in which direction (forward or backward propagation).
In this post, we will categorize every kernel used in the training of DCGAN. All the information in the tables below was obtained using NVIDIA's PyTorch profiler, PyProf, on a Turing T4 GPU, and it is only a subset of what PyProf provides. The code and instructions for obtaining a detailed profile are here. Note that different GPUs will have slightly different kernel names, e.g. volta_* as opposed to turing_*.
The code for DCGAN was obtained from the PyTorch DCGAN Tutorial and modified to use MNIST. The image below shows the output of the generator for a fixed noise input at the beginning and at the end of epochs 1 through 10.
The parameters in the profiled code are as follows.
```python
# Size of training images.
image_size = 64
# Channels in the training images.
# 3 for color images, 1 for MNIST.
nc = 1
# Size of latent vector (i.e. size of generator input).
nz = 100
# Size of feature maps in generator.
ngf = 64
# Size of feature maps in discriminator.
ndf = 64
# Number of images per training batch.
batch_size = 128
```
The tables below show the GPU kernels invoked in 1 training step. For every GPU kernel we show the direction (fprop or bprop), the name of the layer, the name of the operation, and the input tensor shapes / matrix dimensions for the operation. PyProf provides a lot of additional information for every GPU kernel, such as grid dimensions, block dimensions, silicon time, datatypes, flops, bytes and tensor core usage.
GAN training consists of two parts: in part 1 we train the discriminator, and in part 2 we train the generator.
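As a reading aid, the part 1 kernel indices listed in the tables in this post can be grouped into stages. The sketch below is only a hypothetical bookkeeping helper (PART1_STAGES, is_contiguous and stage_sizes are illustrative names, not part of PyProf); it records each stage's first and last kernel index and checks that the stages cover kernels 1 through 271 without gaps.

```python
# Part 1 (discriminator training) grouped into stages. Each entry is
# (description, first kernel index, last kernel index), inclusive,
# read off the kernel tables in this post.
PART1_STAGES = [
    ("zero discriminator gradients", 1, 11),
    ("real images: label, D fprop, loss", 12, 35),
    ("real images: D bprop and reporting", 36, 83),
    ("fake images: noise, G fprop", 84, 110),
    ("fake images: label, D fprop, loss", 111, 134),
    ("fake images: D bprop and reporting", 135, 182),
    ("sum of real and fake losses", 183, 183),
    ("Adam update of D parameters", 184, 271),
]


def is_contiguous(stages):
    """Check that every stage starts right after the previous one ends."""
    expected_first = 1
    for _, first, last in stages:
        if first != expected_first or last < first:
            return False
        expected_first = last + 1
    return True


def stage_sizes(stages):
    """Map each stage description to its number of kernels."""
    return {name: last - first + 1 for name, first, last in stages}


if __name__ == "__main__":
    for name, size in stage_sizes(PART1_STAGES).items():
        print(f"{size:3d} kernels  {name}")
```

The real-image and fake-image backward stages have the same size (48 kernels each), which reflects that they launch the same kernels.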
At the beginning of part 1, we zero out the gradients of the discriminator.
Idx | Direction | Layer | Op | Params | Kernel |
---|---|---|---|---|---|
1 | fprop | Part1:D_Gradient | zero_ | [64,1,4,4] | modern::elementwise_kernel |
2 | fprop | Part1:D_Gradient | zero_ | [128,64,4,4] | modern::elementwise_kernel |
3 | fprop | Part1:D_Gradient | zero_ | [128] | modern::elementwise_kernel |
4 | fprop | Part1:D_Gradient | zero_ | [128] | modern::elementwise_kernel |
5 | fprop | Part1:D_Gradient | zero_ | [256,128,4,4] | modern::elementwise_kernel |
6 | fprop | Part1:D_Gradient | zero_ | [256] | modern::elementwise_kernel |
7 | fprop | Part1:D_Gradient | zero_ | [256] | modern::elementwise_kernel |
8 | fprop | Part1:D_Gradient | zero_ | [512,256,4,4] | modern::elementwise_kernel |
9 | fprop | Part1:D_Gradient | zero_ | [512] | modern::elementwise_kernel |
10 | fprop | Part1:D_Gradient | zero_ | [512] | modern::elementwise_kernel |
11 | fprop | Part1:D_Gradient | zero_ | [1,512,4,4] | modern::elementwise_kernel |
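Kernels 1-11 suggest one zero_ kernel per discriminator parameter tensor. The plain-Python sketch below derives those 11 shapes from nc and ndf, assuming bias-free convolutions (which the table implies, since no convolution bias shapes appear) and a weight plus a bias for each batch norm; discriminator_param_shapes is a hypothetical helper, not tutorial code.

```python
def discriminator_param_shapes(nc, ndf):
    """Parameter tensor shapes of the DCGAN discriminator, in registration order.

    Assumes bias-free convolutions, so each convolution contributes one weight
    tensor [K, C, R, S], and each batch norm contributes a weight and a bias
    of shape [channels].
    """
    shapes = [(ndf, nc, 4, 4)]                   # Conv1 (no batch norm after it)
    channels = ndf
    for _ in range(3):                           # Conv2-Conv4, each followed by BN
        shapes.append((channels * 2, channels, 4, 4))
        shapes.append((channels * 2,))           # BN weight (gamma)
        shapes.append((channels * 2,))           # BN bias (beta)
        channels *= 2
    shapes.append((1, channels, 4, 4))           # Conv5 (single output channel)
    return shapes


if __name__ == "__main__":
    for idx, shape in enumerate(discriminator_param_shapes(nc=1, ndf=64), start=1):
        print(idx, list(shape))
```

With nc=1 and ndf=64 the printed shapes match the Params column of kernels 1-11, in the same order.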
We pick batch_size (i.e. 128) images from the MNIST dataset and pass them through the discriminator. The target label for these images is set to 1 (kernel 12). The discriminator consists of 5 convolution layers. Kernels 13-15 correspond to the first convolution and activation. The input shape (N,C,H,W) is (128,1,64,64), the number of filters K is 64, the kernel size (R,S) is (4,4), the padding (ph,pw) is (1,1) and the vertical and horizontal stride (U,V) is (2,2). The output shape (N,K,P,Q) is (128,64,32,32).
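The output height P and width Q follow from the standard convolution output-size formula (assuming a dilation of 1, which the tables do not list). For the first convolution this gives:

$$
P = \left\lfloor \frac{H + 2\,ph - R}{U} \right\rfloor + 1 = \left\lfloor \frac{64 + 2 - 4}{2} \right\rfloor + 1 = 32,
\qquad
Q = \left\lfloor \frac{W + 2\,pw - S}{V} \right\rfloor + 1 = \left\lfloor \frac{64 + 2 - 4}{2} \right\rfloor + 1 = 32
$$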
Kernels 16-20 correspond to the second convolution, batch norm and activation. The output shape is (128,128,16,16).
Kernels 21-25 correspond to the third convolution, batch norm and activation. The output shape is (128,256,8,8).
Kernels 26-30 correspond to the fourth convolution, batch norm and activation. The output shape is (128,512,4,4).
Kernels 31-33 correspond to the fifth convolution and activation. Note that K=1 and that the 4x4 filter covers the entire 4x4 input with no padding and a stride of 1; therefore, the output shape is (128,1,1,1).
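Applying the output-size formula layer by layer reproduces all five discriminator output shapes. The following plain-Python sketch uses the (K, R, U, ph) values from the Params column of kernels 13-32; conv_out, D_CONVS and trace_discriminator are illustrative names, and square kernels with equal strides and paddings in both dimensions are assumed, as in the tables.

```python
def conv_out(size, kernel, stride, pad):
    """Output size of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * pad - kernel) // stride + 1


# (K, R, U, ph) for Conv1..Conv5, as listed in the Params column above.
D_CONVS = [
    (64, 4, 2, 1),
    (128, 4, 2, 1),
    (256, 4, 2, 1),
    (512, 4, 2, 1),
    (1, 4, 1, 0),
]


def trace_discriminator(n=128, h=64, w=64):
    """Return the output shape (N, K, P, Q) of each discriminator convolution."""
    shapes = []
    for k, r, u, p in D_CONVS:
        h, w = conv_out(h, r, u, p), conv_out(w, r, u, p)
        shapes.append((n, k, h, w))
    return shapes


if __name__ == "__main__":
    for shape in trace_discriminator():
        print(shape)
```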
Kernels 34-35 calculate the binary cross-entropy loss of the discriminator's predictions for the real images from the dataset, i.e. how far the discriminator is from classifying them as real.
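A plain-Python sketch of what kernels 34-35 plausibly compute, assuming the tutorial's nn.BCELoss with its default mean reduction: the pointwise kernel computes one loss value per image, and the reduce kernel averages them. The log is clamped at -100, mirroring the clamping that the PyTorch BCELoss documentation describes; bce_pointwise and bce_mean are illustrative names.

```python
import math


def _clamped_log(x):
    """Natural log clamped to >= -100 (and -100 for x <= 0)."""
    return max(math.log(x), -100.0) if x > 0.0 else -100.0


def bce_pointwise(probs, targets):
    """Per-image loss -(y*log(x) + (1-y)*log(1-x)), like the pointwise kernel."""
    return [-(y * _clamped_log(x) + (1.0 - y) * _clamped_log(1.0 - x))
            for x, y in zip(probs, targets)]


def bce_mean(probs, targets):
    """Average of the per-image losses, like the reduce kernel."""
    losses = bce_pointwise(probs, targets)
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Discriminator outputs for three real images, all with target label 1.
    print(bce_mean([0.9, 0.8, 0.99], [1.0, 1.0, 1.0]))
```

Splitting the loss into an elementwise pass followed by a reduction is why two kernels appear for a single binary_cross_entropy op.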
Idx | Direction | Layer | Op | Params | Kernel |
---|---|---|---|---|---|
12 | fprop | Part1 | full | [128] | modern::elementwise_kernel |
13 | fprop | Part1:Real:D:Conv1 | conv2d | N=128,C=1,H=64,W=64,K=64,P=32,Q=32,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeOffsetsKernel |
14 | fprop | Part1:Real:D:Conv1 | conv2d | N=128,C=1,H=64,W=64,K=64,P=32,Q=32,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x64_relu_small_nn_v1 |
15 | fprop | Part1:Real:D:LRelu1 | leaky_relu | [128,64,32,32] | modern::elementwise_kernel |
16 | fprop | Part1:Real:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeOffsetsKernel |
17 | fprop | Part1:Real:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x128_relu_small_nn_v1 |
18 | fprop | Part1:Real:D:BN2 | __add__ | [];[] | legacy::elementwise_kernel |
19 | fprop | Part1:Real:D:BN2 | batch_norm | [128,128,16,16] | cudnn::detail::bn_fw_tr_1C11_kernel_NCHW |
20 | fprop | Part1:Real:D:LRelu2 | leaky_relu | [128,128,16,16] | modern::elementwise_kernel |
21 | fprop | Part1:Real:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeOffsetsKernel |
22 | fprop | Part1:Real:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x128_relu_small_nn_v1 |
23 | fprop | Part1:Real:D:BN3 | __add__ | [];[] | legacy::elementwise_kernel |
24 | fprop | Part1:Real:D:BN3 | batch_norm | [128,256,8,8] | cudnn::detail::bn_fw_tr_1C11_kernel_NCHW |
25 | fprop | Part1:Real:D:LRelu3 | leaky_relu | [128,256,8,8] | modern::elementwise_kernel |
26 | fprop | Part1:Real:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeOffsetsKernel |
27 | fprop | Part1:Real:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x64_relu_small_nn_v1 |
28 | fprop | Part1:Real:D:BN4 | __add__ | [];[] | legacy::elementwise_kernel |
29 | fprop | Part1:Real:D:BN4 | batch_norm | [128,512,4,4] | cudnn::detail::bn_fw_tr_1C11_singleread |
30 | fprop | Part1:Real:D:LRelu4 | leaky_relu | [128,512,4,4] | modern::elementwise_kernel |
31 | fprop | Part1:Real:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeOffsetsKernel |
32 | fprop | Part1:Real:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | volta_scudnn_128x32_relu_interior_nn_v1 |
33 | fprop | Part1:Real:D:Sigmoid | sigmoid | [128,1,1,1] | modern::elementwise_kernel |
34 | fprop | Part1:Real:Loss | binary_cross_entropy | [128,128] | kernelPointwiseApply3 |
35 | fprop | Part1:Real:Loss | binary_cross_entropy | [128,128] | reduce_kernel |
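We now perform back propagation through the discriminator and calculate the gradients.
Kernels 37-38 correspond to bprop through the loss layer.
Kernels 39-46 correspond to bprop through the sigmoid and the fifth convolution layer (data gradient and weight gradient).
Kernels 48, 49 and 52-57 correspond to bprop through the fourth activation, batch norm and convolution layer.
Kernels 59, 60 and 63-70 correspond to bprop through the third activation, batch norm and convolution layer.
Kernels 72, 73 and 76-78 correspond to bprop through the second activation, batch norm and convolution layer.
Kernels 80 and 81 correspond to bprop through the first activation and convolution layer.
Note that the first convolution layer requires only a weight gradient and not a data gradient, because its input is the batch of images, which does not require a gradient. Kernels with the op add_ most likely correspond to gradient accumulation, i.e. adding the gradients to the previously zeroed-out gradient tensors. There are 11 of them (kernels 47, 50, 51, 58, 61, 62, 71, 74, 75, 79 and 82), one per discriminator parameter tensor (see kernels 1-11), which is consistent with this interpretation. Kernel 83 computes the mean of the discriminator's outputs over the batch (its input is a [128] tensor rather than the scalar loss), which the tutorial reports as D(x).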
Idx | Direction | Layer | Op | Params | Kernel |
---|---|---|---|---|---|
36 | fprop | Part1:Real | backward | - | legacy::elementwise_kernel |
37 | bprop | Part1:Real:Loss | binary_cross_entropy | [128,128] | kernelPointwiseApply4 |
38 | bprop | Part1:Real:Loss | binary_cross_entropy | [128,128] | modern::elementwise_kernel |
39 | bprop | Part1:Real:D:Sigmoid | sigmoid | [128,1,1,1] | modern::elementwise_kernel |
40 | bprop | Part1:Real:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeOffsetsKernel |
41 | bprop | Part1:Real:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeBOffsetsKernel |
42 | bprop | Part1:Real:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | volta_scudnn_128x128_stridedB_small_nn_v1 |
43 | bprop | Part1:Real:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeWgradSplitKOffsetsKernel |
44 | bprop | Part1:Real:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | scalePackedTensor_kernel |
45 | bprop | Part1:Real:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeWgradBOffsetsKernel |
46 | bprop | Part1:Real:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | volta_scudnn_128x64_stridedB_splitK_interior_nn_v1 |
47 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
48 | bprop | Part1:Real:D:LRelu4 | leaky_relu | [128,512,4,4] | modern::elementwise_kernel |
49 | bprop | Part1:Real:D:BN4 | batch_norm | [128,512,4,4] | cudnn::detail::bn_bw_1C11_singleread |
50 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
51 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
52 | bprop | Part1:Real:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
53 | bprop | Part1:Real:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::dgrad2d_alg1_1 |
54 | bprop | Part1:Real:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeWgradSplitKOffsetsKernel |
55 | bprop | Part1:Real:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
56 | bprop | Part1:Real:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeWgradBOffsetsKernel |
57 | bprop | Part1:Real:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x128_stridedB_splitK_small_nn_v1 |
58 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
59 | bprop | Part1:Real:D:LRelu3 | leaky_relu | [128,256,8,8] | modern::elementwise_kernel |
60 | bprop | Part1:Real:D:BN3 | batch_norm | [128,256,8,8] | cudnn::detail::bn_bw_1C11_kernel_new |
61 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
62 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
63 | bprop | Part1:Real:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | fft2d_r2c_32x32 |
64 | bprop | Part1:Real:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | fft2d_r2c_32x32 |
65 | bprop | Part1:Real:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_gcgemm_32x32_nt |
66 | bprop | Part1:Real:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | fft2d_c2r_32x32 |
67 | bprop | Part1:Real:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeWgradSplitKOffsetsKernel |
68 | bprop | Part1:Real:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
69 | bprop | Part1:Real:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeWgradBOffsetsKernel |
70 | bprop | Part1:Real:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x128_stridedB_splitK_small_nn_v1 |
71 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
72 | bprop | Part1:Real:D:LRelu2 | leaky_relu | [128,128,16,16] | modern::elementwise_kernel |
73 | bprop | Part1:Real:D:BN2 | batch_norm | [128,128,16,16] | cudnn::detail::bn_bw_1C11_kernel_new |
74 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
75 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
76 | bprop | Part1:Real:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
77 | bprop | Part1:Real:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::dgrad2d_alg1_1 |
78 | bprop | Part1:Real:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::wgrad_alg0_engine |
79 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
80 | bprop | Part1:Real:D:LRelu1 | leaky_relu | [128,64,32,32] | modern::elementwise_kernel |
81 | bprop | Part1:Real:D:Conv1 | conv2d | N=128,C=1,H=64,W=64,K=64,P=32,Q=32,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::wgrad_alg0_engine |
82 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
83 | fprop | Part1:Real | mean | [128] | reduce_kernel |
We now create batch_size (i.e. 128) fake images using the generator. The generator consists of 5 transposed convolutions, which progressively increase the image size from [nz,1,1] to [nc,image_size,image_size], i.e. from [100,1,1] to [1,64,64].
Kernel 84 creates a random tensor of shape [batch_size,nz,1,1].
Kernels 85-90 correspond to the first transposed convolution, batch norm and activation. The output shape is [128,512,4,4].
Kernels 91-95 correspond to the second transposed convolution, batch norm and activation. The output shape is [128,256,8,8].
Kernels 96-102 correspond to the third transposed convolution, batch norm and activation. The output shape is [128,128,16,16].
Kernels 103-107 correspond to the fourth transposed convolution, batch norm and activation. The output shape is [128,64,32,32].
Kernels 108-110 correspond to the fifth transposed convolution and activation. The output shape is [128,1,64,64].
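For a transposed convolution, the output size along each dimension is (H - 1) * U - 2 * ph + R (dilation 1, no output padding). The plain-Python sketch below traces the generator shapes with that formula. The output channels K and kernel size R come from the weight shapes in the table (e.g. (100,512,4,4) for ConvT1); the strides and paddings are not shown in the table and are assumed from the PyTorch DCGAN tutorial. conv_transpose_out, G_CONVTS and trace_generator are illustrative names.

```python
def conv_transpose_out(size, kernel, stride, pad):
    """Output size of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * pad + kernel


# (K, R, U, ph) for ConvT1..ConvT5. K and R match the weight shapes in the
# table; the strides and paddings are ASSUMED from the PyTorch DCGAN tutorial.
G_CONVTS = [
    (512, 4, 1, 0),
    (256, 4, 2, 1),
    (128, 4, 2, 1),
    (64, 4, 2, 1),
    (1, 4, 2, 1),
]


def trace_generator(n=128):
    """Return the output shape of each transposed convolution for [n, nz, 1, 1] noise."""
    h, w = 1, 1
    shapes = []
    for k, r, u, p in G_CONVTS:
        h, w = conv_transpose_out(h, r, u, p), conv_transpose_out(w, r, u, p)
        shapes.append((n, k, h, w))
    return shapes


if __name__ == "__main__":
    for shape in trace_generator():
        print(shape)
```

The first layer uses stride 1 and no padding to expand the 1x1 noise into a 4x4 map; each later layer with stride 2 and padding 1 doubles the spatial size.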
Idx | Direction | Layer | Op | Params | Kernel |
---|---|---|---|---|---|
84 | fprop | Part1:Fake | randn | - | distribution_elementwise_grid_stride_kernel |
85 | fprop | Part1:Fake:G:ConvT1 | conv_transpose2d | T=[(128,100,1,1),(100,512,4,4)] | cudnn::gemm::computeOffsetsKernel |
86 | fprop | Part1:Fake:G:ConvT1 | conv_transpose2d | T=[(128,100,1,1),(100,512,4,4)] | cudnn::gemm::computeBOffsetsKernel |
87 | fprop | Part1:Fake:G:ConvT1 | conv_transpose2d | T=[(128,100,1,1),(100,512,4,4)] | volta_scudnn_128x64_stridedB_small_nn_v1 |
88 | fprop | Part1:Fake:G:BN1 | __add__ | [];[] | legacy::elementwise_kernel |
89 | fprop | Part1:Fake:G:BN1 | batch_norm | [128,512,4,4] | cudnn::detail::bn_fw_tr_1C11_singleread |
90 | fprop | Part1:Fake:G:Relu1 | relu | [128,512,4,4] | modern::elementwise_kernel |
91 | fprop | Part1:Fake:G:ConvT2 | conv_transpose2d | T=[(128,512,4,4),(512,256,4,4)] | scalePackedTensor_kernel |
92 | fprop | Part1:Fake:G:ConvT2 | conv_transpose2d | T=[(128,512,4,4),(512,256,4,4)] | cudnn::detail::dgrad2d_alg1_1 |
93 | fprop | Part1:Fake:G:BN2 | __add__ | [];[] | legacy::elementwise_kernel |
94 | fprop | Part1:Fake:G:BN2 | batch_norm | [128,256,8,8] | cudnn::detail::bn_fw_tr_1C11_kernel_NCHW |
95 | fprop | Part1:Fake:G:Relu2 | relu | [128,256,8,8] | modern::elementwise_kernel |
96 | fprop | Part1:Fake:G:ConvT3 | conv_transpose2d | T=[(128,256,8,8),(256,128,4,4)] | fft2d_r2c_32x32 |
97 | fprop | Part1:Fake:G:ConvT3 | conv_transpose2d | T=[(128,256,8,8),(256,128,4,4)] | fft2d_r2c_32x32 |
98 | fprop | Part1:Fake:G:ConvT3 | conv_transpose2d | T=[(128,256,8,8),(256,128,4,4)] | volta_gcgemm_32x32_nt |
99 | fprop | Part1:Fake:G:ConvT3 | conv_transpose2d | T=[(128,256,8,8),(256,128,4,4)] | fft2d_c2r_32x32 |
100 | fprop | Part1:Fake:G:BN3 | __add__ | [];[] | legacy::elementwise_kernel |
101 | fprop | Part1:Fake:G:BN3 | batch_norm | [128,128,16,16] | cudnn::detail::bn_fw_tr_1C11_kernel_NCHW |
102 | fprop | Part1:Fake:G:Relu3 | relu | [128,128,16,16] | modern::elementwise_kernel |
103 | fprop | Part1:Fake:G:ConvT4 | conv_transpose2d | T=[(128,128,16,16),(128,64,4,4)] | scalePackedTensor_kernel |
104 | fprop | Part1:Fake:G:ConvT4 | conv_transpose2d | T=[(128,128,16,16),(128,64,4,4)] | cudnn::detail::dgrad2d_alg1_1 |
105 | fprop | Part1:Fake:G:BN4 | __add__ | [];[] | legacy::elementwise_kernel |
106 | fprop | Part1:Fake:G:BN4 | batch_norm | [128,64,32,32] | cudnn::detail::bn_fw_tr_1C11_kernel_NCHW |
107 | fprop | Part1:Fake:G:Relu4 | relu | [128,64,32,32] | modern::elementwise_kernel |
108 | fprop | Part1:Fake:G:ConvT5 | conv_transpose2d | T=[(128,64,32,32),(64,1,4,4)] | scalePackedTensor_kernel |
109 | fprop | Part1:Fake:G:ConvT5 | conv_transpose2d | T=[(128,64,32,32),(64,1,4,4)] | cudnn::detail::dgrad_engine |
110 | fprop | Part1:Fake:G:Tanh | tanh | [128,1,64,64] | kernelPointwiseApply2 |
The fake images from the generator are now fed to the discriminator. The target label for these images is set to 0 (kernel 111). Kernels 112 through 134 are the same as kernels 13 through 35, which were invoked when the discriminator was fed real images from the dataset.
Idx | Direction | Layer | Op | Params | Kernel |
---|---|---|---|---|---|
111 | fprop | Part1:Fake | fill_ | [128] | modern::elementwise_kernel |
112 | fprop | Part1:Fake:D:Conv1 | conv2d | N=128,C=1,H=64,W=64,K=64,P=32,Q=32,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeOffsetsKernel |
113 | fprop | Part1:Fake:D:Conv1 | conv2d | N=128,C=1,H=64,W=64,K=64,P=32,Q=32,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x64_relu_small_nn_v1 |
114 | fprop | Part1:Fake:D:LRelu1 | leaky_relu | [128,64,32,32] | modern::elementwise_kernel |
115 | fprop | Part1:Fake:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeOffsetsKernel |
116 | fprop | Part1:Fake:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x128_relu_small_nn_v1 |
117 | fprop | Part1:Fake:D:BN2 | __add__ | [];[] | legacy::elementwise_kernel |
118 | fprop | Part1:Fake:D:BN2 | batch_norm | [128,128,16,16] | cudnn::detail::bn_fw_tr_1C11_kernel_NCHW |
119 | fprop | Part1:Fake:D:LRelu2 | leaky_relu | [128,128,16,16] | modern::elementwise_kernel |
120 | fprop | Part1:Fake:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeOffsetsKernel |
121 | fprop | Part1:Fake:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x128_relu_small_nn_v1 |
122 | fprop | Part1:Fake:D:BN3 | __add__ | [];[] | legacy::elementwise_kernel |
123 | fprop | Part1:Fake:D:BN3 | batch_norm | [128,256,8,8] | cudnn::detail::bn_fw_tr_1C11_kernel_NCHW |
124 | fprop | Part1:Fake:D:LRelu3 | leaky_relu | [128,256,8,8] | modern::elementwise_kernel |
125 | fprop | Part1:Fake:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeOffsetsKernel |
126 | fprop | Part1:Fake:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x64_relu_small_nn_v1 |
127 | fprop | Part1:Fake:D:BN4 | __add__ | [];[] | legacy::elementwise_kernel |
128 | fprop | Part1:Fake:D:BN4 | batch_norm | [128,512,4,4] | cudnn::detail::bn_fw_tr_1C11_singleread |
129 | fprop | Part1:Fake:D:LRelu4 | leaky_relu | [128,512,4,4] | modern::elementwise_kernel |
130 | fprop | Part1:Fake:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeOffsetsKernel |
131 | fprop | Part1:Fake:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | volta_scudnn_128x32_relu_interior_nn_v1 |
132 | fprop | Part1:Fake:D:Sigmoid | sigmoid | [128,1,1,1] | modern::elementwise_kernel |
133 | fprop | Part1:Fake:Loss | binary_cross_entropy | T=[(128,),(128,)] | kernelPointwiseApply3 |
134 | fprop | Part1:Fake:Loss | binary_cross_entropy | T=[(128,),(128,)] | reduce_kernel |
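We now perform back propagation through the discriminator again, calculating the gradients and accumulating them onto those from the real images. Kernels 135 through 182 are the same as kernels 36 through 83. No generator kernels appear here because the fake images are detached from the generator before being fed to the discriminator (as in the PyTorch DCGAN tutorial), so no gradients flow into the generator in part 1. Kernel 183 adds the losses from the real and fake images (for reporting).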
Idx | Direction | Layer | Op | Params | Kernel |
---|---|---|---|---|---|
135 | fprop | Part1:Fake | backward | - | legacy::elementwise_kernel |
136 | bprop | Part1:Fake:Loss | binary_cross_entropy | T=[(128,),(128,)] | kernelPointwiseApply4 |
137 | bprop | Part1:Fake:Loss | binary_cross_entropy | T=[(128,),(128,)] | modern::elementwise_kernel |
138 | bprop | Part1:Fake:D:Sigmoid | sigmoid | [128,1,1,1] | modern::elementwise_kernel |
139 | bprop | Part1:Fake:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeOffsetsKernel |
140 | bprop | Part1:Fake:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeBOffsetsKernel |
141 | bprop | Part1:Fake:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | volta_scudnn_128x128_stridedB_small_nn_v1 |
142 | bprop | Part1:Fake:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeWgradSplitKOffsetsKernel |
143 | bprop | Part1:Fake:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | scalePackedTensor_kernel |
144 | bprop | Part1:Fake:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeWgradBOffsetsKernel |
145 | bprop | Part1:Fake:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | volta_scudnn_128x64_stridedB_splitK_interior_nn_v1 |
146 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
147 | bprop | Part1:Fake:D:LRelu4 | leaky_relu | [128,512,4,4] | modern::elementwise_kernel |
148 | bprop | Part1:Fake:D:BN4 | batch_norm | [128,512,4,4] | cudnn::detail::bn_bw_1C11_singleread |
149 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
150 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
151 | bprop | Part1:Fake:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
152 | bprop | Part1:Fake:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::dgrad2d_alg1_1 |
153 | bprop | Part1:Fake:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeWgradSplitKOffsetsKernel |
154 | bprop | Part1:Fake:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
155 | bprop | Part1:Fake:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeWgradBOffsetsKernel |
156 | bprop | Part1:Fake:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x128_stridedB_splitK_small_nn_v1 |
157 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
158 | bprop | Part1:Fake:D:LRelu3 | leaky_relu | [128,256,8,8] | modern::elementwise_kernel |
159 | bprop | Part1:Fake:D:BN3 | batch_norm | [128,256,8,8] | cudnn::detail::bn_bw_1C11_kernel_new |
160 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
161 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
162 | bprop | Part1:Fake:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | fft2d_r2c_32x32 |
163 | bprop | Part1:Fake:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | fft2d_r2c_32x32 |
164 | bprop | Part1:Fake:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_gcgemm_32x32_nt |
165 | bprop | Part1:Fake:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | fft2d_c2r_32x32 |
166 | bprop | Part1:Fake:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeWgradSplitKOffsetsKernel |
167 | bprop | Part1:Fake:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
168 | bprop | Part1:Fake:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeWgradBOffsetsKernel |
169 | bprop | Part1:Fake:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x128_stridedB_splitK_small_nn_v1 |
170 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
171 | bprop | Part1:Fake:D:LRelu2 | leaky_relu | [128,128,16,16] | modern::elementwise_kernel |
172 | bprop | Part1:Fake:D:BN2 | batch_norm | [128,128,16,16] | cudnn::detail::bn_bw_1C11_kernel_new |
173 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
174 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
175 | bprop | Part1:Fake:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
176 | bprop | Part1:Fake:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::dgrad2d_alg1_1 |
177 | bprop | Part1:Fake:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::wgrad_alg0_engine |
178 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
179 | bprop | Part1:Fake:D:LRelu1 | leaky_relu | [128,64,32,32] | modern::elementwise_kernel |
180 | bprop | Part1:Fake:D:Conv1 | conv2d | N=128,C=1,H=64,W=64,K=64,P=32,Q=32,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::wgrad_alg0_engine |
181 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
182 | fprop | Part1:Fake | mean | [128] | reduce_kernel |
183 | fprop | Part1 | __add__ | [];[] | legacy::elementwise_kernel |
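The add_ kernels interleaved with both backward passes are most likely gradient accumulation. A minimal plain-Python sketch of that pattern for a single hypothetical, flattened parameter tensor: the gradient is zeroed once at the start of part 1, then the real-image and fake-image backward passes each add into it, so the optimizer sees their sum. zero_, add_ and discriminator_grad are illustrative stand-ins for the tensor ops, not PyTorch code.

```python
def zero_(grad):
    """In-place zeroing, standing in for a zero_ kernel on a gradient tensor."""
    for i in range(len(grad)):
        grad[i] = 0.0


def add_(grad, update):
    """In-place accumulation, standing in for an add_ kernel on a gradient tensor."""
    for i, u in enumerate(update):
        grad[i] += u


def discriminator_grad(stale, real_grad, fake_grad):
    """Gradient seen by the optimizer after both backward passes of part 1."""
    grad = list(stale)      # leftover gradient from the previous training step
    zero_(grad)             # start of part 1 (kernels 1-11)
    add_(grad, real_grad)   # backward pass for the real images
    add_(grad, fake_grad)   # backward pass for the fake images
    return grad


if __name__ == "__main__":
    print(discriminator_grad([0.3, -1.2, 5.0], [0.1, 0.2, -0.3], [0.4, -0.1, 0.2]))
```

Without the initial zeroing, the stale gradient from the previous step would leak into the update.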
After calculating and summing up the gradients from the real and fake images, we apply the Adam optimizer to the discriminator weights (parameters). The discriminator has 11 parameter tensors: 1 for each of the 5 convolutions and 2 (weight and bias) for each of the 3 batch norms (see kernels 1-11). The Adam update of each parameter tensor invokes 8 kernels, for a total of 88 kernels (184 through 271). This implementation is not optimized; the fused Adam implementation from NVIDIA Apex could be used instead to reduce the number of kernel launches.
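The 8-op sequence per parameter tensor appears to be an unfused Adam update. The plain-Python scalar sketch below performs one update with one line per op, in the order listed in the table. The hyperparameter defaults (lr=2e-4, beta1=0.5) are assumptions taken from the PyTorch DCGAN tutorial, and ADAM_OPS, adam_step and NUM_D_PARAMS are illustrative names.

```python
import math

# Per-parameter op sequence seen in kernels 184-191, repeated for each
# of the 11 discriminator parameter tensors.
ADAM_OPS = ["mul_", "add_", "mul_", "addcmul_", "sqrt", "__truediv__", "add_", "addcdiv_"]
NUM_D_PARAMS = 11


def adam_step(p, grad, exp_avg, exp_avg_sq, step, lr=2e-4, beta1=0.5, beta2=0.999, eps=1e-8):
    """One unfused Adam update of a scalar parameter, one line per profiled op."""
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    exp_avg = exp_avg * beta1                             # mul_
    exp_avg = exp_avg + (1 - beta1) * grad                # add_
    exp_avg_sq = exp_avg_sq * beta2                       # mul_
    exp_avg_sq = exp_avg_sq + (1 - beta2) * grad * grad   # addcmul_
    denom = math.sqrt(exp_avg_sq)                         # sqrt
    denom = denom / math.sqrt(bias_correction2)           # __truediv__
    denom = denom + eps                                   # add_
    p = p - (lr / bias_correction1) * exp_avg / denom     # addcdiv_
    return p, exp_avg, exp_avg_sq


if __name__ == "__main__":
    print("kernels for D update:", len(ADAM_OPS) * NUM_D_PARAMS)
    print(adam_step(p=1.0, grad=2.0, exp_avg=0.0, exp_avg_sq=0.0, step=1))
```

Because each line is a separate elementwise op over a whole tensor, an unfused update launches 8 kernels per parameter tensor. A fused optimizer is designed to perform the same arithmetic in far fewer launches.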
Idx | Direction | Layer | Op | Params | Kernel |
---|---|---|---|---|---|
184 | fprop | Part1:Optim | mul_ | [64,1,4,4];[] | modern::elementwise_kernel |
185 | fprop | Part1:Optim | add_ | [64,1,4,4];[64,1,4,4] | modern::elementwise_kernel |
186 | fprop | Part1:Optim | mul_ | [64,1,4,4];[] | modern::elementwise_kernel |
187 | fprop | Part1:Optim | addcmul_ | [64,1,4,4];[64,1,4,4];[64,1,4,4] | modern::elementwise_kernel |
188 | fprop | Part1:Optim | sqrt | [64,1,4,4] | modern::elementwise_kernel |
189 | fprop | Part1:Optim | __truediv__ | [64,1,4,4];[] | modern::elementwise_kernel |
190 | fprop | Part1:Optim | add_ | [64,1,4,4];[] | modern::elementwise_kernel |
191 | fprop | Part1:Optim | addcdiv_ | [64,1,4,4];[64,1,4,4];[64,1,4,4] | modern::elementwise_kernel |
192 | fprop | Part1:Optim | mul_ | [128,64,4,4];[] | modern::elementwise_kernel |
193 | fprop | Part1:Optim | add_ | [128,64,4,4];[128,64,4,4] | modern::elementwise_kernel |
194 | fprop | Part1:Optim | mul_ | [128,64,4,4];[] | modern::elementwise_kernel |
195 | fprop | Part1:Optim | addcmul_ | [128,64,4,4];[128,64,4,4];[128,64,4,4] | modern::elementwise_kernel |
196 | fprop | Part1:Optim | sqrt | [128,64,4,4] | modern::elementwise_kernel |
197 | fprop | Part1:Optim | __truediv__ | [128,64,4,4];[] | modern::elementwise_kernel |
198 | fprop | Part1:Optim | add_ | [128,64,4,4];[] | modern::elementwise_kernel |
199 | fprop | Part1:Optim | addcdiv_ | [128,64,4,4];[128,64,4,4];[128,64,4,4] | modern::elementwise_kernel |
200 | fprop | Part1:Optim | mul_ | [128];[] | modern::elementwise_kernel |
201 | fprop | Part1:Optim | add_ | [128];[128] | modern::elementwise_kernel |
202 | fprop | Part1:Optim | mul_ | [128];[] | modern::elementwise_kernel |
203 | fprop | Part1:Optim | addcmul_ | [128];[128];[128] | modern::elementwise_kernel |
204 | fprop | Part1:Optim | sqrt | [128] | modern::elementwise_kernel |
205 | fprop | Part1:Optim | __truediv__ | [128];[] | modern::elementwise_kernel |
206 | fprop | Part1:Optim | add_ | [128];[] | modern::elementwise_kernel |
207 | fprop | Part1:Optim | addcdiv_ | [128];[128];[128] | modern::elementwise_kernel |
208 | fprop | Part1:Optim | mul_ | [128];[] | modern::elementwise_kernel |
209 | fprop | Part1:Optim | add_ | [128];[128] | modern::elementwise_kernel |
210 | fprop | Part1:Optim | mul_ | [128];[] | modern::elementwise_kernel |
211 | fprop | Part1:Optim | addcmul_ | [128];[128];[128] | modern::elementwise_kernel |
212 | fprop | Part1:Optim | sqrt | [128] | modern::elementwise_kernel |
213 | fprop | Part1:Optim | __truediv__ | [128];[] | modern::elementwise_kernel |
214 | fprop | Part1:Optim | add_ | [128];[] | modern::elementwise_kernel |
215 | fprop | Part1:Optim | addcdiv_ | [128];[128];[128] | modern::elementwise_kernel |
216 | fprop | Part1:Optim | mul_ | [256,128,4,4];[] | modern::elementwise_kernel |
217 | fprop | Part1:Optim | add_ | [256,128,4,4];[256,128,4,4] | modern::elementwise_kernel |
218 | fprop | Part1:Optim | mul_ | [256,128,4,4];[] | modern::elementwise_kernel |
219 | fprop | Part1:Optim | addcmul_ | [256,128,4,4];[256,128,4,4];[256,128,4,4] | modern::elementwise_kernel |
220 | fprop | Part1:Optim | sqrt | [256,128,4,4] | modern::elementwise_kernel |
221 | fprop | Part1:Optim | __truediv__ | [256,128,4,4];[] | modern::elementwise_kernel |
222 | fprop | Part1:Optim | add_ | [256,128,4,4];[] | modern::elementwise_kernel |
223 | fprop | Part1:Optim | addcdiv_ | [256,128,4,4];[256,128,4,4];[256,128,4,4] | modern::elementwise_kernel |
224 | fprop | Part1:Optim | mul_ | [256];[] | modern::elementwise_kernel |
225 | fprop | Part1:Optim | add_ | [256];[256] | modern::elementwise_kernel |
226 | fprop | Part1:Optim | mul_ | [256];[] | modern::elementwise_kernel |
227 | fprop | Part1:Optim | addcmul_ | [256];[256];[256] | modern::elementwise_kernel |
228 | fprop | Part1:Optim | sqrt | [256] | modern::elementwise_kernel |
229 | fprop | Part1:Optim | __truediv__ | [256];[] | modern::elementwise_kernel |
230 | fprop | Part1:Optim | add_ | [256];[] | modern::elementwise_kernel |
231 | fprop | Part1:Optim | addcdiv_ | [256];[256];[256] | modern::elementwise_kernel |
232 | fprop | Part1:Optim | mul_ | [256];[] | modern::elementwise_kernel |
233 | fprop | Part1:Optim | add_ | [256];[256] | modern::elementwise_kernel |
234 | fprop | Part1:Optim | mul_ | [256];[] | modern::elementwise_kernel |
235 | fprop | Part1:Optim | addcmul_ | [256];[256];[256] | modern::elementwise_kernel |
236 | fprop | Part1:Optim | sqrt | [256] | modern::elementwise_kernel |
237 | fprop | Part1:Optim | __truediv__ | [256];[] | modern::elementwise_kernel |
238 | fprop | Part1:Optim | add_ | [256];[] | modern::elementwise_kernel |
239 | fprop | Part1:Optim | addcdiv_ | [256];[256];[256] | modern::elementwise_kernel |
240 | fprop | Part1:Optim | mul_ | [512,256,4,4];[] | modern::elementwise_kernel |
241 | fprop | Part1:Optim | add_ | [512,256,4,4];[512,256,4,4] | modern::elementwise_kernel |
242 | fprop | Part1:Optim | mul_ | [512,256,4,4];[] | modern::elementwise_kernel |
243 | fprop | Part1:Optim | addcmul_ | [512,256,4,4];[512,256,4,4];[512,256,4,4] | modern::elementwise_kernel |
244 | fprop | Part1:Optim | sqrt | [512,256,4,4] | modern::elementwise_kernel |
245 | fprop | Part1:Optim | __truediv__ | [512,256,4,4];[] | modern::elementwise_kernel |
246 | fprop | Part1:Optim | add_ | [512,256,4,4];[] | modern::elementwise_kernel |
247 | fprop | Part1:Optim | addcdiv_ | [512,256,4,4];[512,256,4,4];[512,256,4,4] | modern::elementwise_kernel |
248 | fprop | Part1:Optim | mul_ | [512];[] | modern::elementwise_kernel |
249 | fprop | Part1:Optim | add_ | [512];[512] | modern::elementwise_kernel |
250 | fprop | Part1:Optim | mul_ | [512];[] | modern::elementwise_kernel |
251 | fprop | Part1:Optim | addcmul_ | [512];[512];[512] | modern::elementwise_kernel |
252 | fprop | Part1:Optim | sqrt | [512] | modern::elementwise_kernel |
253 | fprop | Part1:Optim | __truediv__ | [512];[] | modern::elementwise_kernel |
254 | fprop | Part1:Optim | add_ | [512];[] | modern::elementwise_kernel |
255 | fprop | Part1:Optim | addcdiv_ | [512];[512];[512] | modern::elementwise_kernel |
256 | fprop | Part1:Optim | mul_ | [512];[] | modern::elementwise_kernel |
257 | fprop | Part1:Optim | add_ | [512];[512] | modern::elementwise_kernel |
258 | fprop | Part1:Optim | mul_ | [512];[] | modern::elementwise_kernel |
259 | fprop | Part1:Optim | addcmul_ | [512];[512];[512] | modern::elementwise_kernel |
260 | fprop | Part1:Optim | sqrt | [512] | modern::elementwise_kernel |
261 | fprop | Part1:Optim | __truediv__ | [512];[] | modern::elementwise_kernel |
262 | fprop | Part1:Optim | add_ | [512];[] | modern::elementwise_kernel |
263 | fprop | Part1:Optim | addcdiv_ | [512];[512];[512] | modern::elementwise_kernel |
264 | fprop | Part1:Optim | mul_ | [1,512,4,4];[] | modern::elementwise_kernel |
265 | fprop | Part1:Optim | add_ | [1,512,4,4];[1,512,4,4] | modern::elementwise_kernel |
266 | fprop | Part1:Optim | mul_ | [1,512,4,4];[] | modern::elementwise_kernel |
267 | fprop | Part1:Optim | addcmul_ | [1,512,4,4];[1,512,4,4];[1,512,4,4] | modern::elementwise_kernel |
268 | fprop | Part1:Optim | sqrt | [1,512,4,4] | modern::elementwise_kernel |
269 | fprop | Part1:Optim | __truediv__ | [1,512,4,4];[] | modern::elementwise_kernel |
270 | fprop | Part1:Optim | add_ | [1,512,4,4];[] | modern::elementwise_kernel |
271 | fprop | Part1:Optim | addcdiv_ | [1,512,4,4];[1,512,4,4];[1,512,4,4] | modern::elementwise_kernel |
At the beginning of part 2, we zero out the generator's gradients, which launches one `zero_` kernel for each of its 13 parameter tensors.
Idx | Direction | Layer | Op | Params | Kernel |
---|---|---|---|---|---|
272 | fprop | Part2:G_Gradient | zero_ | [100,512,4,4] | modern::elementwise_kernel |
273 | fprop | Part2:G_Gradient | zero_ | [512] | modern::elementwise_kernel |
274 | fprop | Part2:G_Gradient | zero_ | [512] | modern::elementwise_kernel |
275 | fprop | Part2:G_Gradient | zero_ | [512,256,4,4] | modern::elementwise_kernel |
276 | fprop | Part2:G_Gradient | zero_ | [256] | modern::elementwise_kernel |
277 | fprop | Part2:G_Gradient | zero_ | [256] | modern::elementwise_kernel |
278 | fprop | Part2:G_Gradient | zero_ | [256,128,4,4] | modern::elementwise_kernel |
279 | fprop | Part2:G_Gradient | zero_ | [128] | modern::elementwise_kernel |
280 | fprop | Part2:G_Gradient | zero_ | [128] | modern::elementwise_kernel |
281 | fprop | Part2:G_Gradient | zero_ | [128,64,4,4] | modern::elementwise_kernel |
282 | fprop | Part2:G_Gradient | zero_ | [64] | modern::elementwise_kernel |
283 | fprop | Part2:G_Gradient | zero_ | [64] | modern::elementwise_kernel |
284 | fprop | Part2:G_Gradient | zero_ | [64,1,4,4] | modern::elementwise_kernel |
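As a cross-check, the 13 shapes in this table follow from the generator's architecture. The sketch below assumes the PyTorch DCGAN tutorial layout: bias-free transposed convolutions, a BatchNorm2d (weight and bias) after each of the first four, and `ngf = 64`, which is inferred from the shapes above rather than stated in the profiled parameters. `generator_param_shapes` is a hypothetical helper, not a PyTorch API.

```python
from math import prod


def generator_param_shapes(nz=100, ngf=64, nc=1):
    """Hypothetical helper: shapes of the generator's learnable tensors,
    assuming the DCGAN tutorial layout (bias-free ConvTranspose2d layers,
    the first four each followed by a BatchNorm2d)."""
    # Channel progression: nz -> 8*ngf -> 4*ngf -> 2*ngf -> ngf -> nc
    channels = [nz, ngf * 8, ngf * 4, ngf * 2, ngf, nc]
    shapes = []
    for i in range(5):
        c_in, c_out = channels[i], channels[i + 1]
        # ConvTranspose2d weights are laid out as [in_channels, out_channels, kH, kW].
        shapes.append((c_in, c_out, 4, 4))
        if i < 4:
            shapes.append((c_out,))  # BatchNorm2d weight (gamma)
            shapes.append((c_out,))  # BatchNorm2d bias (beta)
    return shapes


shapes = generator_param_shapes()
for s in shapes:
    print(list(s))  # one zero_ kernel per tensor (kernels 272-284)
print(len(shapes), "tensors,", sum(prod(s) for s in shapes), "parameters")
```

The printed shapes appear in the same order as the `zero_` kernels, which is also the order in which the optimizer later visits the parameters.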
In part 1, to train the discriminator, we used the generator to create `batch_size` (i.e. 128) fake images. To train the generator, we now reuse those images and feed them through the discriminator again. This time, however, their target label is set to 1, i.e. "real" (kernel 285), so that minimizing the loss pushes the generator toward images the discriminator classifies as real. Kernels 286 through 308 are the same as the corresponding forward-pass kernels in part 1.
Idx | Direction | Layer | Op | Params | Kernel |
---|---|---|---|---|---|
285 | fprop | Part2 | fill_ | [128] | modern::elementwise_kernel |
286 | fprop | Part2:D:Conv1 | conv2d | N=128,C=1,H=64,W=64,K=64,P=32,Q=32,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeOffsetsKernel |
287 | fprop | Part2:D:Conv1 | conv2d | N=128,C=1,H=64,W=64,K=64,P=32,Q=32,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x64_relu_small_nn_v1 |
288 | fprop | Part2:D:LRelu1 | leaky_relu | [128,64,32,32] | modern::elementwise_kernel |
289 | fprop | Part2:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeOffsetsKernel |
290 | fprop | Part2:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x128_relu_small_nn_v1 |
291 | fprop | Part2:D:BN2 | __add__ | [];[] | legacy::elementwise_kernel |
292 | fprop | Part2:D:BN2 | batch_norm | [128,128,16,16] | cudnn::detail::bn_fw_tr_1C11_kernel_NCHW |
293 | fprop | Part2:D:LRelu2 | leaky_relu | [128,128,16,16] | modern::elementwise_kernel |
294 | fprop | Part2:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeOffsetsKernel |
295 | fprop | Part2:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x128_relu_small_nn_v1 |
296 | fprop | Part2:D:BN3 | __add__ | [];[] | legacy::elementwise_kernel |
297 | fprop | Part2:D:BN3 | batch_norm | [128,256,8,8] | cudnn::detail::bn_fw_tr_1C11_kernel_NCHW |
298 | fprop | Part2:D:LRelu3 | leaky_relu | [128,256,8,8] | modern::elementwise_kernel |
299 | fprop | Part2:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeOffsetsKernel |
300 | fprop | Part2:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x64_relu_small_nn_v1 |
301 | fprop | Part2:D:BN4 | __add__ | [];[] | legacy::elementwise_kernel |
302 | fprop | Part2:D:BN4 | batch_norm | [128,512,4,4] | cudnn::detail::bn_fw_tr_1C11_singleread |
303 | fprop | Part2:D:LRelu4 | leaky_relu | [128,512,4,4] | modern::elementwise_kernel |
304 | fprop | Part2:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeOffsetsKernel |
305 | fprop | Part2:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | volta_scudnn_128x32_relu_interior_nn_v1 |
306 | fprop | Part2:D:Sigmoid | sigmoid | [128,1,1,1] | modern::elementwise_kernel |
307 | fprop | Part2:Loss | binary_cross_entropy | T=[(128,),(128,)] | kernelPointwiseApply3 |
308 | fprop | Part2:Loss | binary_cross_entropy | T=[(128,),(128,)] | reduce_kernel |
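A minimal plain-Python sketch of what kernels 285, 307 and 308 compute (illustrative only, not PyTorch's implementation): with every label filled with 1, the binary cross-entropy reduces to the batch mean of -log D(G(z)). The discriminator outputs below are made-up values, and the label-0 comparison assumes part 1 labeled the same fake images as 0, as in the DCGAN tutorial.

```python
import math


def bce_forward(probs, targets):
    """Mean binary cross-entropy, split into the two stages seen in the profile."""
    # Pointwise stage (cf. kernel 307, kernelPointwiseApply3):
    # per-sample loss -[y*log(p) + (1-y)*log(1-p)].
    # PyTorch's BCELoss also clamps each log term to >= -100; omitted here.
    per_sample = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
                  for p, y in zip(probs, targets)]
    # Reduction stage (cf. kernel 308, reduce_kernel): mean over the batch.
    return sum(per_sample) / len(per_sample)


batch_size = 128
d_of_fake = [0.3] * batch_size    # hypothetical discriminator outputs D(G(z))
real_labels = [1.0] * batch_size  # cf. kernel 285: labels filled with 1
fake_labels = [0.0] * batch_size  # labels assumed for the same images in part 1

err_g = bce_forward(d_of_fake, real_labels)       # mean of -log D(G(z))
err_d_fake = bce_forward(d_of_fake, fake_labels)  # mean of -log(1 - D(G(z)))
print(err_g, -math.log(0.3))
print(err_d_fake, -math.log(0.7))
```

Only the target values differ between the two calls; the kernels that evaluate the loss are the same, which is why the forward pass in part 2 mirrors the one in part 1.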
We now perform back propagation through the discriminator. Kernels 309 through 357 are the same as kernels 36 through 82, with one difference: we now also calculate the data gradient for the first convolution layer, which results in 2 additional kernels. This gradient is required because the discriminator's input is now the generator's output, so the loss has to flow back through Conv1 into the generator. Ideally, since part 2 does not update the discriminator's parameters, we would only need the data gradients and not the weight gradients.
Idx | Direction | Layer | Op | Params | Kernel |
---|---|---|---|---|---|
309 | fprop | Part2 | backward | | legacy::elementwise_kernel |
310 | bprop | Part2:Loss | binary_cross_entropy | T=[(128,),(128,)] | kernelPointwiseApply4 |
311 | bprop | Part2:Loss | binary_cross_entropy | T=[(128,),(128,)] | modern::elementwise_kernel |
312 | bprop | Part2:D:Sigmoid | sigmoid | [128,1,1,1] | modern::elementwise_kernel |
313 | bprop | Part2:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeOffsetsKernel |
314 | bprop | Part2:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeBOffsetsKernel |
315 | bprop | Part2:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | volta_scudnn_128x128_stridedB_small_nn_v1 |
316 | bprop | Part2:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeWgradSplitKOffsetsKernel |
317 | bprop | Part2:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | scalePackedTensor_kernel |
318 | bprop | Part2:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | cudnn::gemm::computeWgradBOffsetsKernel |
319 | bprop | Part2:D:Conv5 | conv2d | N=128,C=512,H=4,W=4,K=1,P=1,Q=1,R=4,S=4,ph=0,pw=0,U=1,V=1 | volta_scudnn_128x64_stridedB_splitK_interior_nn_v1 |
320 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
321 | bprop | Part2:D:LRelu4 | leaky_relu | [128,512,4,4] | modern::elementwise_kernel |
322 | bprop | Part2:D:BN4 | batch_norm | [128,512,4,4] | cudnn::detail::bn_bw_1C11_singleread |
323 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
324 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
325 | bprop | Part2:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
326 | bprop | Part2:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::dgrad2d_alg1_1 |
327 | bprop | Part2:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeWgradSplitKOffsetsKernel |
328 | bprop | Part2:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
329 | bprop | Part2:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeWgradBOffsetsKernel |
330 | bprop | Part2:D:Conv4 | conv2d | N=128,C=256,H=8,W=8,K=512,P=4,Q=4,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x128_stridedB_splitK_small_nn_v1 |
331 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
332 | bprop | Part2:D:LRelu3 | leaky_relu | [128,256,8,8] | modern::elementwise_kernel |
333 | bprop | Part2:D:BN3 | batch_norm | [128,256,8,8] | cudnn::detail::bn_bw_1C11_kernel_new |
334 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
335 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
336 | bprop | Part2:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | fft2d_r2c_32x32 |
337 | bprop | Part2:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | fft2d_r2c_32x32 |
338 | bprop | Part2:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_gcgemm_32x32_nt |
339 | bprop | Part2:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | fft2d_c2r_32x32 |
340 | bprop | Part2:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeWgradSplitKOffsetsKernel |
341 | bprop | Part2:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
342 | bprop | Part2:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::gemm::computeWgradBOffsetsKernel |
343 | bprop | Part2:D:Conv3 | conv2d | N=128,C=128,H=16,W=16,K=256,P=8,Q=8,R=4,S=4,ph=1,pw=1,U=2,V=2 | volta_scudnn_128x128_stridedB_splitK_small_nn_v1 |
344 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
345 | bprop | Part2:D:LRelu2 | leaky_relu | [128,128,16,16] | modern::elementwise_kernel |
346 | bprop | Part2:D:BN2 | batch_norm | [128,128,16,16] | cudnn::detail::bn_bw_1C11_kernel_new |
347 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
348 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
349 | bprop | Part2:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
350 | bprop | Part2:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::dgrad2d_alg1_1 |
351 | bprop | Part2:D:Conv2 | conv2d | N=128,C=64,H=32,W=32,K=128,P=16,Q=16,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::wgrad_alg0_engine |
352 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
353 | bprop | Part2:D:LRelu1 | leaky_relu | [128,64,32,32] | modern::elementwise_kernel |
354 | bprop | Part2:D:Conv1 | conv2d | N=128,C=1,H=64,W=64,K=64,P=32,Q=32,R=4,S=4,ph=1,pw=1,U=2,V=2 | scalePackedTensor_kernel |
355 | bprop | Part2:D:Conv1 | conv2d | N=128,C=1,H=64,W=64,K=64,P=32,Q=32,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::dgrad_engine |
356 | bprop | Part2:D:Conv1 | conv2d | N=128,C=1,H=64,W=64,K=64,P=32,Q=32,R=4,S=4,ph=1,pw=1,U=2,V=2 | cudnn::detail::wgrad_alg0_engine |
357 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
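The first three backward kernels chain two local derivatives: BCE backward (kernels 310-311) followed by sigmoid backward (kernel 312). The plain-Python sketch below uses illustrative helper names and made-up logits, not PyTorch APIs or profiled values. It shows that the two factors of p(1 - p) cancel, so the gradient that reaches Conv5's output is simply (p - y)/N.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_sigmoid_backward(logits, targets):
    """Gradient of mean BCE(sigmoid(x), y) w.r.t. each logit x, computed in
    the same two stages as the profile: BCE backward, then sigmoid backward."""
    n = len(logits)
    grads = []
    for x, y in zip(logits, targets):
        p = sigmoid(x)
        # BCE backward (cf. kernels 310-311): dL/dp for the mean-reduced loss.
        dl_dp = (p - y) / (p * (1.0 - p)) / n
        # Sigmoid backward (cf. kernel 312): multiply by dp/dx = p * (1 - p).
        grads.append(dl_dp * p * (1.0 - p))
    return grads


logits = [-1.5, 0.0, 2.0]  # hypothetical pre-sigmoid outputs of Conv5
targets = [1.0, 1.0, 1.0]  # part 2 labels the fake images as real
grads = bce_sigmoid_backward(logits, targets)
print(grads)
print([(sigmoid(x) - y) / len(logits) for x, y in zip(logits, targets)])
```

Both lines print the same gradients (up to rounding). The profile computes them as separate stages, with the sigmoid in the model and the BCE in the loss.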
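We now perform back propagation through the generator. Its layer names still carry the `Part1:Fake` prefix because the generator's forward pass ran in part 1; each backward kernel is labeled with the layer of the forward op it belongs to.
- Kernels 358-361 correspond to bprop through the Tanh and the fifth transposed convolution layer.
- Kernels 363, 364 and 367-369 correspond to bprop through the fourth transposed convolution layer (its ReLU, batch norm and transposed convolution).
- Kernels 371, 372 and 375-380 correspond to bprop through the third transposed convolution layer.
- Kernels 382, 383 and 386-391 correspond to bprop through the second transposed convolution layer.
- Kernels 393, 394 and 397-400 correspond to bprop through the first transposed convolution layer. Only weight-gradient kernels are launched for ConvT1, because its input is the noise vector, which does not need a gradient.

Kernels with the op `add_` most likely correspond to gradient accumulation, i.e. adding the new gradients to the previously zeroed gradient tensors. Kernel 402 computes the mean of the discriminator's outputs on the fake batch, D(G(z)), which is used only for reporting.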
Idx | Direction | Layer | Op | Params | Kernel |
---|---|---|---|---|---|
358 | bprop | Part1:Fake:G:Tanh | tanh | [128,1,64,64] | modern::elementwise_kernel |
359 | bprop | Part1:Fake:G:ConvT5 | conv_transpose2d | T=[(128,64,32,32),(64,1,4,4)] | cudnn::gemm::computeOffsetsKernel |
360 | bprop | Part1:Fake:G:ConvT5 | conv_transpose2d | T=[(128,64,32,32),(64,1,4,4)] | volta_scudnn_128x64_relu_small_nn_v1 |
361 | bprop | Part1:Fake:G:ConvT5 | conv_transpose2d | T=[(128,64,32,32),(64,1,4,4)] | cudnn::detail::wgrad_alg0_engine |
362 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
363 | bprop | Part1:Fake:G:Relu4 | relu | [128,64,32,32] | modern::elementwise_kernel |
364 | bprop | Part1:Fake:G:BN4 | batch_norm | [128,64,32,32] | cudnn::detail::bn_bw_1C11_kernel_new |
365 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
366 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
367 | bprop | Part1:Fake:G:ConvT4 | conv_transpose2d | T=[(128,128,16,16),(128,64,4,4)] | cudnn::gemm::computeOffsetsKernel |
368 | bprop | Part1:Fake:G:ConvT4 | conv_transpose2d | T=[(128,128,16,16),(128,64,4,4)] | volta_scudnn_128x128_relu_small_nn_v1 |
369 | bprop | Part1:Fake:G:ConvT4 | conv_transpose2d | T=[(128,128,16,16),(128,64,4,4)] | cudnn::detail::wgrad_alg0_engine |
370 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
371 | bprop | Part1:Fake:G:Relu3 | relu | [128,128,16,16] | modern::elementwise_kernel |
372 | bprop | Part1:Fake:G:BN3 | batch_norm | [128,128,16,16] | cudnn::detail::bn_bw_1C11_kernel_new |
373 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
374 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
375 | bprop | Part1:Fake:G:ConvT3 | conv_transpose2d | T=[(128,256,8,8),(256,128,4,4)] | cudnn::gemm::computeOffsetsKernel |
376 | bprop | Part1:Fake:G:ConvT3 | conv_transpose2d | T=[(128,256,8,8),(256,128,4,4)] | volta_scudnn_128x128_relu_small_nn_v1 |
377 | bprop | Part1:Fake:G:ConvT3 | conv_transpose2d | T=[(128,256,8,8),(256,128,4,4)] | cudnn::gemm::computeWgradSplitKOffsetsKernel |
378 | bprop | Part1:Fake:G:ConvT3 | conv_transpose2d | T=[(128,256,8,8),(256,128,4,4)] | scalePackedTensor_kernel |
379 | bprop | Part1:Fake:G:ConvT3 | conv_transpose2d | T=[(128,256,8,8),(256,128,4,4)] | cudnn::gemm::computeWgradBOffsetsKernel |
380 | bprop | Part1:Fake:G:ConvT3 | conv_transpose2d | T=[(128,256,8,8),(256,128,4,4)] | volta_scudnn_128x128_stridedB_splitK_small_nn_v1 |
381 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
382 | bprop | Part1:Fake:G:Relu2 | relu | [128,256,8,8] | modern::elementwise_kernel |
383 | bprop | Part1:Fake:G:BN2 | batch_norm | [128,256,8,8] | cudnn::detail::bn_bw_1C11_kernel_new |
384 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
385 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
386 | bprop | Part1:Fake:G:ConvT2 | conv_transpose2d | T=[(128,512,4,4),(512,256,4,4)] | cudnn::gemm::computeOffsetsKernel |
387 | bprop | Part1:Fake:G:ConvT2 | conv_transpose2d | T=[(128,512,4,4),(512,256,4,4)] | volta_scudnn_128x64_relu_small_nn_v1 |
388 | bprop | Part1:Fake:G:ConvT2 | conv_transpose2d | T=[(128,512,4,4),(512,256,4,4)] | cudnn::gemm::computeWgradSplitKOffsetsKernel |
389 | bprop | Part1:Fake:G:ConvT2 | conv_transpose2d | T=[(128,512,4,4),(512,256,4,4)] | scalePackedTensor_kernel |
390 | bprop | Part1:Fake:G:ConvT2 | conv_transpose2d | T=[(128,512,4,4),(512,256,4,4)] | cudnn::gemm::computeWgradBOffsetsKernel |
391 | bprop | Part1:Fake:G:ConvT2 | conv_transpose2d | T=[(128,512,4,4),(512,256,4,4)] | volta_scudnn_128x128_stridedB_splitK_small_nn_v1 |
392 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
393 | bprop | Part1:Fake:G:Relu1 | relu | [128,512,4,4] | modern::elementwise_kernel |
394 | bprop | Part1:Fake:G:BN1 | batch_norm | [128,512,4,4] | cudnn::detail::bn_bw_1C11_singleread |
395 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
396 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
397 | bprop | Part1:Fake:G:ConvT1 | conv_transpose2d | T=[(128,100,1,1),(100,512,4,4)] | cudnn::gemm::computeWgradSplitKOffsetsKernel |
398 | bprop | Part1:Fake:G:ConvT1 | conv_transpose2d | T=[(128,100,1,1),(100,512,4,4)] | scalePackedTensor_kernel |
399 | bprop | Part1:Fake:G:ConvT1 | conv_transpose2d | T=[(128,100,1,1),(100,512,4,4)] | cudnn::gemm::computeWgradBOffsetsKernel |
400 | bprop | Part1:Fake:G:ConvT1 | conv_transpose2d | T=[(128,100,1,1),(100,512,4,4)] | volta_scudnn_128x64_stridedB_splitK_interior_nn_v1 |
401 | fprop | - | add_ | na=na, | modern::elementwise_kernel |
402 | fprop | Part2 | mean | [128] | reduce_kernel |
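The shapes in the `T=[...]` column follow from the output-size formula of a transposed convolution. The minimal sketch below assumes the DCGAN tutorial's settings (kernel 4, stride 1, padding 0 for ConvT1; kernel 4, stride 2, padding 1 for the others) and uses hypothetical helper names. It traces the spatial size from the 1x1 latent input up to the 64x64 image. It then checks that an ordinary convolution with the same settings maps each output size back to its input size. This is consistent with cuDNN using a forward-convolution kernel (`volta_scudnn_128x64_relu_small_nn_v1`, kernel 360) for ConvT5's data gradient.

```python
def conv_out(size, kernel, stride, pad):
    """Output size of a regular convolution: (H + 2*pad - R) // stride + 1."""
    return (size + 2 * pad - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, pad):
    """Output size of a transposed convolution (no output_padding, dilation 1):
    (H - 1) * stride - 2*pad + R."""
    return (size - 1) * stride - 2 * pad + kernel


# (kernel, stride, pad) for ConvT1..ConvT5, assuming the DCGAN tutorial settings.
GEN_LAYERS = [(4, 1, 0)] + [(4, 2, 1)] * 4

sizes = [1]  # the nz-dimensional latent vector enters as a 1x1 feature map
for k, s, p in GEN_LAYERS:
    sizes.append(conv_transpose_out(sizes[-1], k, s, p))
print(sizes)  # [1, 4, 8, 16, 32, 64]

# A regular convolution with the same settings maps each output size back to
# the corresponding input size, so a transposed convolution's data gradient
# has the shapes of an ordinary forward convolution.
for i, (k, s, p) in enumerate(GEN_LAYERS):
    assert conv_out(sizes[i + 1], k, s, p) == sizes[i]
```

The same `conv_out` formula also reproduces the P and Q values in the discriminator tables, e.g. P = 32 for Conv1 with H = 64, R = 4, ph = 1 and U = 2.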
The last step is to apply the Adam optimizer to the generator weights. The generator has 13 parameter tensors: 1 for each of the 5 transposed convolutions and 2 (weight and bias) for each of the 4 batch norms (see kernels 272-284). Adam launches 8 elementwise kernels per parameter tensor, for a total of 13 × 8 = 104 kernels (403 through 506). This is not an optimized implementation: every kernel makes a separate pass over the tensor, and a fused implementation such as the fused Adam optimizer from NVIDIA Apex can be used instead.
Idx | Direction | Layer | Op | Params | Kernel |
---|---|---|---|---|---|
403 | fprop | Part2:Optim | mul_ | [100,512,4,4];[] | modern::elementwise_kernel |
404 | fprop | Part2:Optim | add_ | [100,512,4,4];[100,512,4,4] | modern::elementwise_kernel |
405 | fprop | Part2:Optim | mul_ | [100,512,4,4];[] | modern::elementwise_kernel |
406 | fprop | Part2:Optim | addcmul_ | [100,512,4,4];[100,512,4,4];[100,512,4,4] | modern::elementwise_kernel |
407 | fprop | Part2:Optim | sqrt | [100,512,4,4] | modern::elementwise_kernel |
408 | fprop | Part2:Optim | __truediv__ | [100,512,4,4];[] | modern::elementwise_kernel |
409 | fprop | Part2:Optim | add_ | [100,512,4,4];[] | modern::elementwise_kernel |
410 | fprop | Part2:Optim | addcdiv_ | [100,512,4,4];[100,512,4,4];[100,512,4,4] | modern::elementwise_kernel |
411 | fprop | Part2:Optim | mul_ | [512];[] | modern::elementwise_kernel |
412 | fprop | Part2:Optim | add_ | [512];[512] | modern::elementwise_kernel |
413 | fprop | Part2:Optim | mul_ | [512];[] | modern::elementwise_kernel |
414 | fprop | Part2:Optim | addcmul_ | [512];[512];[512] | modern::elementwise_kernel |
415 | fprop | Part2:Optim | sqrt | [512] | modern::elementwise_kernel |
416 | fprop | Part2:Optim | __truediv__ | [512];[] | modern::elementwise_kernel |
417 | fprop | Part2:Optim | add_ | [512];[] | modern::elementwise_kernel |
418 | fprop | Part2:Optim | addcdiv_ | [512];[512];[512] | modern::elementwise_kernel |
419 | fprop | Part2:Optim | mul_ | [512];[] | modern::elementwise_kernel |
420 | fprop | Part2:Optim | add_ | [512];[512] | modern::elementwise_kernel |
421 | fprop | Part2:Optim | mul_ | [512];[] | modern::elementwise_kernel |
422 | fprop | Part2:Optim | addcmul_ | [512];[512];[512] | modern::elementwise_kernel |
423 | fprop | Part2:Optim | sqrt | [512] | modern::elementwise_kernel |
424 | fprop | Part2:Optim | __truediv__ | [512];[] | modern::elementwise_kernel |
425 | fprop | Part2:Optim | add_ | [512];[] | modern::elementwise_kernel |
426 | fprop | Part2:Optim | addcdiv_ | [512];[512];[512] | modern::elementwise_kernel |
427 | fprop | Part2:Optim | mul_ | [512,256,4,4];[] | modern::elementwise_kernel |
428 | fprop | Part2:Optim | add_ | [512,256,4,4];[512,256,4,4] | modern::elementwise_kernel |
429 | fprop | Part2:Optim | mul_ | [512,256,4,4];[] | modern::elementwise_kernel |
430 | fprop | Part2:Optim | addcmul_ | [512,256,4,4];[512,256,4,4];[512,256,4,4] | modern::elementwise_kernel |
431 | fprop | Part2:Optim | sqrt | [512,256,4,4] | modern::elementwise_kernel |
432 | fprop | Part2:Optim | __truediv__ | [512,256,4,4];[] | modern::elementwise_kernel |
433 | fprop | Part2:Optim | add_ | [512,256,4,4];[] | modern::elementwise_kernel |
434 | fprop | Part2:Optim | addcdiv_ | [512,256,4,4];[512,256,4,4];[512,256,4,4] | modern::elementwise_kernel |
435 | fprop | Part2:Optim | mul_ | [256];[] | modern::elementwise_kernel |
436 | fprop | Part2:Optim | add_ | [256];[256] | modern::elementwise_kernel |
437 | fprop | Part2:Optim | mul_ | [256];[] | modern::elementwise_kernel |
438 | fprop | Part2:Optim | addcmul_ | [256];[256];[256] | modern::elementwise_kernel |
439 | fprop | Part2:Optim | sqrt | [256] | modern::elementwise_kernel |
440 | fprop | Part2:Optim | __truediv__ | [256];[] | modern::elementwise_kernel |
441 | fprop | Part2:Optim | add_ | [256];[] | modern::elementwise_kernel |
442 | fprop | Part2:Optim | addcdiv_ | [256];[256];[256] | modern::elementwise_kernel |
443 | fprop | Part2:Optim | mul_ | [256];[] | modern::elementwise_kernel |
444 | fprop | Part2:Optim | add_ | [256];[256] | modern::elementwise_kernel |
445 | fprop | Part2:Optim | mul_ | [256];[] | modern::elementwise_kernel |
446 | fprop | Part2:Optim | addcmul_ | [256];[256];[256] | modern::elementwise_kernel |
447 | fprop | Part2:Optim | sqrt | [256] | modern::elementwise_kernel |
448 | fprop | Part2:Optim | __truediv__ | [256];[] | modern::elementwise_kernel |
449 | fprop | Part2:Optim | add_ | [256];[] | modern::elementwise_kernel |
450 | fprop | Part2:Optim | addcdiv_ | [256];[256];[256] | modern::elementwise_kernel |
451 | fprop | Part2:Optim | mul_ | [256,128,4,4];[] | modern::elementwise_kernel |
452 | fprop | Part2:Optim | add_ | [256,128,4,4];[256,128,4,4] | modern::elementwise_kernel |
453 | fprop | Part2:Optim | mul_ | [256,128,4,4];[] | modern::elementwise_kernel |
454 | fprop | Part2:Optim | addcmul_ | [256,128,4,4];[256,128,4,4];[256,128,4,4] | modern::elementwise_kernel |
455 | fprop | Part2:Optim | sqrt | [256,128,4,4] | modern::elementwise_kernel |
456 | fprop | Part2:Optim | __truediv__ | [256,128,4,4];[] | modern::elementwise_kernel |
457 | fprop | Part2:Optim | add_ | [256,128,4,4];[] | modern::elementwise_kernel |
458 | fprop | Part2:Optim | addcdiv_ | [256,128,4,4];[256,128,4,4];[256,128,4,4] | modern::elementwise_kernel |
459 | fprop | Part2:Optim | mul_ | [128];[] | modern::elementwise_kernel |
460 | fprop | Part2:Optim | add_ | [128];[128] | modern::elementwise_kernel |
461 | fprop | Part2:Optim | mul_ | [128];[] | modern::elementwise_kernel |
462 | fprop | Part2:Optim | addcmul_ | [128];[128];[128] | modern::elementwise_kernel |
463 | fprop | Part2:Optim | sqrt | [128] | modern::elementwise_kernel |
464 | fprop | Part2:Optim | __truediv__ | [128];[] | modern::elementwise_kernel |
465 | fprop | Part2:Optim | add_ | [128];[] | modern::elementwise_kernel |
466 | fprop | Part2:Optim | addcdiv_ | [128];[128];[128] | modern::elementwise_kernel |
467 | fprop | Part2:Optim | mul_ | [128];[] | modern::elementwise_kernel |
468 | fprop | Part2:Optim | add_ | [128];[128] | modern::elementwise_kernel |
469 | fprop | Part2:Optim | mul_ | [128];[] | modern::elementwise_kernel |
470 | fprop | Part2:Optim | addcmul_ | [128];[128];[128] | modern::elementwise_kernel |
471 | fprop | Part2:Optim | sqrt | [128] | modern::elementwise_kernel |
472 | fprop | Part2:Optim | __truediv__ | [128];[] | modern::elementwise_kernel |
473 | fprop | Part2:Optim | add_ | [128];[] | modern::elementwise_kernel |
474 | fprop | Part2:Optim | addcdiv_ | [128];[128];[128] | modern::elementwise_kernel |
475 | fprop | Part2:Optim | mul_ | [128,64,4,4];[] | modern::elementwise_kernel |
476 | fprop | Part2:Optim | add_ | [128,64,4,4];[128,64,4,4] | modern::elementwise_kernel |
477 | fprop | Part2:Optim | mul_ | [128,64,4,4];[] | modern::elementwise_kernel |
478 | fprop | Part2:Optim | addcmul_ | [128,64,4,4];[128,64,4,4];[128,64,4,4] | modern::elementwise_kernel |
479 | fprop | Part2:Optim | sqrt | [128,64,4,4] | modern::elementwise_kernel |
480 | fprop | Part2:Optim | __truediv__ | [128,64,4,4];[] | modern::elementwise_kernel |
481 | fprop | Part2:Optim | add_ | [128,64,4,4];[] | modern::elementwise_kernel |
482 | fprop | Part2:Optim | addcdiv_ | [128,64,4,4];[128,64,4,4];[128,64,4,4] | modern::elementwise_kernel |
483 | fprop | Part2:Optim | mul_ | [64];[] | modern::elementwise_kernel |
484 | fprop | Part2:Optim | add_ | [64];[64] | modern::elementwise_kernel |
485 | fprop | Part2:Optim | mul_ | [64];[] | modern::elementwise_kernel |
486 | fprop | Part2:Optim | addcmul_ | [64];[64];[64] | modern::elementwise_kernel |
487 | fprop | Part2:Optim | sqrt | [64] | modern::elementwise_kernel |
488 | fprop | Part2:Optim | __truediv__ | [64];[] | modern::elementwise_kernel |
489 | fprop | Part2:Optim | add_ | [64];[] | modern::elementwise_kernel |
490 | fprop | Part2:Optim | addcdiv_ | [64];[64];[64] | modern::elementwise_kernel |
491 | fprop | Part2:Optim | mul_ | [64];[] | modern::elementwise_kernel |
492 | fprop | Part2:Optim | add_ | [64];[64] | modern::elementwise_kernel |
493 | fprop | Part2:Optim | mul_ | [64];[] | modern::elementwise_kernel |
494 | fprop | Part2:Optim | addcmul_ | [64];[64];[64] | modern::elementwise_kernel |
495 | fprop | Part2:Optim | sqrt | [64] | modern::elementwise_kernel |
496 | fprop | Part2:Optim | __truediv__ | [64];[] | modern::elementwise_kernel |
497 | fprop | Part2:Optim | add_ | [64];[] | modern::elementwise_kernel |
498 | fprop | Part2:Optim | addcdiv_ | [64];[64];[64] | modern::elementwise_kernel |
499 | fprop | Part2:Optim | mul_ | [64,1,4,4];[] | modern::elementwise_kernel |
500 | fprop | Part2:Optim | add_ | [64,1,4,4];[64,1,4,4] | modern::elementwise_kernel |
501 | fprop | Part2:Optim | mul_ | [64,1,4,4];[] | modern::elementwise_kernel |
502 | fprop | Part2:Optim | addcmul_ | [64,1,4,4];[64,1,4,4];[64,1,4,4] | modern::elementwise_kernel |
503 | fprop | Part2:Optim | sqrt | [64,1,4,4] | modern::elementwise_kernel |
504 | fprop | Part2:Optim | __truediv__ | [64,1,4,4];[] | modern::elementwise_kernel |
505 | fprop | Part2:Optim | add_ | [64,1,4,4];[] | modern::elementwise_kernel |
506 | fprop | Part2:Optim | addcdiv_ | [64,1,4,4];[64,1,4,4];[64,1,4,4] | modern::elementwise_kernel |
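To make the cost concrete, here is a plain-Python sketch (not PyTorch's or Apex's actual code) of the unfused update. Each of the 8 steps is a separate full pass over the parameter tensor, matching the 8 kernels per tensor in the table, and a fused version computes the same result in a single pass per element. The hyperparameters are assumptions: lr = 2e-4 and beta1 = 0.5 are taken from the DCGAN tutorial, and beta2 and eps use the `torch.optim.Adam` defaults.

```python
import math

# Assumed hyperparameters: DCGAN tutorial lr/beta1, torch.optim.Adam defaults otherwise.
LR, BETA1, BETA2, EPS = 2e-4, 0.5, 0.999, 1e-8


def adam_unfused(param, grad, exp_avg, exp_avg_sq, step):
    """Adam update as 8 separate full passes over the tensor (lists are
    modified in place), mirroring the 8 kernels per parameter in the table."""
    bias_correction1 = 1 - BETA1 ** step
    bias_correction2 = 1 - BETA2 ** step
    step_size = LR / bias_correction1
    exp_avg[:] = [m * BETA1 for m in exp_avg]                          # 1. mul_
    exp_avg[:] = [m + (1 - BETA1) * g for m, g in zip(exp_avg, grad)]  # 2. add_
    exp_avg_sq[:] = [v * BETA2 for v in exp_avg_sq]                    # 3. mul_
    exp_avg_sq[:] = [v + (1 - BETA2) * g * g
                     for v, g in zip(exp_avg_sq, grad)]                # 4. addcmul_
    denom = [math.sqrt(v) for v in exp_avg_sq]                         # 5. sqrt
    denom = [d / math.sqrt(bias_correction2) for d in denom]           # 6. __truediv__
    denom = [d + EPS for d in denom]                                   # 7. add_
    param[:] = [p - step_size * m / d
                for p, m, d in zip(param, exp_avg, denom)]             # 8. addcdiv_


def adam_fused(param, grad, exp_avg, exp_avg_sq, step):
    """Same update in a single pass: each element is read and written once."""
    bias_correction1 = 1 - BETA1 ** step
    bias_correction2 = 1 - BETA2 ** step
    step_size = LR / bias_correction1
    for i, g in enumerate(grad):
        m = exp_avg[i] * BETA1 + (1 - BETA1) * g
        v = exp_avg_sq[i] * BETA2 + (1 - BETA2) * g * g
        exp_avg[i], exp_avg_sq[i] = m, v
        param[i] -= step_size * m / (math.sqrt(v) / math.sqrt(bias_correction2) + EPS)


KERNELS_PER_TENSOR = 8      # passes in adam_unfused
NUM_GENERATOR_TENSORS = 13  # kernels 272-284
print(KERNELS_PER_TENSOR * NUM_GENERATOR_TENSORS)  # 104 kernels (403-506)

grad = [0.1, -0.2, 0.3]
p1, m1, v1 = [1.0] * 3, [0.0] * 3, [0.0] * 3
p2, m2, v2 = [1.0] * 3, [0.0] * 3, [0.0] * 3
adam_unfused(p1, grad, m1, v1, step=1)
adam_fused(p2, grad, m2, v2, step=1)
print(p1)
print(p2)
```

Both versions produce the same parameters. On the first step, each element moves by roughly lr in the direction opposite to its gradient's sign. The fused loop touches each element once instead of eight times, which is the memory-traffic saving a fused optimizer kernel aims for.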