Deep Learning Recommendation Model (DLRM)

June 4, 2020

Deep Learning Recommendation Model (DLRM) is a deep neural network for personalization and recommendation developed by Facebook. Because of its importance to large internet companies, it is one of the benchmarks in MLPerf. It is a simple network with 26 Embedding layers, 8 Linear layers and a Batched Matrix Multiply for calculating pairwise interactions. However, a single training step (forward and backward propagation) invokes about 800 GPU CUDA kernels (depending on the batch size). The best way to understand a DL network and its GPU performance is to understand every single CUDA kernel, i.e. which layer of the network invoked the kernel, with what arguments (tensor shapes and datatypes) and in which direction (forward propagation or backward propagation).

In this blog post, I will categorize every kernel used in the training of DLRM. All the information in the tables below was obtained using Nvidia's PyTorch Profiler, PyProf, on a Turing T4 GPU. The information below is only a subset of what is provided by PyProf. The code and instructions for obtaining a detailed profile are here. Note that different GPUs will have slightly different kernel names, e.g. volta_* as opposed to turing_*.

Model Architecture

[Figure: DLRM Model Architecture. Image by Nvidia Deep Learning Examples.]

Model Parameters

The parameters in the profiled code are as follows. These parameters were obtained from Nvidia Deep Learning Examples. The only constraint is that the output size of the bottom MLP must equal the embedding size. The vocabulary sizes of the categorical embeddings are for a synthetic dataset and not the real Criteo Terabyte Dataset. For the purpose of this article, I modified the shape of the categorical embeddings from [100000]*26 to [100000 + 100*x for x in range(26)], so that it's easier to differentiate and understand the kernels associated with each of the 26 embeddings, both in forward and backward propagation. With the Criteo dataset, the categorical embedding tables become really big. With an embedding size of 128, the embedding tables have roughly 24 billion parameters (about 96 GB at 4 bytes per parameter), and it is not possible to store them on a single GPU.

Bottom MLP = [512, 256, 128]
Top MLP = [1024, 1024, 512, 256, 1]
Numerical Features (N) = 13
Categorical Features (M) = 26
Categorical Embeddings = [100000 + 100*x for x in range(26)]
Embedding Size = 128
Batch Size = 32768
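
To make the sizes above concrete, here is a small sketch that recomputes them from the listed parameters. The variable names are mine, and the 4 bytes per parameter is an assumption (fp32 storage) that matches the 24 billion / 96 GB figure quoted above.

```python
# Configuration copied from the parameter list above.
bottom_mlp = [512, 256, 128]
top_mlp = [1024, 1024, 512, 256, 1]
embedding_size = 128
vocab_sizes = [100000 + 100 * x for x in range(26)]

# The bottom MLP output is stacked with the embedding vectors in the
# interaction layer, so its width must equal the embedding size.
assert bottom_mlp[-1] == embedding_size

# Parameter count and footprint of the synthetic embedding tables,
# assuming 4 bytes (fp32) per parameter.
embedding_params = sum(v * embedding_size for v in vocab_sizes)
embedding_bytes_fp32 = embedding_params * 4

print(f"embedding parameters: {embedding_params:,}")
print(f"fp32 size: {embedding_bytes_fp32 / 1e9:.2f} GB")

# Sanity check of the Criteo figure quoted above: ~24 billion fp32 values.
criteo_bytes = 24e9 * 4
print(f"Criteo-scale tables: ~{criteo_bytes / 1e9:.0f} GB")
```

The synthetic tables are therefore a little over 1 GB, which is why they fit on a single T4 while the Criteo-scale tables do not.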

GPU Kernels

The tables below show the GPU kernels invoked in 1 training step. For every GPU kernel we show the direction (fprop, bprop), name of the layer, name of the operation, and the input tensor shapes / matrix dimensions for the operation. PyProf provides a lot of additional information for every GPU kernel e.g. grid dimensions, block dimensions, silicon time, datatypes, flops, bytes, tensor core usage and so on.

Bottom MLP (fprop)

Kernels 1 through 11 correspond to the forward propagation through the 3 Linear layers in the Bottom MLP. Kernels 2, 3 and 4 correspond to the bias, linear and relu operations of the first Linear layer respectively. Kernels 5-7 and 8-10 correspond to the second and third Linear layers respectively. Kernels 1 and 11 correspond to data type conversions from fp32 to fp16 and back.

Idx Direction Layer Op Params GPU Kernel
1 fprop Bottom_MLP to T=(32768,13) legacy::elementwise_kernel
2 fprop Bottom_MLP bias M=512,N=32768 legacy::elementwise_kernel
3 fprop Bottom_MLP linear M=512,N=32768,K=13 volta_fp16_sgemm_fp16_128x128_tn
4 fprop Bottom_MLP relu T=(32768,512) modern::elementwise_kernel
5 fprop Bottom_MLP bias M=256,N=32768 legacy::elementwise_kernel
6 fprop Bottom_MLP linear M=256,N=32768,K=512 turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_tn
7 fprop Bottom_MLP relu T=(32768,256) modern::elementwise_kernel
8 fprop Bottom_MLP bias M=128,N=32768 legacy::elementwise_kernel
9 fprop Bottom_MLP linear M=128,N=32768,K=256 turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn
10 fprop Bottom_MLP relu T=(32768,128) modern::elementwise_kernel
11 fprop Bottom_MLP to T=(32768,128) legacy::elementwise_kernel

Embedding (fprop)

Kernel 12 corresponds to a data type cast operation on the index tensor. Kernels 13 through 38 correspond to the forward propagation through the 26 Embedding layers. You can notice the size of the index tensor (batch size) and the size of the embedding tables (vocab x embedding size).

Idx Direction Layer Op Params GPU Kernel
12 fprop - to T=(32768,26) legacy::elementwise_kernel
13 fprop Embedding_0 embedding I=(32768), E=(100000,128) indexSelectLargeIndex
14 fprop Embedding_1 embedding I=(32768), E=(100100,128) indexSelectLargeIndex
15 fprop Embedding_2 embedding I=(32768), E=(100200,128) indexSelectLargeIndex
16 fprop Embedding_3 embedding I=(32768), E=(100300,128) indexSelectLargeIndex
17 fprop Embedding_4 embedding I=(32768), E=(100400,128) indexSelectLargeIndex
18 fprop Embedding_5 embedding I=(32768), E=(100500,128) indexSelectLargeIndex
19 fprop Embedding_6 embedding I=(32768), E=(100600,128) indexSelectLargeIndex
20 fprop Embedding_7 embedding I=(32768), E=(100700,128) indexSelectLargeIndex
21 fprop Embedding_8 embedding I=(32768), E=(100800,128) indexSelectLargeIndex
22 fprop Embedding_9 embedding I=(32768), E=(100900,128) indexSelectLargeIndex
23 fprop Embedding_10 embedding I=(32768), E=(101000,128) indexSelectLargeIndex
24 fprop Embedding_11 embedding I=(32768), E=(101100,128) indexSelectLargeIndex
25 fprop Embedding_12 embedding I=(32768), E=(101200,128) indexSelectLargeIndex
26 fprop Embedding_13 embedding I=(32768), E=(101300,128) indexSelectLargeIndex
27 fprop Embedding_14 embedding I=(32768), E=(101400,128) indexSelectLargeIndex
28 fprop Embedding_15 embedding I=(32768), E=(101500,128) indexSelectLargeIndex
29 fprop Embedding_16 embedding I=(32768), E=(101600,128) indexSelectLargeIndex
30 fprop Embedding_17 embedding I=(32768), E=(101700,128) indexSelectLargeIndex
31 fprop Embedding_18 embedding I=(32768), E=(101800,128) indexSelectLargeIndex
32 fprop Embedding_19 embedding I=(32768), E=(101900,128) indexSelectLargeIndex
33 fprop Embedding_20 embedding I=(32768), E=(102000,128) indexSelectLargeIndex
34 fprop Embedding_21 embedding I=(32768), E=(102100,128) indexSelectLargeIndex
35 fprop Embedding_22 embedding I=(32768), E=(102200,128) indexSelectLargeIndex
36 fprop Embedding_23 embedding I=(32768), E=(102300,128) indexSelectLargeIndex
37 fprop Embedding_24 embedding I=(32768), E=(102400,128) indexSelectLargeIndex
38 fprop Embedding_25 embedding I=(32768), E=(102500,128) indexSelectLargeIndex
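
Conceptually, each of these lookups is a row gather: one row of the table is copied per index in the batch. The sketch below illustrates that with toy sizes in plain Python. It is not the `indexSelectLargeIndex` kernel itself, and the function name is hypothetical.

```python
import random

def embedding_lookup(table, indices):
    """Return one table row per index (a row gather)."""
    return [table[i] for i in indices]

# Toy sizes standing in for E=(100000,128) and I=(32768).
vocab, dim, batch = 10, 4, 6
random.seed(0)
table = [[random.random() for _ in range(dim)] for _ in range(vocab)]
indices = [random.randrange(vocab) for _ in range(batch)]

out = embedding_lookup(table, indices)
print(len(out), len(out[0]))  # output shape is batch x dim
```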

Pairwise Interaction (fprop)

Kernels 39 through 41 correspond to the pairwise interaction layer. Pairwise interaction is implemented as a dot product of every vector with all other vectors. If there are $N$ vectors (of the same length), then we will have ${N \choose 2} = \frac{N\times(N-1)}{2}$ interactions (scalar outputs). This can be implemented by creating a matrix $A$ containing the $N$ vectors, calculating $A A^T$ and then taking the lower (or upper) triangular matrix, since the output is symmetric. In some sense, this is similar to attention. When this operation is done for multiple batches (32768 in this example), it is implemented as a Batched Matrix Multiplication (BMM).

Kernel 39 concatenates the outputs of the 26 embedding lookups and the Bottom MLP, resulting in 27 vectors. Kernel 40 computes the pairwise interaction of 27 vectors, each of length 128, using a BMM operation. Kernel 41 extracts the lower (or upper) triangular matrix, which gives us ${27 \choose 2} = 351$ values.

Idx Direction Layer Op Params GPU Kernel
39 fprop Interaction cat T=[(32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128)] CatArrayBatchedCopy
40 fprop Interaction bmm B=32768,M=27,N=27,K=128 volta_sgemm_128x64_tn
41 fprop Interaction index na=na legacy::elementwise_kernel
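
The interaction for a single sample can be sketched in a few lines of plain Python. This is only an illustration of the $A A^T$ plus triangle-extraction idea. The choice of the strictly lower triangle and the row-by-row output order are my assumptions, not something the profile confirms.

```python
from math import comb

def pairwise_interactions(vectors):
    """Dot product of every vector with every other vector.

    Computes Z = A A^T for one sample and returns the strictly lower
    triangle, flattened row by row.
    """
    n = len(vectors)
    z = [[sum(a * b for a, b in zip(vectors[i], vectors[j]))
          for j in range(n)] for i in range(n)]
    return [z[i][j] for i in range(n) for j in range(i)]

# Toy sample: 4 vectors of length 3 (the profiled model uses 27 x 128).
sample = [[1.0, 0.0, 2.0],
          [0.0, 1.0, 1.0],
          [3.0, 1.0, 0.0],
          [1.0, 1.0, 1.0]]
interactions = pairwise_interactions(sample)
print(interactions)   # 4 choose 2 = 6 values
print(comb(27, 2))    # number of interactions for 27 vectors
```

In the profiled model the same computation runs for all 32768 samples at once, which is what makes it a batched matrix multiply.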

Concat (fprop)

Kernels 42 through 44 concatenate the output of the Pairwise Interaction layer with the output of the Bottom MLP and pad it with 1 element before feeding it to the Top MLP.

Idx Direction Layer Op Params GPU Kernel
42 fprop Concat cat T=[(32768,128),(32768,351),(32768,1)] legacy::elementwise_kernel
43 fprop Concat cat T=[(32768,128),(32768,351),(32768,1)] legacy::elementwise_kernel
44 fprop Concat cat T=[(32768,128),(32768,351),(32768,1)] legacy::elementwise_kernel

Top MLP (fprop)

Kernels 45 through 60 correspond to the forward propagation through the 5 Linear layers in the Top MLP. Kernels 46, 47 and 48 correspond to the bias, linear and relu operations of the first Linear layer respectively. Kernels 49-51, 52-54, 55-57 and 58-59 correspond to the second, third, fourth and fifth Linear layers respectively (the last layer has no relu). Kernels 45 and 60 correspond to data type conversions from fp32 to fp16 and back.

Idx Direction Layer Op Params GPU Kernel
45 fprop Top_MLP to T=(32768,480) legacy::elementwise_kernel
46 fprop Top_MLP bias M=1024,N=32768 legacy::elementwise_kernel
47 fprop Top_MLP linear M=1024,N=32768,K=480 turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn
48 fprop Top_MLP relu T=(32768,1024) modern::elementwise_kernel
49 fprop Top_MLP bias M=1024,N=32768 legacy::elementwise_kernel
50 fprop Top_MLP linear M=1024,N=32768,K=1024 turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn
51 fprop Top_MLP relu T=(32768,1024) modern::elementwise_kernel
52 fprop Top_MLP bias M=512,N=32768 legacy::elementwise_kernel
53 fprop Top_MLP linear M=512,N=32768,K=1024 turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn
54 fprop Top_MLP relu T=(32768,512) modern::elementwise_kernel
55 fprop Top_MLP bias M=256,N=32768 legacy::elementwise_kernel
56 fprop Top_MLP linear M=256,N=32768,K=512 turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_tn
57 fprop Top_MLP relu T=(32768,256) modern::elementwise_kernel
58 fprop Top_MLP bias M=1,N=32768 legacy::elementwise_kernel
59 fprop Top_MLP linear M=1,N=32768,K=256 gemv2T_kernel_val
60 fprop Top_MLP to T=(32768,1) legacy::elementwise_kernel
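
The M, N, K values in the linear rows follow directly from the layer sizes. The sketch below derives them for both MLPs, assuming the convention that appears in the tables (M = output features, N = batch size, K = input features). The helper name is hypothetical.

```python
def linear_fprop_gemms(in_features, layer_sizes, batch):
    """Return (M, N, K) for the forward GEMM of each Linear layer.

    Follows the convention of the tables: M = output features,
    N = batch size, K = input features.
    """
    dims = []
    k = in_features
    for m in layer_sizes:
        dims.append((m, batch, k))
        k = m
    return dims

batch = 32768
embedding_size = 128
num_vectors = 26 + 1                      # 26 embeddings + Bottom MLP output
num_interactions = num_vectors * (num_vectors - 1) // 2
top_in = embedding_size + num_interactions + 1   # +1 is the padding element

bottom = linear_fprop_gemms(13, [512, 256, 128], batch)
top = linear_fprop_gemms(top_in, [1024, 1024, 512, 256, 1], batch)
for name, dims in (("Bottom_MLP", bottom), ("Top_MLP", top)):
    for m, n, k in dims:
        print(f"{name} linear M={m},N={n},K={k}")
```

The Top MLP input width works out to 128 + 351 + 1 = 480, which matches K=480 in kernel 47.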

Loss (fprop)

Kernels 61 through 75 compute the BCE loss.

Idx Direction Layer Op Params GPU Kernel
61 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] modern::elementwise_kernel
62 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] kernelPointwiseApply1
63 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] modern::elementwise_kernel
64 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] legacy::elementwise_kernel
65 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] modern::elementwise_kernel
66 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] modern::elementwise_kernel
67 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] modern::elementwise_kernel
68 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] kernelPointwiseApply1
69 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] modern::elementwise_kernel
70 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] modern::elementwise_kernel
71 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] kernelPointwiseApply1
72 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] modern::elementwise_kernel
73 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] modern::elementwise_kernel
74 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] modern::elementwise_kernel
75 fprop - binary_cross_entropy_with_logits T=[(32768), (32768)] reduce_kernel
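
For reference, binary cross-entropy on logits is usually written in a numerically stable form, which helps explain why the loss turns into a chain of elementwise operations followed by a reduction. The sketch below is a plain Python version of that formula. I am not claiming it maps one-to-one onto kernels 61 through 75; how PyTorch splits the computation into kernels depends on the version.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which avoids overflow in exp() for large |x|.
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

def bce_naive(logits, targets):
    """Reference: sigmoid followed by the textbook BCE formula."""
    total = 0.0
    for x, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-x))
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(logits)

logits = [0.5, -1.2, 2.0, 0.0]
targets = [1.0, 0.0, 1.0, 0.0]
print(bce_with_logits(logits, targets), bce_naive(logits, targets))
```

Both functions agree on moderate inputs. Only the stable form stays finite for very large positive or negative logits.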

Gradient

In this profile, we are not accumulating gradients, and hence kernels 76 through 117 zero out the buffers used for storing the gradients during back propagation. Note that there are separate buffers for storing the gradients of the weights, biases and embeddings. Kernels 76-81, 82-91 and 92-117 correspond to the Bottom MLP, Top MLP and the Embedding layers respectively.

Idx Direction Layer Op Params GPU Kernel
76 fprop - zero T=[(512,13)] modern::elementwise_kernel
77 fprop - zero T=[(512)] modern::elementwise_kernel
78 fprop - zero T=[(256,512)] modern::elementwise_kernel
79 fprop - zero T=[(256)] modern::elementwise_kernel
80 fprop - zero T=[(128,256)] modern::elementwise_kernel
81 fprop - zero T=[(128)] modern::elementwise_kernel
82 fprop - zero T=[(1024,480)] modern::elementwise_kernel
83 fprop - zero T=[(1024)] modern::elementwise_kernel
84 fprop - zero T=[(1024,1024)] modern::elementwise_kernel
85 fprop - zero T=[(1024)] modern::elementwise_kernel
86 fprop - zero T=[(512,1024)] modern::elementwise_kernel
87 fprop - zero T=[(512)] modern::elementwise_kernel
88 fprop - zero T=[(256,512)] modern::elementwise_kernel
89 fprop - zero T=[(256)] modern::elementwise_kernel
90 fprop - zero T=[(1,256)] modern::elementwise_kernel
91 fprop - zero T=[(1)] modern::elementwise_kernel
92 fprop - zero T=[(100000,128)] modern::elementwise_kernel
93 fprop - zero T=[(100100,128)] modern::elementwise_kernel
94 fprop - zero T=[(100200,128)] modern::elementwise_kernel
95 fprop - zero T=[(100300,128)] modern::elementwise_kernel
96 fprop - zero T=[(100400,128)] modern::elementwise_kernel
97 fprop - zero T=[(100500,128)] modern::elementwise_kernel
98 fprop - zero T=[(100600,128)] modern::elementwise_kernel
99 fprop - zero T=[(100700,128)] modern::elementwise_kernel
100 fprop - zero T=[(100800,128)] modern::elementwise_kernel
101 fprop - zero T=[(100900,128)] modern::elementwise_kernel
102 fprop - zero T=[(101000,128)] modern::elementwise_kernel
103 fprop - zero T=[(101100,128)] modern::elementwise_kernel
104 fprop - zero T=[(101200,128)] modern::elementwise_kernel
105 fprop - zero T=[(101300,128)] modern::elementwise_kernel
106 fprop - zero T=[(101400,128)] modern::elementwise_kernel
107 fprop - zero T=[(101500,128)] modern::elementwise_kernel
108 fprop - zero T=[(101600,128)] modern::elementwise_kernel
109 fprop - zero T=[(101700,128)] modern::elementwise_kernel
110 fprop - zero T=[(101800,128)] modern::elementwise_kernel
111 fprop - zero T=[(101900,128)] modern::elementwise_kernel
112 fprop - zero T=[(102000,128)] modern::elementwise_kernel
113 fprop - zero T=[(102100,128)] modern::elementwise_kernel
114 fprop - zero T=[(102200,128)] modern::elementwise_kernel
115 fprop - zero T=[(102300,128)] modern::elementwise_kernel
116 fprop - zero T=[(102400,128)] modern::elementwise_kernel
117 fprop - zero T=[(102500,128)] modern::elementwise_kernel
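
The 42 buffer shapes in this table can be regenerated from the model parameters: one weight and one bias per Linear layer, plus one table per embedding. The sketch below does that. The helper name is hypothetical, and the ordering simply mirrors the table.

```python
def mlp_param_shapes(in_features, layer_sizes):
    """Weight (out, in) and bias (out,) shapes for a stack of Linear layers."""
    shapes = []
    k = in_features
    for m in layer_sizes:
        shapes.append((m, k))   # weight gradient buffer
        shapes.append((m,))     # bias gradient buffer
        k = m
    return shapes

grad_buffers = (
    mlp_param_shapes(13, [512, 256, 128])                 # Bottom MLP: 6 buffers
    + mlp_param_shapes(480, [1024, 1024, 512, 256, 1])    # Top MLP: 10 buffers
    + [(100000 + 100 * x, 128) for x in range(26)]        # Embeddings: 26 buffers
)
for idx, shape in enumerate(grad_buffers, start=76):
    print(idx, "zero", shape)
```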

Loss Scaling

Kernels 118 through 122 most likely correspond to loss scaling and the beginning of back propagation.

Idx Direction Layer Op Params GPU Kernel
118 fprop - __imul__ T=[(1)] legacy::elementwise_kernel
119 fprop - __mul__ T=[(1)] legacy::elementwise_kernel
120 fprop - backward - legacy::elementwise_kernel
121 bprop - __mul__ T=[(1)] legacy::elementwise_kernel
122 bprop - __imul__ T=[(1)] legacy::elementwise_kernel
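
As background on why loss scaling is used in mixed precision at all, the sketch below shows a small gradient underflowing to zero in fp16 and surviving once it is scaled. It uses Python's `struct` half-precision format to emulate fp16 rounding. The scale factor and the gradient value are hypothetical and are not taken from the profile.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

loss_scale = 1024.0   # hypothetical static scale, not from the profile
tiny_grad = 1e-8      # hypothetical gradient, below the fp16 subnormal range

# Without scaling, the gradient underflows to zero in fp16.
unscaled = to_fp16(tiny_grad)

# With scaling, the loss (and hence every gradient) is multiplied by
# loss_scale before backward, then divided by it again in higher precision.
scaled = to_fp16(tiny_grad * loss_scale)
recovered = scaled / loss_scale

print(unscaled, recovered)
```

The multiply before `backward` and the multiply after it in the table are consistent with this scale-then-unscale pattern, although the profile alone does not prove it.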

Loss (bprop)

Kernels 123 through 126 correspond to the backward propagation through the Loss layer.

Idx Direction Layer Op Params GPU Kernel
123 bprop - binary_cross_entropy_with_logits T=[(32768), (32768)] modern::elementwise_kernel
124 bprop - binary_cross_entropy_with_logits T=[(32768), (32768)] legacy::elementwise_kernel
125 bprop - binary_cross_entropy_with_logits T=[(32768), (32768)] legacy::elementwise_kernel
126 bprop - binary_cross_entropy_with_logits T=[(32768), (32768)] modern::elementwise_kernel

Top MLP (bprop)

Kernels 127 through 149 correspond to the backward propagation through the Top MLP layer. Kernels 128-130 calculate the data gradient and weight gradient through the last Linear layer. Kernel 131 calculates the bias gradient. Likewise, kernels 132-136, 137-140, 141-144 and 145-148 correspond to the backward propagation through the fourth, third, second and first Linear layers of the Top MLP respectively.

Idx Direction Layer Op Params GPU Kernel
127 bprop - to na=na legacy::elementwise_kernel
128 bprop Top_MLP linear M=256,N=32768,K=1 gemmk1_kernel
129 bprop Top_MLP linear M=256,N=1,K=32768 gemv2N_kernel
130 bprop Top_MLP linear X=(32768,256),W=(1,256) splitKreduce_kernel
131 bprop - sum na=na reduce_kernel
132 bprop Top_MLP relu T=(32768,256) modern::elementwise_kernel
133 bprop Top_MLP linear M=512,N=32768,K=256 turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn
134 bprop Top_MLP linear M=512,N=256,K=32768 turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt
135 bprop Top_MLP linear X=(32768,512),W=(256,512) splitKreduce_kernel
136 bprop - sum na=na reduce_kernel
137 bprop Top_MLP relu T=(32768,512) modern::elementwise_kernel
138 bprop Top_MLP linear M=1024,N=32768,K=512 turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn
139 bprop Top_MLP linear M=1024,N=512,K=32768 turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt
140 bprop - sum na=na reduce_kernel
141 bprop Top_MLP relu T=(32768,1024) modern::elementwise_kernel
142 bprop Top_MLP linear M=1024,N=32768,K=1024 turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn
143 bprop Top_MLP linear M=1024,N=1024,K=32768 turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt
144 bprop - sum na=na reduce_kernel
145 bprop Top_MLP relu T=(32768,1024) modern::elementwise_kernel
146 bprop Top_MLP linear M=480,N=32768,K=1024 turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn
147 bprop Top_MLP linear M=480,N=1024,K=32768 turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt
148 bprop - sum na=na reduce_kernel
149 bprop - to na=na legacy::elementwise_kernel
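
Each Linear layer produces two GEMMs in the backward pass: one for the data gradient and one for the weight gradient. The sketch below derives their M, N, K values from the layer sizes, using the convention I read off the table (for a layer with k inputs and m outputs: data gradient M=k, N=batch, K=m; weight gradient M=k, N=m, K=batch). The helper name is hypothetical.

```python
def linear_bprop_gemms(in_features, layer_sizes, batch):
    """Return (dgrad, wgrad) GEMM shapes per Linear layer, last layer first.

    For a layer with k inputs and m outputs:
      data gradient:   M=k, N=batch, K=m
      weight gradient: M=k, N=m,     K=batch
    """
    ins = [in_features] + layer_sizes[:-1]
    out = []
    for k, m in reversed(list(zip(ins, layer_sizes))):
        out.append(((k, batch, m), (k, m, batch)))
    return out

gemms = linear_bprop_gemms(480, [1024, 1024, 512, 256, 1], 32768)
for dgrad, wgrad in gemms:
    print("dgrad M=%d,N=%d,K=%d" % dgrad, "| wgrad M=%d,N=%d,K=%d" % wgrad)
```

The weight-gradient GEMMs reduce over the batch dimension (K=32768), which is a likely reason for the split-K reduction kernels that follow some of them.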

Pairwise Interaction (bprop)

Kernels 150 through 165 correspond to the backward propagation through the pairwise interaction layer. Kernels 150-161 probably correspond to the lower (or upper) triangular matrix indexing. Kernels 162 and 163 correspond to the two gradients through the BMM operation. The gradients correspond to the same input and therefore kernels 164 and 165 are probably adding the gradients.

Idx Direction Layer Op Params GPU Kernel
150 bprop Interaction index na=na modern::elementwise_kernel
151 bprop Interaction index na=na legacy::elementwise_kernel
152 bprop Interaction index na=na kernelPointwiseApply2
153 bprop Interaction index na=na modern::elementwise_kernel
154 bprop Interaction index na=na kernelPointwiseApply2
155 bprop Interaction index na=na modern::elementwise_kernel
156 bprop Interaction index na=na modern::elementwise_kernel
157 bprop Interaction index na=na modern::elementwise_kernel
158 bprop Interaction index na=na thrust::cuda_cub::core::_kernel_agent
159 bprop Interaction index na=na thrust::cuda_cub::core::_kernel_agent
160 bprop Interaction index na=na indexing_backward_kernel
161 bprop - Slice na=na modern::elementwise_kernel
162 bprop Interaction bmm B=32768,M=27,N=27,K=128 volta_sgemm_128x64_nt
163 bprop Interaction bmm B=32768,M=27,N=27,K=128 volta_sgemm_128x64_nn
164 bprop - add na=na legacy::elementwise_kernel
165 bprop - add na=na legacy::elementwise_kernel
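
The reason for two BMMs followed by adds can be seen from the gradient of $Z = A A^T$. Since $A$ appears as both operands, an upstream gradient $G$ contributes $G A$ through the left operand and $G^T A$ through the right one, and the two are summed. The sketch below checks this for one sample with a finite difference. It illustrates the math only and is not a reconstruction of the actual kernels.

```python
def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(x):
    return [list(row) for row in zip(*x)]

def interaction_loss(a, g):
    """Scalar L = sum_ij G_ij * (A A^T)_ij, used only for checking."""
    z = matmul(a, transpose(a))
    n = len(a)
    return sum(g[i][j] * z[i][j] for i in range(n) for j in range(n))

def interaction_backward(a, g):
    """Gradient of L w.r.t. A: two matrix products followed by an add."""
    grad_left = matmul(g, a)               # A as the left operand
    grad_right = matmul(transpose(g), a)   # A as the right operand
    return [[l + r for l, r in zip(row_l, row_r)]
            for row_l, row_r in zip(grad_left, grad_right)]

a = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
# Upstream gradient, non-zero only below the diagonal.
g = [[0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [-0.2, 0.7, 0.0]]
grad = interaction_backward(a, g)

# Central finite-difference check of one entry of the gradient.
eps = 1e-6
a_plus = [row[:] for row in a]
a_plus[1][0] += eps
a_minus = [row[:] for row in a]
a_minus[1][0] -= eps
numeric = (interaction_loss(a_plus, g) - interaction_loss(a_minus, g)) / (2 * eps)
print(grad[1][0], numeric)
```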

Embedding (bprop)

Kernels 166 through 711 correspond to the backward propagation through the 26 Embedding layers.

Idx Direction Layer Op Params GPU Kernel
166 bprop Embedding_25 embedding I=(32768), E=(102500,128) legacy::elementwise_kernel
167 bprop Embedding_25 embedding I=(32768), E=(102500,128) legacy::elementwise_kernel
168 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
169 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
170 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
171 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
172 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
173 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
174 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
175 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
176 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
177 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
178 bprop Embedding_25 embedding I=(32768), E=(102500,128) modern::elementwise_kernel
179 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
180 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
181 bprop Embedding_25 embedding I=(32768), E=(102500,128) krn_partials_per_segment
182 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
183 bprop Embedding_25 embedding I=(32768), E=(102500,128) thrust::cuda_cub::core::_kernel_agent
184 bprop Embedding_25 embedding I=(32768), E=(102500,128) krn_partial_segment_offset
185 bprop Embedding_25 embedding I=(32768), E=(102500,128) compute_grad_weight
186 bprop Embedding_25 embedding I=(32768), E=(102500,128) sum_and_scatter
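
The 21 kernels per embedding (26 x 21 = 546 kernels, i.e. 166 through 711) look like a sort of the indices followed by a segmented reduction, judging by the thrust kernels and names such as krn_partials_per_segment and sum_and_scatter. That reading is my assumption. The sketch below shows the underlying idea in plain Python: gradient rows that share an index must be summed into the same row of the weight gradient.

```python
from collections import defaultdict

def embedding_backward(indices, grad_output, vocab_size):
    """Dense weight gradient for an embedding lookup.

    Rows of grad_output that share an index are summed, because the same
    table row was read more than once in the forward pass.
    """
    dim = len(grad_output[0])
    # Sort positions by index so duplicates become contiguous segments.
    order = sorted(range(len(indices)), key=lambda p: indices[p])
    segments = defaultdict(list)
    for pos in order:
        segments[indices[pos]].append(pos)
    # Reduce each segment and scatter the sum into its table row.
    grad_weight = [[0.0] * dim for _ in range(vocab_size)]
    for row, positions in segments.items():
        for pos in positions:
            for d in range(dim):
                grad_weight[row][d] += grad_output[pos][d]
    return grad_weight

indices = [2, 0, 2, 3, 0, 2]
grad_output = [[1.0, 1.0], [2.0, 0.0], [0.5, 0.5],
               [4.0, 4.0], [1.0, 3.0], [0.5, 0.5]]
grad_weight = embedding_backward(indices, grad_output, vocab_size=5)
print(grad_weight)
```

Sorting first means each output row is written by one reduction instead of many colliding updates, which is a common motivation for this pattern on GPUs.
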
Idx Direction Layer Op Params GPU Kernel
187 bprop Embedding_24 embedding I=(32768), E=(102400,128) legacy::elementwise_kernel
188 bprop Embedding_24 embedding I=(32768), E=(102400,128) legacy::elementwise_kernel
189 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
190 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
191 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
192 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
193 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
194 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
195 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
196 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
197 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
198 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
199 bprop Embedding_24 embedding I=(32768), E=(102400,128) modern::elementwise_kernel
200 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
201 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
202 bprop Embedding_24 embedding I=(32768), E=(102400,128) krn_partials_per_segment
203 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
204 bprop Embedding_24 embedding I=(32768), E=(102400,128) thrust::cuda_cub::core::_kernel_agent
205 bprop Embedding_24 embedding I=(32768), E=(102400,128) krn_partial_segment_offset
206 bprop Embedding_24 embedding I=(32768), E=(102400,128) compute_grad_weight
207 bprop Embedding_24 embedding I=(32768), E=(102400,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
208 bprop Embedding_23 embedding I=(32768), E=(102300,128) legacy::elementwise_kernel
209 bprop Embedding_23 embedding I=(32768), E=(102300,128) legacy::elementwise_kernel
210 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
211 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
212 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
213 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
214 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
215 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
216 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
217 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
218 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
219 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
220 bprop Embedding_23 embedding I=(32768), E=(102300,128) modern::elementwise_kernel
221 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
222 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
223 bprop Embedding_23 embedding I=(32768), E=(102300,128) krn_partials_per_segment
224 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
225 bprop Embedding_23 embedding I=(32768), E=(102300,128) thrust::cuda_cub::core::_kernel_agent
226 bprop Embedding_23 embedding I=(32768), E=(102300,128) krn_partial_segment_offset
227 bprop Embedding_23 embedding I=(32768), E=(102300,128) compute_grad_weight
228 bprop Embedding_23 embedding I=(32768), E=(102300,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
229 bprop Embedding_22 embedding I=(32768), E=(102200,128) legacy::elementwise_kernel
230 bprop Embedding_22 embedding I=(32768), E=(102200,128) legacy::elementwise_kernel
231 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
232 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
233 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
234 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
235 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
236 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
237 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
238 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
239 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
240 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
241 bprop Embedding_22 embedding I=(32768), E=(102200,128) modern::elementwise_kernel
242 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
243 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
244 bprop Embedding_22 embedding I=(32768), E=(102200,128) krn_partials_per_segment
245 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
246 bprop Embedding_22 embedding I=(32768), E=(102200,128) thrust::cuda_cub::core::_kernel_agent
247 bprop Embedding_22 embedding I=(32768), E=(102200,128) krn_partial_segment_offset
248 bprop Embedding_22 embedding I=(32768), E=(102200,128) compute_grad_weight
249 bprop Embedding_22 embedding I=(32768), E=(102200,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
250 bprop Embedding_21 embedding I=(32768), E=(102100,128) legacy::elementwise_kernel
251 bprop Embedding_21 embedding I=(32768), E=(102100,128) legacy::elementwise_kernel
252 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
253 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
254 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
255 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
256 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
257 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
258 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
259 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
260 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
261 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
262 bprop Embedding_21 embedding I=(32768), E=(102100,128) modern::elementwise_kernel
263 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
264 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
265 bprop Embedding_21 embedding I=(32768), E=(102100,128) krn_partials_per_segment
266 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
267 bprop Embedding_21 embedding I=(32768), E=(102100,128) thrust::cuda_cub::core::_kernel_agent
268 bprop Embedding_21 embedding I=(32768), E=(102100,128) krn_partial_segment_offset
269 bprop Embedding_21 embedding I=(32768), E=(102100,128) compute_grad_weight
270 bprop Embedding_21 embedding I=(32768), E=(102100,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
271 bprop Embedding_20 embedding I=(32768), E=(102000,128) legacy::elementwise_kernel
272 bprop Embedding_20 embedding I=(32768), E=(102000,128) legacy::elementwise_kernel
273 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
274 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
275 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
276 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
277 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
278 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
279 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
280 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
281 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
282 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
283 bprop Embedding_20 embedding I=(32768), E=(102000,128) modern::elementwise_kernel
284 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
285 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
286 bprop Embedding_20 embedding I=(32768), E=(102000,128) krn_partials_per_segment
287 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
288 bprop Embedding_20 embedding I=(32768), E=(102000,128) thrust::cuda_cub::core::_kernel_agent
289 bprop Embedding_20 embedding I=(32768), E=(102000,128) krn_partial_segment_offset
290 bprop Embedding_20 embedding I=(32768), E=(102000,128) compute_grad_weight
291 bprop Embedding_20 embedding I=(32768), E=(102000,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
292 bprop Embedding_19 embedding I=(32768), E=(101900,128) legacy::elementwise_kernel
293 bprop Embedding_19 embedding I=(32768), E=(101900,128) legacy::elementwise_kernel
294 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
295 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
296 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
297 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
298 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
299 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
300 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
301 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
302 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
303 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
304 bprop Embedding_19 embedding I=(32768), E=(101900,128) modern::elementwise_kernel
305 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
306 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
307 bprop Embedding_19 embedding I=(32768), E=(101900,128) krn_partials_per_segment
308 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
309 bprop Embedding_19 embedding I=(32768), E=(101900,128) thrust::cuda_cub::core::_kernel_agent
310 bprop Embedding_19 embedding I=(32768), E=(101900,128) krn_partial_segment_offset
311 bprop Embedding_19 embedding I=(32768), E=(101900,128) compute_grad_weight
312 bprop Embedding_19 embedding I=(32768), E=(101900,128) sum_and_scatter
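The rows in this part of the profile follow a regular pattern: each embedding's backward pass accounts for 21 consecutive kernel indices, and the table shape tracks the `[100000 + 100*x for x in range(26)]` vocabulary sizes. A minimal sketch of that bookkeeping follows. The helper names are hypothetical, and it assumes the 21-kernel pattern seen for Embedding_19 (rows 292 to 312) also holds for the neighbouring embeddings. It is only a way to navigate the tables, not part of PyProf.

```python
# Hypothetical helpers for navigating the bprop embedding tables.
# Assumption: every embedding's backward pass launches 21 kernels, as seen
# for Embedding_19 (rows 292..312), and layers are visited in descending order.
EMBEDDING_DIM = 128
KERNELS_PER_EMBEDDING_BPROP = 21
ANCHOR_LAYER = 19        # Embedding_19 ...
ANCHOR_FIRST_IDX = 292   # ... starts at kernel index 292


def embedding_shape(layer):
    """Shape E of the synthetic embedding table for a given layer number."""
    return (100000 + 100 * layer, EMBEDDING_DIM)


def bprop_kernel_range(layer):
    """First and last kernel index (inclusive) of a layer's backward pass."""
    first = ANCHOR_FIRST_IDX + KERNELS_PER_EMBEDDING_BPROP * (ANCHOR_LAYER - layer)
    return first, first + KERNELS_PER_EMBEDDING_BPROP - 1


def layer_for_kernel(idx):
    """Embedding layer number that owns a given bprop kernel index."""
    offset = idx - ANCHOR_FIRST_IDX
    return ANCHOR_LAYER - offset // KERNELS_PER_EMBEDDING_BPROP


if __name__ == "__main__":
    for layer in (20, 19, 13, 2):
        print(layer, embedding_shape(layer), bprop_kernel_range(layer))
```

For example, `layer_for_kernel(438)` points at Embedding_13, whose rows end with `sum_and_scatter` at index 438.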
Idx Direction Layer Op Params GPU Kernel
313 bprop Embedding_18 embedding I=(32768), E=(101800,128) legacy::elementwise_kernel
314 bprop Embedding_18 embedding I=(32768), E=(101800,128) legacy::elementwise_kernel
315 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
316 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
317 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
318 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
319 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
320 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
321 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
322 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
323 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
324 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
325 bprop Embedding_18 embedding I=(32768), E=(101800,128) modern::elementwise_kernel
326 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
327 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
328 bprop Embedding_18 embedding I=(32768), E=(101800,128) krn_partials_per_segment
329 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
330 bprop Embedding_18 embedding I=(32768), E=(101800,128) thrust::cuda_cub::core::_kernel_agent
331 bprop Embedding_18 embedding I=(32768), E=(101800,128) krn_partial_segment_offset
332 bprop Embedding_18 embedding I=(32768), E=(101800,128) compute_grad_weight
333 bprop Embedding_18 embedding I=(32768), E=(101800,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
334 bprop Embedding_17 embedding I=(32768), E=(101700,128) legacy::elementwise_kernel
335 bprop Embedding_17 embedding I=(32768), E=(101700,128) legacy::elementwise_kernel
336 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
337 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
338 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
339 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
340 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
341 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
342 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
343 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
344 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
345 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
346 bprop Embedding_17 embedding I=(32768), E=(101700,128) modern::elementwise_kernel
347 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
348 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
349 bprop Embedding_17 embedding I=(32768), E=(101700,128) krn_partials_per_segment
350 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
351 bprop Embedding_17 embedding I=(32768), E=(101700,128) thrust::cuda_cub::core::_kernel_agent
352 bprop Embedding_17 embedding I=(32768), E=(101700,128) krn_partial_segment_offset
353 bprop Embedding_17 embedding I=(32768), E=(101700,128) compute_grad_weight
354 bprop Embedding_17 embedding I=(32768), E=(101700,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
355 bprop Embedding_16 embedding I=(32768), E=(101600,128) legacy::elementwise_kernel
356 bprop Embedding_16 embedding I=(32768), E=(101600,128) legacy::elementwise_kernel
357 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
358 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
359 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
360 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
361 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
362 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
363 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
364 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
365 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
366 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
367 bprop Embedding_16 embedding I=(32768), E=(101600,128) modern::elementwise_kernel
368 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
369 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
370 bprop Embedding_16 embedding I=(32768), E=(101600,128) krn_partials_per_segment
371 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
372 bprop Embedding_16 embedding I=(32768), E=(101600,128) thrust::cuda_cub::core::_kernel_agent
373 bprop Embedding_16 embedding I=(32768), E=(101600,128) krn_partial_segment_offset
374 bprop Embedding_16 embedding I=(32768), E=(101600,128) compute_grad_weight
375 bprop Embedding_16 embedding I=(32768), E=(101600,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
376 bprop Embedding_15 embedding I=(32768), E=(101500,128) legacy::elementwise_kernel
377 bprop Embedding_15 embedding I=(32768), E=(101500,128) legacy::elementwise_kernel
378 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
379 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
380 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
381 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
382 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
383 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
384 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
385 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
386 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
387 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
388 bprop Embedding_15 embedding I=(32768), E=(101500,128) modern::elementwise_kernel
389 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
390 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
391 bprop Embedding_15 embedding I=(32768), E=(101500,128) krn_partials_per_segment
392 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
393 bprop Embedding_15 embedding I=(32768), E=(101500,128) thrust::cuda_cub::core::_kernel_agent
394 bprop Embedding_15 embedding I=(32768), E=(101500,128) krn_partial_segment_offset
395 bprop Embedding_15 embedding I=(32768), E=(101500,128) compute_grad_weight
396 bprop Embedding_15 embedding I=(32768), E=(101500,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
397 bprop Embedding_14 embedding I=(32768), E=(101400,128) legacy::elementwise_kernel
398 bprop Embedding_14 embedding I=(32768), E=(101400,128) legacy::elementwise_kernel
399 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
400 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
401 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
402 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
403 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
404 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
405 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
406 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
407 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
408 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
409 bprop Embedding_14 embedding I=(32768), E=(101400,128) modern::elementwise_kernel
410 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
411 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
412 bprop Embedding_14 embedding I=(32768), E=(101400,128) krn_partials_per_segment
413 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
414 bprop Embedding_14 embedding I=(32768), E=(101400,128) thrust::cuda_cub::core::_kernel_agent
415 bprop Embedding_14 embedding I=(32768), E=(101400,128) krn_partial_segment_offset
416 bprop Embedding_14 embedding I=(32768), E=(101400,128) compute_grad_weight
417 bprop Embedding_14 embedding I=(32768), E=(101400,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
418 bprop Embedding_13 embedding I=(32768), E=(101300,128) legacy::elementwise_kernel
419 bprop Embedding_13 embedding I=(32768), E=(101300,128) legacy::elementwise_kernel
420 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
421 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
422 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
423 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
424 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
425 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
426 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
427 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
428 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
429 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
430 bprop Embedding_13 embedding I=(32768), E=(101300,128) modern::elementwise_kernel
431 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
432 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
433 bprop Embedding_13 embedding I=(32768), E=(101300,128) krn_partials_per_segment
434 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
435 bprop Embedding_13 embedding I=(32768), E=(101300,128) thrust::cuda_cub::core::_kernel_agent
436 bprop Embedding_13 embedding I=(32768), E=(101300,128) krn_partial_segment_offset
437 bprop Embedding_13 embedding I=(32768), E=(101300,128) compute_grad_weight
438 bprop Embedding_13 embedding I=(32768), E=(101300,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
439 bprop Embedding_12 embedding I=(32768), E=(101200,128) legacy::elementwise_kernel
440 bprop Embedding_12 embedding I=(32768), E=(101200,128) legacy::elementwise_kernel
441 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
442 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
443 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
444 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
445 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
446 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
447 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
448 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
449 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
450 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
451 bprop Embedding_12 embedding I=(32768), E=(101200,128) modern::elementwise_kernel
452 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
453 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
454 bprop Embedding_12 embedding I=(32768), E=(101200,128) krn_partials_per_segment
455 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
456 bprop Embedding_12 embedding I=(32768), E=(101200,128) thrust::cuda_cub::core::_kernel_agent
457 bprop Embedding_12 embedding I=(32768), E=(101200,128) krn_partial_segment_offset
458 bprop Embedding_12 embedding I=(32768), E=(101200,128) compute_grad_weight
459 bprop Embedding_12 embedding I=(32768), E=(101200,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
460 bprop Embedding_11 embedding I=(32768), E=(101100,128) legacy::elementwise_kernel
461 bprop Embedding_11 embedding I=(32768), E=(101100,128) legacy::elementwise_kernel
462 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
463 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
464 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
465 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
466 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
467 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
468 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
469 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
470 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
471 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
472 bprop Embedding_11 embedding I=(32768), E=(101100,128) modern::elementwise_kernel
473 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
474 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
475 bprop Embedding_11 embedding I=(32768), E=(101100,128) krn_partials_per_segment
476 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
477 bprop Embedding_11 embedding I=(32768), E=(101100,128) thrust::cuda_cub::core::_kernel_agent
478 bprop Embedding_11 embedding I=(32768), E=(101100,128) krn_partial_segment_offset
479 bprop Embedding_11 embedding I=(32768), E=(101100,128) compute_grad_weight
480 bprop Embedding_11 embedding I=(32768), E=(101100,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
481 bprop Embedding_10 embedding I=(32768), E=(101000,128) legacy::elementwise_kernel
482 bprop Embedding_10 embedding I=(32768), E=(101000,128) legacy::elementwise_kernel
483 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
484 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
485 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
486 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
487 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
488 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
489 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
490 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
491 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
492 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
493 bprop Embedding_10 embedding I=(32768), E=(101000,128) modern::elementwise_kernel
494 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
495 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
496 bprop Embedding_10 embedding I=(32768), E=(101000,128) krn_partials_per_segment
497 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
498 bprop Embedding_10 embedding I=(32768), E=(101000,128) thrust::cuda_cub::core::_kernel_agent
499 bprop Embedding_10 embedding I=(32768), E=(101000,128) krn_partial_segment_offset
500 bprop Embedding_10 embedding I=(32768), E=(101000,128) compute_grad_weight
501 bprop Embedding_10 embedding I=(32768), E=(101000,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
502 bprop Embedding_9 embedding I=(32768), E=(100900,128) legacy::elementwise_kernel
503 bprop Embedding_9 embedding I=(32768), E=(100900,128) legacy::elementwise_kernel
504 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
505 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
506 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
507 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
508 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
509 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
510 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
511 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
512 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
513 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
514 bprop Embedding_9 embedding I=(32768), E=(100900,128) modern::elementwise_kernel
515 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
516 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
517 bprop Embedding_9 embedding I=(32768), E=(100900,128) krn_partials_per_segment
518 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
519 bprop Embedding_9 embedding I=(32768), E=(100900,128) thrust::cuda_cub::core::_kernel_agent
520 bprop Embedding_9 embedding I=(32768), E=(100900,128) krn_partial_segment_offset
521 bprop Embedding_9 embedding I=(32768), E=(100900,128) compute_grad_weight
522 bprop Embedding_9 embedding I=(32768), E=(100900,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
523 bprop Embedding_8 embedding I=(32768), E=(100800,128) legacy::elementwise_kernel
524 bprop Embedding_8 embedding I=(32768), E=(100800,128) legacy::elementwise_kernel
525 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
526 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
527 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
528 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
529 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
530 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
531 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
532 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
533 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
534 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
535 bprop Embedding_8 embedding I=(32768), E=(100800,128) modern::elementwise_kernel
536 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
537 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
538 bprop Embedding_8 embedding I=(32768), E=(100800,128) krn_partials_per_segment
539 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
540 bprop Embedding_8 embedding I=(32768), E=(100800,128) thrust::cuda_cub::core::_kernel_agent
541 bprop Embedding_8 embedding I=(32768), E=(100800,128) krn_partial_segment_offset
542 bprop Embedding_8 embedding I=(32768), E=(100800,128) compute_grad_weight
543 bprop Embedding_8 embedding I=(32768), E=(100800,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
544 bprop Embedding_7 embedding I=(32768), E=(100700,128) legacy::elementwise_kernel
545 bprop Embedding_7 embedding I=(32768), E=(100700,128) legacy::elementwise_kernel
546 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
547 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
548 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
549 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
550 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
551 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
552 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
553 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
554 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
555 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
556 bprop Embedding_7 embedding I=(32768), E=(100700,128) modern::elementwise_kernel
557 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
558 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
559 bprop Embedding_7 embedding I=(32768), E=(100700,128) krn_partials_per_segment
560 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
561 bprop Embedding_7 embedding I=(32768), E=(100700,128) thrust::cuda_cub::core::_kernel_agent
562 bprop Embedding_7 embedding I=(32768), E=(100700,128) krn_partial_segment_offset
563 bprop Embedding_7 embedding I=(32768), E=(100700,128) compute_grad_weight
564 bprop Embedding_7 embedding I=(32768), E=(100700,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
565 bprop Embedding_6 embedding I=(32768), E=(100600,128) legacy::elementwise_kernel
566 bprop Embedding_6 embedding I=(32768), E=(100600,128) legacy::elementwise_kernel
567 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
568 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
569 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
570 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
571 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
572 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
573 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
574 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
575 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
576 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
577 bprop Embedding_6 embedding I=(32768), E=(100600,128) modern::elementwise_kernel
578 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
579 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
580 bprop Embedding_6 embedding I=(32768), E=(100600,128) krn_partials_per_segment
581 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
582 bprop Embedding_6 embedding I=(32768), E=(100600,128) thrust::cuda_cub::core::_kernel_agent
583 bprop Embedding_6 embedding I=(32768), E=(100600,128) krn_partial_segment_offset
584 bprop Embedding_6 embedding I=(32768), E=(100600,128) compute_grad_weight
585 bprop Embedding_6 embedding I=(32768), E=(100600,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
586 bprop Embedding_5 embedding I=(32768), E=(100500,128) legacy::elementwise_kernel
587 bprop Embedding_5 embedding I=(32768), E=(100500,128) legacy::elementwise_kernel
588 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
589 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
590 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
591 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
592 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
593 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
594 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
595 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
596 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
597 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
598 bprop Embedding_5 embedding I=(32768), E=(100500,128) modern::elementwise_kernel
599 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
600 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
601 bprop Embedding_5 embedding I=(32768), E=(100500,128) krn_partials_per_segment
602 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
603 bprop Embedding_5 embedding I=(32768), E=(100500,128) thrust::cuda_cub::core::_kernel_agent
604 bprop Embedding_5 embedding I=(32768), E=(100500,128) krn_partial_segment_offset
605 bprop Embedding_5 embedding I=(32768), E=(100500,128) compute_grad_weight
606 bprop Embedding_5 embedding I=(32768), E=(100500,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
607 bprop Embedding_4 embedding I=(32768), E=(100400,128) legacy::elementwise_kernel
608 bprop Embedding_4 embedding I=(32768), E=(100400,128) legacy::elementwise_kernel
609 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
610 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
611 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
612 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
613 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
614 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
615 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
616 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
617 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
618 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
619 bprop Embedding_4 embedding I=(32768), E=(100400,128) modern::elementwise_kernel
620 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
621 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
622 bprop Embedding_4 embedding I=(32768), E=(100400,128) krn_partials_per_segment
623 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
624 bprop Embedding_4 embedding I=(32768), E=(100400,128) thrust::cuda_cub::core::_kernel_agent
625 bprop Embedding_4 embedding I=(32768), E=(100400,128) krn_partial_segment_offset
626 bprop Embedding_4 embedding I=(32768), E=(100400,128) compute_grad_weight
627 bprop Embedding_4 embedding I=(32768), E=(100400,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
628 bprop Embedding_3 embedding I=(32768), E=(100300,128) legacy::elementwise_kernel
629 bprop Embedding_3 embedding I=(32768), E=(100300,128) legacy::elementwise_kernel
630 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
631 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
632 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
633 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
634 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
635 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
636 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
637 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
638 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
639 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
640 bprop Embedding_3 embedding I=(32768), E=(100300,128) modern::elementwise_kernel
641 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
642 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
643 bprop Embedding_3 embedding I=(32768), E=(100300,128) krn_partials_per_segment
644 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
645 bprop Embedding_3 embedding I=(32768), E=(100300,128) thrust::cuda_cub::core::_kernel_agent
646 bprop Embedding_3 embedding I=(32768), E=(100300,128) krn_partial_segment_offset
647 bprop Embedding_3 embedding I=(32768), E=(100300,128) compute_grad_weight
648 bprop Embedding_3 embedding I=(32768), E=(100300,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
649 bprop Embedding_2 embedding I=(32768), E=(100200,128) legacy::elementwise_kernel
650 bprop Embedding_2 embedding I=(32768), E=(100200,128) legacy::elementwise_kernel
651 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
652 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
653 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
654 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
655 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
656 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
657 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
658 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
659 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
660 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
661 bprop Embedding_2 embedding I=(32768), E=(100200,128) modern::elementwise_kernel
662 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
663 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
664 bprop Embedding_2 embedding I=(32768), E=(100200,128) krn_partials_per_segment
665 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
666 bprop Embedding_2 embedding I=(32768), E=(100200,128) thrust::cuda_cub::core::_kernel_agent
667 bprop Embedding_2 embedding I=(32768), E=(100200,128) krn_partial_segment_offset
668 bprop Embedding_2 embedding I=(32768), E=(100200,128) compute_grad_weight
669 bprop Embedding_2 embedding I=(32768), E=(100200,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
670 bprop Embedding_1 embedding I=(32768), E=(100100,128) legacy::elementwise_kernel
671 bprop Embedding_1 embedding I=(32768), E=(100100,128) legacy::elementwise_kernel
672 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
673 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
674 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
675 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
676 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
677 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
678 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
679 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
680 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
681 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
682 bprop Embedding_1 embedding I=(32768), E=(100100,128) modern::elementwise_kernel
683 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
684 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
685 bprop Embedding_1 embedding I=(32768), E=(100100,128) krn_partials_per_segment
686 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
687 bprop Embedding_1 embedding I=(32768), E=(100100,128) thrust::cuda_cub::core::_kernel_agent
688 bprop Embedding_1 embedding I=(32768), E=(100100,128) krn_partial_segment_offset
689 bprop Embedding_1 embedding I=(32768), E=(100100,128) compute_grad_weight
690 bprop Embedding_1 embedding I=(32768), E=(100100,128) sum_and_scatter
Idx Direction Layer Op Params GPU Kernel
691 bprop Embedding_0 embedding I=(32768), E=(100000,128) legacy::elementwise_kernel
692 bprop Embedding_0 embedding I=(32768), E=(100000,128) legacy::elementwise_kernel
693 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
694 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
695 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
696 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
697 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
698 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
699 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
700 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
701 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
702 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
703 bprop Embedding_0 embedding I=(32768), E=(100000,128) modern::elementwise_kernel
704 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
705 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
706 bprop Embedding_0 embedding I=(32768), E=(100000,128) krn_partials_per_segment
707 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
708 bprop Embedding_0 embedding I=(32768), E=(100000,128) thrust::cuda_cub::core::_kernel_agent
709 bprop Embedding_0 embedding I=(32768), E=(100000,128) krn_partial_segment_offset
710 bprop Embedding_0 embedding I=(32768), E=(100000,128) compute_grad_weight
711 bprop Embedding_0 embedding I=(32768), E=(100000,128) sum_and_scatter

Bottom MLP (bprop)

Kernels 712 through 726 correspond to backward propagation through the Bottom MLP. Kernel 713 is the backward pass through the ReLU. Kernels 714-716 calculate the data gradient and weight gradient of the last Linear layer, and kernel 717 calculates its bias gradient. Likewise, kernels 718-722 and 723-726 correspond to backward propagation through the second and first Linear layers of the Bottom MLP, respectively.

Idx Direction Layer Op Params GPU Kernel
712 bprop - to na=na legacy::elementwise_kernel
713 bprop Bottom_MLP relu T=(32768,128) modern::elementwise_kernel
714 bprop Bottom_MLP linear M=256,N=32768,K=128 turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn
715 bprop Bottom_MLP linear M=256,N=128,K=32768 turing_fp16_s1688gemm_fp16_256x128_ldg8_f2f_nt
716 bprop Bottom_MLP linear X=(32768,256),W=(128,256) splitKreduce_kernel
717 bprop - sum na=na reduce_kernel
718 bprop Bottom_MLP relu T=(32768,256) modern::elementwise_kernel
719 bprop Bottom_MLP linear M=512,N=32768,K=256 turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn
720 bprop Bottom_MLP linear M=512,N=256,K=32768 turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt
721 bprop Bottom_MLP linear X=(32768,512),W=(256,512) splitKreduce_kernel
722 bprop - sum na=na reduce_kernel
723 bprop Bottom_MLP relu T=(32768,512) modern::elementwise_kernel
724 bprop Bottom_MLP linear M=13,N=32768,K=512 volta_sgemm_fp16_32x32_sliced1x4_nt
725 bprop Bottom_MLP linear M=13,N=512,K=32768 splitKreduce_kernel
726 bprop - sum na=na reduce_kernel
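
The GEMM sizes in the table above follow from the shapes of the Linear layer. The sketch below is an illustration rather than the profiled code: it derives (M, N, K) for the data-gradient and weight-gradient GEMMs of y = x @ W.T + b. The function name `linear_bprop_gemms` is hypothetical, and the column-major (cuBLAS-style) reading of M, N and K is an assumption chosen because it reproduces the tabulated values.

```python
# Sketch: GEMM dimensions for the backward pass of a Linear layer, y = x @ W.T + b.
# Assumption (not stated in the post): M, N, K are reported for a column-major
# GEMM, as in cuBLAS, so they describe the transposed row-major product.

def linear_bprop_gemms(batch, in_features, out_features):
    """Return ((M, N, K) of dgrad, (M, N, K) of wgrad, shape of the bias gradient)."""
    # Data gradient: dX = dY @ W, i.e. (batch, out) x (out, in) -> (batch, in)
    dgrad = {"M": in_features, "N": batch, "K": out_features}
    # Weight gradient: dW = dY.T @ X, i.e. (out, batch) x (batch, in) -> (out, in)
    wgrad = {"M": in_features, "N": out_features, "K": batch}
    # Bias gradient: dY summed over the batch dimension (a reduction, not a GEMM)
    bgrad_shape = (out_features,)
    return dgrad, wgrad, bgrad_shape


# Last Linear layer of the Bottom MLP: X=(32768,256), W=(128,256)
dgrad, wgrad, bgrad_shape = linear_bprop_gemms(batch=32768, in_features=256, out_features=128)
print(dgrad)  # compare with kernel 714
print(wgrad)  # compare with kernel 715
```

Under this assumption, the last layer gives M=256,N=32768,K=128 for the data gradient and M=256,N=128,K=32768 for the weight gradient, matching kernels 714 and 715. Calling it with in_features=512 and out_features=256 gives the sizes listed for kernels 719 and 720.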

At the end of backward propagation, we have all the weight gradients. Kernels 727 through 788 correspond to reversing the loss scaling and applying the SGD optimizer to all parameters.

Idx Direction Layer Op Params GPU Kernel
727 bprop - zero_ T=[(1,)] modern::elementwise_kernel
728 bprop - multi_tensor_scale T=[(512,13), (512), (256,512), (256), (128,256), (128), (1024,480), (1024), (1024,1024), (1024), (512,1024), (512), (256,512), (256), (1,256), (1), (512,13), (512), (256,512), (256), (128,256), (128), (1024,480), (1024), (1024,1024), (1024), (512,1024), (512), (256,512), (256), (1,256), (1,)] multi_tensor_apply_kernel
729 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
730 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
731 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
732 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
733 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
734 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
735 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
736 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
737 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
738 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
739 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
740 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
741 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
742 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
743 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
744 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
745 bprop - multi_tensor_axpby T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] multi_tensor_apply_kernel
Idx Direction Layer Op Params GPU Kernel
746 bprop - add_ T=[(100000,128), (100000,128)] modern::elementwise_kernel
747 bprop - add_ T=[(100100,128), (100100,128)] modern::elementwise_kernel
748 bprop - add_ T=[(100200,128), (100200,128)] modern::elementwise_kernel
749 bprop - add_ T=[(100300,128), (100300,128)] modern::elementwise_kernel
750 bprop - add_ T=[(100400,128), (100400,128)] modern::elementwise_kernel
751 bprop - add_ T=[(100500,128), (100500,128)] modern::elementwise_kernel
752 bprop - add_ T=[(100600,128), (100600,128)] modern::elementwise_kernel
753 bprop - add_ T=[(100700,128), (100700,128)] modern::elementwise_kernel
754 bprop - add_ T=[(100800,128), (100800,128)] modern::elementwise_kernel
755 bprop - add_ T=[(100900,128), (100900,128)] modern::elementwise_kernel
756 bprop - add_ T=[(101000,128), (101000,128)] modern::elementwise_kernel
757 bprop - add_ T=[(101100,128), (101100,128)] modern::elementwise_kernel
758 bprop - add_ T=[(101200,128), (101200,128)] modern::elementwise_kernel
759 bprop - add_ T=[(101300,128), (101300,128)] modern::elementwise_kernel
760 bprop - add_ T=[(101400,128), (101400,128)] modern::elementwise_kernel
761 bprop - add_ T=[(101500,128), (101500,128)] modern::elementwise_kernel
762 bprop - add_ T=[(101600,128), (101600,128)] modern::elementwise_kernel
763 bprop - add_ T=[(101700,128), (101700,128)] modern::elementwise_kernel
764 bprop - add_ T=[(101800,128), (101800,128)] modern::elementwise_kernel
765 bprop - add_ T=[(101900,128), (101900,128)] modern::elementwise_kernel
766 bprop - add_ T=[(102000,128), (102000,128)] modern::elementwise_kernel
767 bprop - add_ T=[(102100,128), (102100,128)] modern::elementwise_kernel
768 bprop - add_ T=[(102200,128), (102200,128)] modern::elementwise_kernel
769 bprop - add_ T=[(102300,128), (102300,128)] modern::elementwise_kernel
770 bprop - add_ T=[(102400,128), (102400,128)] modern::elementwise_kernel
771 bprop - add_ T=[(102500,128), (102500,128)] modern::elementwise_kernel
772 bprop - add_ T=[(512,13), (512,13)] modern::elementwise_kernel
773 bprop - add_ T=[(512), (512,)] modern::elementwise_kernel
774 bprop - add_ T=[(256,512), (256,512)] modern::elementwise_kernel
775 bprop - add_ T=[(256), (256,)] modern::elementwise_kernel
776 bprop - add_ T=[(128,256), (128,256)] modern::elementwise_kernel
777 bprop - add_ T=[(128), (128,)] modern::elementwise_kernel
778 bprop - add_ T=[(1024,480), (1024,480)] modern::elementwise_kernel
779 bprop - add_ T=[(1024), (1024,)] modern::elementwise_kernel
780 bprop - add_ T=[(1024,1024), (1024,1024)] modern::elementwise_kernel
781 bprop - add_ T=[(1024), (1024,)] modern::elementwise_kernel
782 bprop - add_ T=[(512,1024), (512,1024)] modern::elementwise_kernel
783 bprop - add_ T=[(512), (512,)] modern::elementwise_kernel
784 bprop - add_ T=[(256,512), (256,512)] modern::elementwise_kernel
785 bprop - add_ T=[(256), (256,)] modern::elementwise_kernel
786 bprop - add_ T=[(1,256), (1,256)] modern::elementwise_kernel
787 bprop - add_ T=[(1), (1,)] modern::elementwise_kernel
788 bprop - multi_tensor_scale T=[(512,13), (512), (256,512), (256), (128,256), (128), (1024,480), (1024), (1024,1024), (1024), (512,1024), (512), (256,512), (256), (1,256), (1), (512,13), (512), (256,512), (256), (128,256), (128), (1024,480), (1024), (1024,1024), (1024), (512,1024), (512), (256,512), (256), (1,256), (1,)] multi_tensor_apply_kernel
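
Conceptually, this final phase does two things: it divides the gradients by the loss scale, and it updates each parameter in place. The following is a minimal sketch of that logic in plain Python, assuming SGD without momentum or weight decay and a fixed loss scale. The function name `unscale_and_sgd_step` and the list-of-lists representation are hypothetical. It does not reproduce the fused multi_tensor_apply_kernel launches seen in the profile, only the arithmetic they are understood to implement.

```python
# Sketch: reverse the loss scaling, then apply a plain SGD update.
# Assumptions (illustrative only): no momentum, no weight decay, a static
# loss scale, and parameters stored as flat Python lists of floats.

def unscale_and_sgd_step(params, scaled_grads, loss_scale, lr):
    """Update every parameter in place: p <- p - lr * (g / loss_scale)."""
    inv_scale = 1.0 / loss_scale
    for p, g in zip(params, scaled_grads):
        for i in range(len(p)):
            grad = g[i] * inv_scale  # undo the loss scaling
            p[i] += -lr * grad       # in-place update, analogous to add_
    return params


# Two toy "parameters" whose gradients were computed with a loss scale of 128
params = [[1.0, 2.0], [0.5]]
scaled_grads = [[128.0, -256.0], [64.0]]
unscale_and_sgd_step(params, scaled_grads, loss_scale=128.0, lr=0.5)
print(params)  # [[0.5, 3.0], [0.25]]
```

In the profile, the update appears as one add_ kernel per parameter tensor (kernels 746-787: 26 embedding tables plus 16 MLP weights and biases), whereas the scaling is batched across many tensors by multi_tensor_apply_kernel.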