The Deep Learning Recommendation Model (DLRM) is a deep neural network for personalization and recommendation developed by Facebook. Because of its importance to large internet companies, it is one of the benchmarks in MLPerf. It is a simple network with 26 Embedding layers, 8 Linear layers and a Batched Matrix Multiply for calculating pairwise interactions. However, a single training step (forward and backward propagation) invokes about 800 GPU CUDA kernels (depending on the batch size). The best way to understand a DL network and its GPU performance is to understand every single CUDA kernel, i.e., which layer of the network invoked the kernel, with what arguments (tensor shapes and datatypes) and in which direction (forward or backward propagation).
In this blog post, I will categorize every kernel used in the training of
DLRM. All the information in the tables below was obtained using Nvidia's
PyTorch profiler, PyProf, on a Turing T4 GPU. The information below
is only a subset of what PyProf provides. The code and instructions
for obtaining a detailed profile are here. Note that different GPUs
will have slightly different kernel names, e.g. `volta_*` as opposed to `turing_*`.
The parameters in the profiled code are as follows. These parameters
were obtained from Nvidia Deep Learning Examples. The only
constraint is that the output width of the Bottom MLP must equal the
embedding size. The vocabulary sizes of the categorical embeddings are
for a synthetic dataset and not the real Criteo Terabyte Dataset.
For the purpose of this article, I modified the shape of the categorical
embeddings from `[100000]*26` to `[100000 + 100*x for x in range(26)]`,
so that it is easier to differentiate and understand the kernels
associated with each of the 26 embeddings, both in forward and backward
propagation. With the Criteo dataset, the categorical embedding tables
become really big. With an embedding size of 128, the embedding tables have
roughly 24 billion parameters (about 96 GB at 4 bytes per parameter), and it
is not possible to store them on a single GPU.
Bottom MLP = [512, 256, 128]
Top MLP = [1024, 1024, 512, 256, 1]
Numerical Features (N) = 13
Categorical Features (M) = 26
Categorical Embeddings = [100000 + 100*x for x in range(26)]
Embedding Size = 128
Batch Size = 32768
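
The memory arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and 4 bytes per parameter (fp32 storage) is an assumption inferred from the 24 billion / 96 GB figures.

```python
# Values taken from the parameter list above; helper names are illustrative only.
EMBEDDING_SIZE = 128
VOCAB_SIZES = [100000 + 100 * x for x in range(26)]


def embedding_parameter_count(vocab_sizes, embedding_size):
    """Total number of weights across all embedding tables."""
    return sum(vocab * embedding_size for vocab in vocab_sizes)


def footprint_bytes(num_params, bytes_per_param=4):
    """Storage in bytes, assuming fp32 (4 bytes per parameter) by default."""
    return num_params * bytes_per_param


# Synthetic tables used in this profile.
synthetic_params = embedding_parameter_count(VOCAB_SIZES, EMBEDDING_SIZE)
synthetic_bytes = footprint_bytes(synthetic_params)
print(synthetic_params, round(synthetic_bytes / 1e9, 2), "GB")

# Criteo-scale tables: ~24 billion parameters, as quoted in the text.
criteo_bytes = footprint_bytes(24_000_000_000)
print(criteo_bytes / 1e9, "GB")
```

Under these assumptions the synthetic tables come to about 1.35 GB, which fits on a T4, while the Criteo-scale tables do not.
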
The tables below show the GPU kernels invoked in 1 training step. For every GPU kernel we show the direction (fprop, bprop), name of the layer, name of the operation, and the input tensor shapes / matrix dimensions for the operation. PyProf provides a lot of additional information for every GPU kernel e.g. grid dimensions, block dimensions, silicon time, datatypes, flops, bytes, tensor core usage and so on.
Kernels 1 through 11 correspond to the forward propagation through the 3 Linear layers in the Bottom MLP. Kernels 2, 3 and 4 correspond to bias, linear and relu operations of the first Linear layer respectively. Kernels 5-7, 8-10 correspond to the second and third Linear layers respectively. Kernels 1 and 11 correspond to data type conversions from fp32 to fp16 and back.
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
1 | fprop | Bottom_MLP | to | T=(32768,13) | legacy::elementwise_kernel |
2 | fprop | Bottom_MLP | bias | M=512,N=32768 | legacy::elementwise_kernel |
3 | fprop | Bottom_MLP | linear | M=512,N=32768,K=13 | volta_fp16_sgemm_fp16_128x128_tn |
4 | fprop | Bottom_MLP | relu | T=(32768,512) | modern::elementwise_kernel |
5 | fprop | Bottom_MLP | bias | M=256,N=32768 | legacy::elementwise_kernel |
6 | fprop | Bottom_MLP | linear | M=256,N=32768,K=512 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_tn |
7 | fprop | Bottom_MLP | relu | T=(32768,256) | modern::elementwise_kernel |
8 | fprop | Bottom_MLP | bias | M=128,N=32768 | legacy::elementwise_kernel |
9 | fprop | Bottom_MLP | linear | M=128,N=32768,K=256 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn |
10 | fprop | Bottom_MLP | relu | T=(32768,128) | modern::elementwise_kernel |
11 | fprop | Bottom_MLP | to | T=(32768,128) | legacy::elementwise_kernel |
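
The `M`, `N`, `K` values in the `linear` rows follow a simple pattern: reading them off the table, `M` is the number of output features, `N` is the batch size and `K` is the number of input features. The sketch below (the function name is hypothetical) reproduces the three Bottom MLP GEMM shapes from the layer sizes.

```python
def forward_gemm_shapes(in_features, layer_sizes, batch_size):
    """Return one (M, N, K) tuple per Linear layer in forward propagation.

    Convention read off the profile table:
    M = output features, N = batch size, K = input features.
    """
    shapes = []
    k = in_features
    for m in layer_sizes:
        shapes.append((m, batch_size, k))
        k = m  # the output of this layer feeds the next one
    return shapes


# Bottom MLP = [512, 256, 128] with 13 numerical features as input.
bottom_mlp = forward_gemm_shapes(13, [512, 256, 128], 32768)
for shape in bottom_mlp:
    print("M=%d,N=%d,K=%d" % shape)
```
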
Kernel 12 corresponds to a data type cast operation on the index tensor. Kernels 13 through 38 correspond to the forward propagation through the 26 Embedding layers. Note the size of the index tensor (the batch size) and the size of each embedding table (vocabulary size x embedding size).
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
12 | fprop | - | to | T=(32768,26) | legacy::elementwise_kernel |
13 | fprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | indexSelectLargeIndex |
14 | fprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | indexSelectLargeIndex |
15 | fprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | indexSelectLargeIndex |
16 | fprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | indexSelectLargeIndex |
17 | fprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | indexSelectLargeIndex |
18 | fprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | indexSelectLargeIndex |
19 | fprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | indexSelectLargeIndex |
20 | fprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | indexSelectLargeIndex |
21 | fprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | indexSelectLargeIndex |
22 | fprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | indexSelectLargeIndex |
23 | fprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | indexSelectLargeIndex |
24 | fprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | indexSelectLargeIndex |
25 | fprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | indexSelectLargeIndex |
26 | fprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | indexSelectLargeIndex |
27 | fprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | indexSelectLargeIndex |
28 | fprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | indexSelectLargeIndex |
29 | fprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | indexSelectLargeIndex |
30 | fprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | indexSelectLargeIndex |
31 | fprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | indexSelectLargeIndex |
32 | fprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | indexSelectLargeIndex |
33 | fprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | indexSelectLargeIndex |
34 | fprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | indexSelectLargeIndex |
35 | fprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | indexSelectLargeIndex |
36 | fprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | indexSelectLargeIndex |
37 | fprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | indexSelectLargeIndex |
38 | fprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | indexSelectLargeIndex |
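
Conceptually, an embedding lookup is a row gather: each index selects one row of the table, which is consistent with the `indexSelectLargeIndex` kernel name. The following pure-Python sketch uses toy sizes in place of `I=(32768)` and `E=(100000,128)`; it illustrates the operation only and says nothing about how the GPU kernel is implemented.

```python
def embedding_lookup(table, indices):
    """Return one row of `table` per entry in `indices` (a row gather)."""
    return [table[i] for i in indices]


# Toy sizes: a 6 x 4 table stands in for E=(100000,128),
# and 4 indices stand in for I=(32768).
vocab, dim = 6, 4
table = [[float(row * 10 + col) for col in range(dim)] for row in range(vocab)]
indices = [5, 0, 5, 2]  # repeated indices are allowed

out = embedding_lookup(table, indices)
print(len(out), len(out[0]))  # output shape is (number of indices, embedding size)
```
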
Kernels 39 through 44 correspond to the pairwise interaction layer. Pairwise interaction is implemented as a dot product of every vector with all other vectors. If there are $N$ vectors (of the same length), then we will have ${N \choose 2} = \frac{N\times(N-1)}{2}$ interactions (scalar outputs). This can be implemented by creating a matrix $A$ containing the $N$ vectors, calculating $A A^T$ and then taking the lower (or upper) triangular matrix, since the output is symmetric. In some sense, this is similar to attention. When this operation is performed for every sample in the batch (32768 in this example), it is implemented as a Batched Matrix Multiplication (BMM).
Kernel 39 concatenates the outputs of the 26 embedding lookups and the Bottom MLP, resulting in 27 vectors. Kernel 40 computes the pairwise interaction of 27 vectors, each of length 128, using a BMM operation. Kernel 41 extracts the lower (or upper) triangular matrix, which gives us ${27 \choose 2} = 351$ values.
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
39 | fprop | Interaction | cat | T=[(32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128), (32768,128)] | CatArrayBatchedCopy |
40 | fprop | Interaction | bmm | B=32768,M=27,N=27,K=128 | volta_sgemm_128x64_tn |
41 | fprop | Interaction | index | na=na | legacy::elementwise_kernel |
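
The interaction for a single sample can be sketched in plain Python: build the Gram matrix $A A^T$ and keep the entries strictly below the diagonal. The function name and the toy vectors are illustrative; in the profile the same computation runs for all 32768 samples at once as a BMM.

```python
from math import comb


def pairwise_interactions(vectors):
    """Dot product of every vector with every other vector.

    Equivalent to computing A @ A^T and keeping the strictly lower triangle.
    """
    n = len(vectors)
    gram = [
        [sum(a * b for a, b in zip(vectors[i], vectors[j])) for j in range(n)]
        for i in range(n)
    ]
    return [gram[i][j] for i in range(n) for j in range(i)]


# Three toy vectors of length 2 stand in for the 27 vectors of length 128.
vectors = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
interactions = pairwise_interactions(vectors)
print(interactions)   # [2.0, 0.0, 3.0]
print(comb(27, 2))    # 351 interactions for 27 vectors
```
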
Kernels 42 through 44 concatenate the output of the Pairwise Interaction layer with the output of the Bottom MLP and pad it with 1 element before feeding it to the Top MLP.
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
42 | fprop | Concat | cat | T=[(32768,128),(32768,351),(32768,1)] | legacy::elementwise_kernel |
43 | fprop | Concat | cat | T=[(32768,128),(32768,351),(32768,1)] | legacy::elementwise_kernel |
44 | fprop | Concat | cat | T=[(32768,128),(32768,351),(32768,1)] | legacy::elementwise_kernel |
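
The width of the tensor fed to the Top MLP follows directly from the shapes in the table: 128 values from the Bottom MLP, 351 interactions and 1 padding element. The sketch below recomputes it (the function name is hypothetical). The text does not say why the padding is added; the result being a multiple of 8 is only an observation, and that this helps fp16 Tensor Core GEMMs is an assumption.

```python
from math import comb


def top_mlp_input_width(embedding_size, num_categorical, pad=1):
    """Width of the concatenated tensor fed to the Top MLP."""
    num_vectors = num_categorical + 1        # 26 embeddings + Bottom MLP output
    interactions = comb(num_vectors, 2)      # pairwise dot products
    return embedding_size + interactions + pad


width = top_mlp_input_width(embedding_size=128, num_categorical=26)
print(width)            # 480, the K of the first Top MLP GEMM
print(width % 8 == 0)   # True: the padded width is a multiple of 8
```
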
Kernels 45 through 60 correspond to the forward propagation through the 5 Linear layers in the Top MLP. Kernels 46, 47 and 48 correspond to bias, linear and relu operations of the first Linear layer respectively. Kernels 49-51, 52-54, 55-57, 58-59 correspond to the second, third, fourth and fifth Linear layers respectively. Kernels 45 and 60 correspond to data type conversions from fp32 to fp16 and back.
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
45 | fprop | Top_MLP | to | T=(32768,480) | legacy::elementwise_kernel |
46 | fprop | Top_MLP | bias | M=1024,N=32768 | legacy::elementwise_kernel |
47 | fprop | Top_MLP | linear | M=1024,N=32768,K=480 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn |
48 | fprop | Top_MLP | relu | T=(32768,1024) | modern::elementwise_kernel |
49 | fprop | Top_MLP | bias | M=1024,N=32768 | legacy::elementwise_kernel |
50 | fprop | Top_MLP | linear | M=1024,N=32768,K=1024 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn |
51 | fprop | Top_MLP | relu | T=(32768,1024) | modern::elementwise_kernel |
52 | fprop | Top_MLP | bias | M=512,N=32768 | legacy::elementwise_kernel |
53 | fprop | Top_MLP | linear | M=512,N=32768,K=1024 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn |
54 | fprop | Top_MLP | relu | T=(32768,512) | modern::elementwise_kernel |
55 | fprop | Top_MLP | bias | M=256,N=32768 | legacy::elementwise_kernel |
56 | fprop | Top_MLP | linear | M=256,N=32768,K=512 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_tn |
57 | fprop | Top_MLP | relu | T=(32768,256) | modern::elementwise_kernel |
58 | fprop | Top_MLP | bias | M=1,N=32768 | legacy::elementwise_kernel |
59 | fprop | Top_MLP | linear | M=1,N=32768,K=256 | gemv2T_kernel_val |
60 | fprop | Top_MLP | to | T=(32768,1) | legacy::elementwise_kernel |
Kernels 61 through 75 compute the BCE loss.
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
61 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
62 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | kernelPointwiseApply1 |
63 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
64 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | legacy::elementwise_kernel |
65 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
66 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
67 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
68 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | kernelPointwiseApply1 |
69 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
70 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
71 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | kernelPointwiseApply1 |
72 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
73 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
74 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
75 | fprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | reduce_kernel |
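
For reference, binary cross entropy with logits is usually evaluated in a numerically stable form, $\max(x, 0) - x z + \log(1 + e^{-|x|})$, averaged over the batch. The sketch below implements that form and a naive sigmoid-based version for comparison. It shows the math only; how the 15 kernels above divide the work is not something the profile tells us, and mean reduction is an assumption.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross entropy computed directly from logits.

    Uses max(x, 0) - x*z + log(1 + exp(-|x|)), which never exponentiates
    a positive number and therefore cannot overflow.
    """
    losses = [
        max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
        for x, z in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def bce_naive(logits, targets):
    """Same loss via an explicit sigmoid; fine for moderate logits only."""
    total = 0.0
    for x, z in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-x))
        total += -(z * math.log(p) + (1.0 - z) * math.log(1.0 - p))
    return total / len(logits)


logits = [-1.5, 0.0, 2.0]
targets = [0.0, 1.0, 1.0]
stable = bce_with_logits(logits, targets)
naive = bce_naive(logits, targets)
print(stable, naive)
```
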
In this profile, we are not accumulating gradients, and hence kernels 76 through 117 zero out the buffers used for storing the gradients during back propagation. Note that there are separate buffers for storing the gradients of the weights, biases and embeddings. Kernels 76-81, 82-91, 92-117 correspond to the Bottom MLP, Top MLP and the Embedding layers respectively.
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
76 | fprop | - | zero | T=[(512,13)] | modern::elementwise_kernel |
77 | fprop | - | zero | T=[(512)] | modern::elementwise_kernel |
78 | fprop | - | zero | T=[(256,512)] | modern::elementwise_kernel |
79 | fprop | - | zero | T=[(256)] | modern::elementwise_kernel |
80 | fprop | - | zero | T=[(128,256)] | modern::elementwise_kernel |
81 | fprop | - | zero | T=[(128)] | modern::elementwise_kernel |
82 | fprop | - | zero | T=[(1024,480)] | modern::elementwise_kernel |
83 | fprop | - | zero | T=[(1024)] | modern::elementwise_kernel |
84 | fprop | - | zero | T=[(1024,1024)] | modern::elementwise_kernel |
85 | fprop | - | zero | T=[(1024)] | modern::elementwise_kernel |
86 | fprop | - | zero | T=[(512,1024)] | modern::elementwise_kernel |
87 | fprop | - | zero | T=[(512)] | modern::elementwise_kernel |
88 | fprop | - | zero | T=[(256,512)] | modern::elementwise_kernel |
89 | fprop | - | zero | T=[(256)] | modern::elementwise_kernel |
90 | fprop | - | zero | T=[(1,256)] | modern::elementwise_kernel |
91 | fprop | - | zero | T=[(1)] | modern::elementwise_kernel |
92 | fprop | - | zero | T=[(100000,128)] | modern::elementwise_kernel |
93 | fprop | - | zero | T=[(100100,128)] | modern::elementwise_kernel |
94 | fprop | - | zero | T=[(100200,128)] | modern::elementwise_kernel |
95 | fprop | - | zero | T=[(100300,128)] | modern::elementwise_kernel |
96 | fprop | - | zero | T=[(100400,128)] | modern::elementwise_kernel |
97 | fprop | - | zero | T=[(100500,128)] | modern::elementwise_kernel |
98 | fprop | - | zero | T=[(100600,128)] | modern::elementwise_kernel |
99 | fprop | - | zero | T=[(100700,128)] | modern::elementwise_kernel |
100 | fprop | - | zero | T=[(100800,128)] | modern::elementwise_kernel |
101 | fprop | - | zero | T=[(100900,128)] | modern::elementwise_kernel |
102 | fprop | - | zero | T=[(101000,128)] | modern::elementwise_kernel |
103 | fprop | - | zero | T=[(101100,128)] | modern::elementwise_kernel |
104 | fprop | - | zero | T=[(101200,128)] | modern::elementwise_kernel |
105 | fprop | - | zero | T=[(101300,128)] | modern::elementwise_kernel |
106 | fprop | - | zero | T=[(101400,128)] | modern::elementwise_kernel |
107 | fprop | - | zero | T=[(101500,128)] | modern::elementwise_kernel |
108 | fprop | - | zero | T=[(101600,128)] | modern::elementwise_kernel |
109 | fprop | - | zero | T=[(101700,128)] | modern::elementwise_kernel |
110 | fprop | - | zero | T=[(101800,128)] | modern::elementwise_kernel |
111 | fprop | - | zero | T=[(101900,128)] | modern::elementwise_kernel |
112 | fprop | - | zero | T=[(102000,128)] | modern::elementwise_kernel |
113 | fprop | - | zero | T=[(102100,128)] | modern::elementwise_kernel |
114 | fprop | - | zero | T=[(102200,128)] | modern::elementwise_kernel |
115 | fprop | - | zero | T=[(102300,128)] | modern::elementwise_kernel |
116 | fprop | - | zero | T=[(102400,128)] | modern::elementwise_kernel |
117 | fprop | - | zero | T=[(102500,128)] | modern::elementwise_kernel |
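
The 42 `zero` kernels map one-to-one onto the gradient buffers of the model: a weight and a bias buffer per Linear layer, plus one buffer per embedding table. The sketch below enumerates those shapes from the model parameters (helper names are hypothetical) and confirms that they fill kernel indices 76 through 117.

```python
def linear_grad_shapes(in_features, layer_sizes):
    """Weight and bias gradient buffer shapes for a stack of Linear layers."""
    shapes = []
    k = in_features
    for m in layer_sizes:
        shapes.append((m, k))  # weight gradient
        shapes.append((m,))    # bias gradient
        k = m
    return shapes


bottom = linear_grad_shapes(13, [512, 256, 128])
top = linear_grad_shapes(480, [1024, 1024, 512, 256, 1])
embeddings = [(100000 + 100 * x, 128) for x in range(26)]

# Same order as the table: Bottom MLP, Top MLP, Embedding layers.
buffers = bottom + top + embeddings
first_kernel = 76
last_kernel = first_kernel + len(buffers) - 1
print(len(bottom), len(top), len(embeddings), last_kernel)
```
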
Kernels 118 through 122 most likely correspond to loss scaling and the beginning of back propagation.
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
118 | fprop | - | __imul__ | T=[(1)] | legacy::elementwise_kernel |
119 | fprop | - | __mul__ | T=[(1)] | legacy::elementwise_kernel |
120 | fprop | - | backward | - | legacy::elementwise_kernel |
121 | bprop | - | __mul__ | T=[(1)] | legacy::elementwise_kernel |
122 | bprop | - | __imul__ | T=[(1)] | legacy::elementwise_kernel |
Kernels 123 through 126 correspond to the backward propagation through the Loss layer.
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
123 | bprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
124 | bprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | legacy::elementwise_kernel |
125 | bprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | legacy::elementwise_kernel |
126 | bprop | - | binary_cross_entropy_with_logits | T=[(32768), (32768)] | modern::elementwise_kernel |
Kernels 127 through 149 correspond to the backward propagation through the Top MLP layer. Kernels 128-130 calculate the data gradient and weight gradient through the last Linear layer. Kernel 131 calculates the bias gradient. Likewise kernels 132-136, 137-140, 141-144, 145-148 correspond to the backward propagation through the fourth, third, second, and the first Linear layer of the Top MLP respectively.
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
127 | bprop | - | to | na=na | legacy::elementwise_kernel |
128 | bprop | Top_MLP | linear | M=256,N=32768,K=1 | gemmk1_kernel |
129 | bprop | Top_MLP | linear | M=256,N=1,K=32768 | gemv2N_kernel |
130 | bprop | Top_MLP | linear | X=(32768,256),W=(1,256) | splitKreduce_kernel |
131 | bprop | - | sum | na=na | reduce_kernel |
132 | bprop | Top_MLP | relu | T=(32768,256) | modern::elementwise_kernel |
133 | bprop | Top_MLP | linear | M=512,N=32768,K=256 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn |
134 | bprop | Top_MLP | linear | M=512,N=256,K=32768 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt |
135 | bprop | Top_MLP | linear | X=(32768,512),W=(256,512) | splitKreduce_kernel |
136 | bprop | - | sum | na=na | reduce_kernel |
137 | bprop | Top_MLP | relu | T=(32768,512) | modern::elementwise_kernel |
138 | bprop | Top_MLP | linear | M=1024,N=32768,K=512 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn |
139 | bprop | Top_MLP | linear | M=1024,N=512,K=32768 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt |
140 | bprop | - | sum | na=na | reduce_kernel |
141 | bprop | Top_MLP | relu | T=(32768,1024) | modern::elementwise_kernel |
142 | bprop | Top_MLP | linear | M=1024,N=32768,K=1024 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn |
143 | bprop | Top_MLP | linear | M=1024,N=1024,K=32768 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt |
144 | bprop | - | sum | na=na | reduce_kernel |
145 | bprop | Top_MLP | relu | T=(32768,1024) | modern::elementwise_kernel |
146 | bprop | Top_MLP | linear | M=480,N=32768,K=1024 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn |
147 | bprop | Top_MLP | linear | M=480,N=1024,K=32768 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt |
148 | bprop | - | sum | na=na | reduce_kernel |
149 | bprop | - | to | na=na | legacy::elementwise_kernel |
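
Each Linear layer contributes two GEMMs in backward propagation. Reading the `M`, `N`, `K` values off the table, the data gradient uses `M` = input features, `N` = batch size, `K` = output features, while the weight gradient uses `M` = input features, `N` = output features, `K` = batch size. The sketch below (the function name is hypothetical) regenerates the shapes in the order they appear, last layer first.

```python
def backward_gemm_shapes(in_features, layer_sizes, batch_size):
    """Return (data_grad, weight_grad) (M, N, K) tuples, last layer first."""
    dims = [in_features] + list(layer_sizes)
    result = []
    for i in range(len(layer_sizes), 0, -1):
        fan_in, fan_out = dims[i - 1], dims[i]
        data_grad = (fan_in, batch_size, fan_out)    # reduces over output features
        weight_grad = (fan_in, fan_out, batch_size)  # reduces over the batch
        result.append((data_grad, weight_grad))
    return result


top_mlp = backward_gemm_shapes(480, [1024, 1024, 512, 256, 1], 32768)
for data_grad, weight_grad in top_mlp:
    print("dgrad M=%d,N=%d,K=%d" % data_grad, "| wgrad M=%d,N=%d,K=%d" % weight_grad)
```

Note that the weight gradient always reduces over the batch dimension, so its `K` is 32768 regardless of the layer width.
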
Kernels 150 through 165 correspond to the backward propagation through the pairwise interaction layer. Kernels 150-161 probably correspond to the lower (or upper) triangular matrix indexing. Kernels 162 and 163 correspond to the two gradients through the BMM operation. The gradients correspond to the same input and therefore kernels 164 and 165 are probably adding the gradients.
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
150 | bprop | Interaction | index | na=na | modern::elementwise_kernel |
151 | bprop | Interaction | index | na=na | legacy::elementwise_kernel |
152 | bprop | Interaction | index | na=na | kernelPointwiseApply2 |
153 | bprop | Interaction | index | na=na | modern::elementwise_kernel |
154 | bprop | Interaction | index | na=na | kernelPointwiseApply2 |
155 | bprop | Interaction | index | na=na | modern::elementwise_kernel |
156 | bprop | Interaction | index | na=na | modern::elementwise_kernel |
157 | bprop | Interaction | index | na=na | modern::elementwise_kernel |
158 | bprop | Interaction | index | na=na | thrust::cuda_cub::core::_kernel_agent |
159 | bprop | Interaction | index | na=na | thrust::cuda_cub::core::_kernel_agent |
160 | bprop | Interaction | index | na=na | indexing_backward_kernel |
161 | bprop | - | Slice | na=na | modern::elementwise_kernel |
162 | bprop | Interaction | bmm | B=32768,M=27,N=27,K=128 | volta_sgemm_128x64_nt |
163 | bprop | Interaction | bmm | B=32768,M=27,N=27,K=128 | volta_sgemm_128x64_nn |
164 | bprop | - | add | na=na | legacy::elementwise_kernel |
165 | bprop | - | add | na=na | legacy::elementwise_kernel |
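
Why would the BMM produce two gradients that need to be added? For $C = A A^T$ the matrix $A$ is used as both operands, so with upstream gradient $G = \partial L / \partial C$ the total gradient is $G A + G^T A$. The sketch below checks this identity for a single small sample against a finite-difference estimate. It illustrates the math only; mapping the two terms onto kernels 162-163 and the sum onto kernels 164-165 remains an interpretation of the profile.

```python
def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def transpose(x):
    return [list(row) for row in zip(*x)]


def gram_backward(a, g):
    """Gradient of L with respect to A for C = A @ A^T, given G = dL/dC."""
    grad_left = matmul(g, a)              # A used as the left operand
    grad_right = matmul(transpose(g), a)  # A used as the transposed right operand
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(grad_left, grad_right)]


def loss(a, g):
    """Scalar L = sum_ij G_ij * C_ij, so that dL/dC = G."""
    c = matmul(a, transpose(a))
    return sum(g[i][j] * c[i][j] for i in range(len(c)) for j in range(len(c)))


def numeric_grad(a, g, eps=1e-6):
    """Central finite differences, used only to check gram_backward."""
    grad = [[0.0] * len(a[0]) for _ in a]
    for i in range(len(a)):
        for j in range(len(a[0])):
            original = a[i][j]
            a[i][j] = original + eps
            up = loss(a, g)
            a[i][j] = original - eps
            down = loss(a, g)
            a[i][j] = original
            grad[i][j] = (up - down) / (2 * eps)
    return grad


# 3 vectors of length 2 stand in for 27 vectors of length 128.
a = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]]
g = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
analytic = gram_backward(a, g)
numeric = numeric_grad(a, g)
max_err = max(abs(p - q) for rp, rq in zip(analytic, numeric) for p, q in zip(rp, rq))
print(max_err)
```
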
Kernels 166 through 711 correspond to the backward propagation through the 26 Embedding layers.
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
166 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | legacy::elementwise_kernel |
167 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | legacy::elementwise_kernel |
168 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
169 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
170 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
171 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
172 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
173 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
174 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
175 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
176 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
177 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
178 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | modern::elementwise_kernel |
179 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
180 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
181 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | krn_partials_per_segment |
182 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
183 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | thrust::cuda_cub::core::_kernel_agent |
184 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | krn_partial_segment_offset |
185 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | compute_grad_weight |
186 | bprop | Embedding_25 | embedding | I=(32768), E=(102500,128) | sum_and_scatter |
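
The table above lists 21 kernels for `Embedding_25` (166 through 186), and the embeddings are processed in reverse order. Assuming every embedding launches the same 21 kernels, which is consistent with the overall range ending at 711, the kernel indices for any embedding can be computed as follows (the constant and function names are hypothetical).

```python
FIRST_KERNEL = 166
KERNELS_PER_EMBEDDING = 21  # counted from the Embedding_25 table (166..186)
NUM_EMBEDDINGS = 26


def backward_kernel_range(embedding_index):
    """Inclusive (first, last) kernel indices for one Embedding layer's bprop."""
    # Back propagation visits Embedding_25 first and Embedding_0 last.
    position = (NUM_EMBEDDINGS - 1) - embedding_index
    first = FIRST_KERNEL + position * KERNELS_PER_EMBEDDING
    return first, first + KERNELS_PER_EMBEDDING - 1


for index in (25, 24, 0):
    print(index, backward_kernel_range(index))
```
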
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
187 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | legacy::elementwise_kernel |
188 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | legacy::elementwise_kernel |
189 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
190 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
191 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
192 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
193 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
194 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
195 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
196 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
197 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
198 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
199 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | modern::elementwise_kernel |
200 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
201 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
202 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | krn_partials_per_segment |
203 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
204 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | thrust::cuda_cub::core::_kernel_agent |
205 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | krn_partial_segment_offset |
206 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | compute_grad_weight |
207 | bprop | Embedding_24 | embedding | I=(32768), E=(102400,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
208 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | legacy::elementwise_kernel |
209 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | legacy::elementwise_kernel |
210 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
211 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
212 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
213 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
214 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
215 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
216 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
217 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
218 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
219 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
220 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | modern::elementwise_kernel |
221 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
222 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
223 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | krn_partials_per_segment |
224 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
225 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | thrust::cuda_cub::core::_kernel_agent |
226 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | krn_partial_segment_offset |
227 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | compute_grad_weight |
228 | bprop | Embedding_23 | embedding | I=(32768), E=(102300,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
229 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | legacy::elementwise_kernel |
230 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | legacy::elementwise_kernel |
231 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
232 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
233 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
234 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
235 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
236 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
237 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
238 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
239 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
240 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
241 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | modern::elementwise_kernel |
242 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
243 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
244 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | krn_partials_per_segment |
245 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
246 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | thrust::cuda_cub::core::_kernel_agent |
247 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | krn_partial_segment_offset |
248 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | compute_grad_weight |
249 | bprop | Embedding_22 | embedding | I=(32768), E=(102200,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
250 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | legacy::elementwise_kernel |
251 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | legacy::elementwise_kernel |
252 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
253 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
254 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
255 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
256 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
257 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
258 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
259 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
260 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
261 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
262 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | modern::elementwise_kernel |
263 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
264 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
265 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | krn_partials_per_segment |
266 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
267 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
268 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | krn_partial_segment_offset |
269 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | compute_grad_weight |
270 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | sum_and_scatter |
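
Every complete table in this part of the listing has the same 21-row shape: two `legacy::elementwise_kernel` launches, fourteen `thrust::cuda_cub::core::_kernel_agent` launches, one `modern::elementwise_kernel`, and one each of `krn_partials_per_segment`, `krn_partial_segment_offset`, `compute_grad_weight` and `sum_and_scatter`. Because the tables are long, it can be easier to check this kind of pattern with a script than by eye. The following is a minimal sketch, not part of PyProf: `parse_rows`, `table_rows` and `embedding_index` are hypothetical helper names, and the sample text is a handful of rows copied from the Embedding_21 table above. It tallies kernel names and uses the sizing rule from the introduction (`100000 + 100*x`) to recover which embedding a row belongs to from its `E=(rows,dim)` shape. To analyse a whole table, paste its full text in place of `SAMPLE`.

```python
import re
from collections import Counter

# A few rows copied from the Embedding_21 table (not the full 21 rows).
SAMPLE = """\
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
250 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | legacy::elementwise_kernel |
252 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
253 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | thrust::cuda_cub::core::_kernel_agent |
262 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | modern::elementwise_kernel |
265 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | krn_partials_per_segment |
268 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | krn_partial_segment_offset |
269 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | compute_grad_weight |
270 | bprop | Embedding_21 | embedding | I=(32768), E=(102100,128) | sum_and_scatter |
"""


def parse_rows(text):
    """Return one dict per data row, skipping header and separator lines."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have six cells and start with a numeric kernel index.
        if len(cells) != 6 or not cells[0].isdigit():
            continue
        idx, direction, layer, op, params, kernel = cells
        rows.append({"idx": int(idx), "direction": direction, "layer": layer,
                     "op": op, "params": params, "kernel": kernel})
    return rows


def table_rows(params):
    """Extract the embedding table's row count from 'E=(rows,dim)'."""
    match = re.search(r"E=\((\d+),(\d+)\)", params)
    if match is None:
        raise ValueError(f"no E=(rows,dim) shape in {params!r}")
    return int(match.group(1))


def embedding_index(num_rows, base=100000, step=100, count=26):
    """Invert the sizing rule rows = base + step * x used in this post."""
    index, remainder = divmod(num_rows - base, step)
    if remainder != 0 or not 0 <= index < count:
        raise ValueError(f"{num_rows} does not match the sizing rule")
    return index


rows = parse_rows(SAMPLE)
kernel_counts = Counter(row["kernel"] for row in rows)

# The layer recovered from the table shape should match the Layer column.
for row in rows:
    recovered = f"Embedding_{embedding_index(table_rows(row['params']))}"
    assert recovered == row["layer"]

for kernel, count in kernel_counts.most_common():
    print(f"{count:3d}  {kernel}")
```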
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
271 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | legacy::elementwise_kernel |
272 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | legacy::elementwise_kernel |
273 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
274 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
275 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
276 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
277 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
278 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
279 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
280 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
281 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
282 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
283 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | modern::elementwise_kernel |
284 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
285 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
286 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | krn_partials_per_segment |
287 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
288 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | thrust::cuda_cub::core::_kernel_agent |
289 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | krn_partial_segment_offset |
290 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | compute_grad_weight |
291 | bprop | Embedding_20 | embedding | I=(32768), E=(102000,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
292 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | legacy::elementwise_kernel |
293 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | legacy::elementwise_kernel |
294 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
295 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
296 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
297 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
298 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
299 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
300 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
301 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
302 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
303 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
304 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | modern::elementwise_kernel |
305 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
306 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
307 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | krn_partials_per_segment |
308 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
309 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | thrust::cuda_cub::core::_kernel_agent |
310 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | krn_partial_segment_offset |
311 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | compute_grad_weight |
312 | bprop | Embedding_19 | embedding | I=(32768), E=(101900,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
313 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | legacy::elementwise_kernel |
314 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | legacy::elementwise_kernel |
315 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
316 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
317 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
318 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
319 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
320 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
321 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
322 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
323 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
324 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
325 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | modern::elementwise_kernel |
326 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
327 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
328 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | krn_partials_per_segment |
329 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
330 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | thrust::cuda_cub::core::_kernel_agent |
331 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | krn_partial_segment_offset |
332 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | compute_grad_weight |
333 | bprop | Embedding_18 | embedding | I=(32768), E=(101800,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
334 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | legacy::elementwise_kernel |
335 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | legacy::elementwise_kernel |
336 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
337 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
338 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
339 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
340 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
341 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
342 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
343 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
344 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
345 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
346 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | modern::elementwise_kernel |
347 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
348 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
349 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | krn_partials_per_segment |
350 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
351 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | thrust::cuda_cub::core::_kernel_agent |
352 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | krn_partial_segment_offset |
353 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | compute_grad_weight |
354 | bprop | Embedding_17 | embedding | I=(32768), E=(101700,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
355 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | legacy::elementwise_kernel |
356 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | legacy::elementwise_kernel |
357 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
358 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
359 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
360 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
361 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
362 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
363 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
364 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
365 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
366 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
367 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | modern::elementwise_kernel |
368 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
369 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
370 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | krn_partials_per_segment |
371 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
372 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | thrust::cuda_cub::core::_kernel_agent |
373 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | krn_partial_segment_offset |
374 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | compute_grad_weight |
375 | bprop | Embedding_16 | embedding | I=(32768), E=(101600,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
376 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | legacy::elementwise_kernel |
377 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | legacy::elementwise_kernel |
378 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
379 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
380 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
381 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
382 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
383 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
384 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
385 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
386 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
387 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
388 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | modern::elementwise_kernel |
389 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
390 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
391 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | krn_partials_per_segment |
392 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
393 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | thrust::cuda_cub::core::_kernel_agent |
394 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | krn_partial_segment_offset |
395 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | compute_grad_weight |
396 | bprop | Embedding_15 | embedding | I=(32768), E=(101500,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
397 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | legacy::elementwise_kernel |
398 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | legacy::elementwise_kernel |
399 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
400 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
401 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
402 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
403 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
404 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
405 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
406 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
407 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
408 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
409 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | modern::elementwise_kernel |
410 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
411 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
412 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | krn_partials_per_segment |
413 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
414 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | thrust::cuda_cub::core::_kernel_agent |
415 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | krn_partial_segment_offset |
416 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | compute_grad_weight |
417 | bprop | Embedding_14 | embedding | I=(32768), E=(101400,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
418 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | legacy::elementwise_kernel |
419 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | legacy::elementwise_kernel |
420 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
421 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
422 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
423 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
424 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
425 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
426 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
427 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
428 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
429 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
430 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | modern::elementwise_kernel |
431 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
432 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
433 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | krn_partials_per_segment |
434 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
435 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | thrust::cuda_cub::core::_kernel_agent |
436 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | krn_partial_segment_offset |
437 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | compute_grad_weight |
438 | bprop | Embedding_13 | embedding | I=(32768), E=(101300,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
439 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | legacy::elementwise_kernel |
440 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | legacy::elementwise_kernel |
441 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
442 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
443 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
444 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
445 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
446 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
447 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
448 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
449 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
450 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
451 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | modern::elementwise_kernel |
452 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
453 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
454 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | krn_partials_per_segment |
455 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
456 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | thrust::cuda_cub::core::_kernel_agent |
457 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | krn_partial_segment_offset |
458 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | compute_grad_weight |
459 | bprop | Embedding_12 | embedding | I=(32768), E=(101200,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
460 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | legacy::elementwise_kernel |
461 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | legacy::elementwise_kernel |
462 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
463 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
464 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
465 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
466 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
467 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
468 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
469 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
470 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
471 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
472 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | modern::elementwise_kernel |
473 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
474 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
475 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | krn_partials_per_segment |
476 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
477 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | thrust::cuda_cub::core::_kernel_agent |
478 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | krn_partial_segment_offset |
479 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | compute_grad_weight |
480 | bprop | Embedding_11 | embedding | I=(32768), E=(101100,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
481 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | legacy::elementwise_kernel |
482 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | legacy::elementwise_kernel |
483 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
484 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
485 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
486 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
487 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
488 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
489 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
490 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
491 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
492 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
493 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | modern::elementwise_kernel |
494 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
495 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
496 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | krn_partials_per_segment |
497 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
498 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | thrust::cuda_cub::core::_kernel_agent |
499 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | krn_partial_segment_offset |
500 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | compute_grad_weight |
501 | bprop | Embedding_10 | embedding | I=(32768), E=(101000,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
502 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | legacy::elementwise_kernel |
503 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | legacy::elementwise_kernel |
504 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
505 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
506 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
507 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
508 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
509 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
510 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
511 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
512 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
513 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
514 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | modern::elementwise_kernel |
515 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
516 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
517 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | krn_partials_per_segment |
518 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
519 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | thrust::cuda_cub::core::_kernel_agent |
520 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | krn_partial_segment_offset |
521 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | compute_grad_weight |
522 | bprop | Embedding_9 | embedding | I=(32768), E=(100900,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
523 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | legacy::elementwise_kernel |
524 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | legacy::elementwise_kernel |
525 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
526 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
527 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
528 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
529 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
530 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
531 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
532 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
533 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
534 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
535 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | modern::elementwise_kernel |
536 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
537 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
538 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | krn_partials_per_segment |
539 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
540 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | thrust::cuda_cub::core::_kernel_agent |
541 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | krn_partial_segment_offset |
542 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | compute_grad_weight |
543 | bprop | Embedding_8 | embedding | I=(32768), E=(100800,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
544 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | legacy::elementwise_kernel |
545 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | legacy::elementwise_kernel |
546 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
547 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
548 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
549 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
550 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
551 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
552 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
553 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
554 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
555 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
556 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | modern::elementwise_kernel |
557 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
558 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
559 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | krn_partials_per_segment |
560 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
561 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | thrust::cuda_cub::core::_kernel_agent |
562 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | krn_partial_segment_offset |
563 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | compute_grad_weight |
564 | bprop | Embedding_7 | embedding | I=(32768), E=(100700,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
565 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | legacy::elementwise_kernel |
566 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | legacy::elementwise_kernel |
567 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
568 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
569 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
570 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
571 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
572 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
573 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
574 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
575 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
576 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
577 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | modern::elementwise_kernel |
578 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
579 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
580 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | krn_partials_per_segment |
581 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
582 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | thrust::cuda_cub::core::_kernel_agent |
583 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | krn_partial_segment_offset |
584 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | compute_grad_weight |
585 | bprop | Embedding_6 | embedding | I=(32768), E=(100600,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
586 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | legacy::elementwise_kernel |
587 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | legacy::elementwise_kernel |
588 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
589 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
590 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
591 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
592 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
593 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
594 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
595 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
596 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
597 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
598 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | modern::elementwise_kernel |
599 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
600 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
601 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | krn_partials_per_segment |
602 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
603 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | thrust::cuda_cub::core::_kernel_agent |
604 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | krn_partial_segment_offset |
605 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | compute_grad_weight |
606 | bprop | Embedding_5 | embedding | I=(32768), E=(100500,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
607 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | legacy::elementwise_kernel |
608 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | legacy::elementwise_kernel |
609 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
610 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
611 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
612 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
613 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
614 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
615 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
616 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
617 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
618 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
619 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | modern::elementwise_kernel |
620 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
621 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
622 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | krn_partials_per_segment |
623 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
624 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | thrust::cuda_cub::core::_kernel_agent |
625 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | krn_partial_segment_offset |
626 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | compute_grad_weight |
627 | bprop | Embedding_4 | embedding | I=(32768), E=(100400,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
628 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | legacy::elementwise_kernel |
629 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | legacy::elementwise_kernel |
630 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
631 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
632 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
633 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
634 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
635 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
636 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
637 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
638 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
639 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
640 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | modern::elementwise_kernel |
641 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
642 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
643 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | krn_partials_per_segment |
644 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
645 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | thrust::cuda_cub::core::_kernel_agent |
646 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | krn_partial_segment_offset |
647 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | compute_grad_weight |
648 | bprop | Embedding_3 | embedding | I=(32768), E=(100300,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
649 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | legacy::elementwise_kernel |
650 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | legacy::elementwise_kernel |
651 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
652 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
653 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
654 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
655 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
656 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
657 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
658 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
659 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
660 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
661 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | modern::elementwise_kernel |
662 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
663 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
664 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | krn_partials_per_segment |
665 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
666 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | thrust::cuda_cub::core::_kernel_agent |
667 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | krn_partial_segment_offset |
668 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | compute_grad_weight |
669 | bprop | Embedding_2 | embedding | I=(32768), E=(100200,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
670 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | legacy::elementwise_kernel |
671 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | legacy::elementwise_kernel |
672 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
673 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
674 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
675 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
676 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
677 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
678 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
679 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
680 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
681 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
682 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | modern::elementwise_kernel |
683 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
684 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
685 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | krn_partials_per_segment |
686 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
687 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | thrust::cuda_cub::core::_kernel_agent |
688 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | krn_partial_segment_offset |
689 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | compute_grad_weight |
690 | bprop | Embedding_1 | embedding | I=(32768), E=(100100,128) | sum_and_scatter |
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
691 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | legacy::elementwise_kernel |
692 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | legacy::elementwise_kernel |
693 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
694 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
695 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
696 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
697 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
698 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
699 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
700 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
701 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
702 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
703 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | modern::elementwise_kernel |
704 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
705 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
706 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | krn_partials_per_segment |
707 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
708 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | thrust::cuda_cub::core::_kernel_agent |
709 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | krn_partial_segment_offset |
710 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | compute_grad_weight |
711 | bprop | Embedding_0 | embedding | I=(32768), E=(100000,128) | sum_and_scatter |
Kernels 712 through 726 correspond to the backward propagation through the Bottom MLP. Kernel 713 does the backward propagation through the ReLU layer, kernels 714-716 calculate the data gradient and weight gradient of the last Linear layer, and kernel 717 calculates the bias gradient. Likewise, kernels 718-722 and 723-726 correspond to the backward propagation through the second and first Linear layers of the Bottom MLP, respectively.
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
712 | bprop | - | to | na=na | legacy::elementwise_kernel |
713 | bprop | Bottom_MLP | relu | T=(32768,128) | modern::elementwise_kernel |
714 | bprop | Bottom_MLP | linear | M=256,N=32768,K=128 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn |
715 | bprop | Bottom_MLP | linear | M=256,N=128,K=32768 | turing_fp16_s1688gemm_fp16_256x128_ldg8_f2f_nt |
716 | bprop | Bottom_MLP | linear | X=(32768,256),W=(128,256) | splitKreduce_kernel |
717 | bprop | - | sum | na=na | reduce_kernel |
718 | bprop | Bottom_MLP | relu | T=(32768,256) | modern::elementwise_kernel |
719 | bprop | Bottom_MLP | linear | M=512,N=32768,K=256 | turing_fp16_s1688gemm_fp16_128x128_ldg8_f2f_nn |
720 | bprop | Bottom_MLP | linear | M=512,N=256,K=32768 | turing_fp16_s1688gemm_fp16_128x256_ldg8_f2f_nt |
721 | bprop | Bottom_MLP | linear | X=(32768,512),W=(256,512) | splitKreduce_kernel |
722 | bprop | - | sum | na=na | reduce_kernel |
723 | bprop | Bottom_MLP | relu | T=(32768,512) | modern::elementwise_kernel |
724 | bprop | Bottom_MLP | linear | M=13,N=32768,K=512 | volta_sgemm_fp16_32x32_sliced1x4_nt |
725 | bprop | Bottom_MLP | linear | M=13,N=512,K=32768 | splitKreduce_kernel |
726 | bprop | - | sum | na=na | reduce_kernel |
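
The M, N, K values in the Params column of the GEMM rows above can be derived from the shapes of each Linear layer. Below is a minimal sketch, assuming the profiler reports GEMM dimensions in the column-major BLAS convention (the output is M x N and K is the reduction dimension). The function name `linear_backward_gemm_shapes` is hypothetical and is not part of PyProf or PyTorch.

```python
def linear_backward_gemm_shapes(batch, in_features, out_features):
    """Return the (M, N, K) of the two GEMMs in the backward pass of a Linear layer.

    Forward pass: Y = X @ W.T, with X=(batch, in_features), W=(out_features, in_features).
    Assumption: M, N, K follow the column-major BLAS convention.
    """
    # Data gradient: dX = dY @ W, shape (batch, in_features), reduced over out_features.
    dgrad = (in_features, batch, out_features)
    # Weight gradient: dW = dY.T @ X, shape (out_features, in_features), reduced over batch.
    wgrad = (in_features, out_features, batch)
    return {"dgrad": dgrad, "wgrad": wgrad}


# Last Linear layer of the Bottom MLP: X=(32768,256), W=(128,256).
shapes = linear_backward_gemm_shapes(batch=32768, in_features=256, out_features=128)
print(shapes)
```

For the last Linear layer this gives M=256,N=32768,K=128 and M=256,N=128,K=32768, which are the shapes listed for kernels 714 and 715. The weight gradient reduces over the batch dimension (K=32768), which is consistent with it being followed by a `splitKreduce_kernel`. The bias gradient is a sum of the output gradient over the batch dimension, which shows up as the separate `reduce_kernel`.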
At the end of backward propagation, we have all the weight gradients. Kernels 727 through 788 reverse the loss scaling and apply the SGD optimizer to all parameters.
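
Conceptually, these two steps can be sketched in plain Python as follows. This is only an illustration, assuming a static loss scale and plain SGD without momentum or weight decay. The function name and arguments are hypothetical, and the sketch is not how the fused multi-tensor kernels in the table are implemented.

```python
import math


def unscale_and_sgd_step(params, scaled_grads, loss_scale, lr):
    """Undo loss scaling, then apply a plain SGD update in place.

    params and scaled_grads are lists of flat lists of floats (one per tensor).
    Returns False and leaves params untouched if any gradient is inf or NaN.
    """
    inv_scale = 1.0 / loss_scale
    # Reverse the loss scaling: the gradients were computed on (loss * loss_scale).
    grads = [[g * inv_scale for g in grad] for grad in scaled_grads]
    # Skip the update on overflow, as mixed-precision training loops commonly do.
    if any(not math.isfinite(g) for grad in grads for g in grad):
        return False
    # Plain SGD: p <- p - lr * g
    for p, grad in zip(params, grads):
        for i, g in enumerate(grad):
            p[i] -= lr * g
    return True


params = [[1.0, 2.0]]
ok = unscale_and_sgd_step(params, [[128.0, 256.0]], loss_scale=128.0, lr=0.5)
print(ok, params)
```

In the profile, each of these steps is applied to a whole list of tensors by a single `multi_tensor_apply_kernel` launch rather than tensor by tensor, as listed below.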
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
727 | bprop | - | zero_ | T=[(1,)] | modern::elementwise_kernel |
728 | bprop | - | multi_tensor_scale | T=[(512,13), (512), (256,512), (256), (128,256), (128), (1024,480), (1024), (1024,1024), (1024), (512,1024), (512), (256,512), (256), (1,256), (1), (512,13), (512), (256,512), (256), (128,256), (128), (1024,480), (1024), (1024,1024), (1024), (512,1024), (512), (256,512), (256), (1,256), (1,)] | multi_tensor_apply_kernel |
729 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
730 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
731 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
732 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
733 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
734 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
735 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
736 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
737 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
738 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
739 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
740 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
741 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
742 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
743 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
744 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
745 | bprop | - | multi_tensor_axpby | T=[(100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128), (100000,128), (100100,128), (100200,128), (100300,128), (100400,128), (100500,128), (100600,128), (100700,128), (100800,128), (100900,128), (101000,128), (101100,128), (101200,128), (101300,128), (101400,128), (101500,128), (101600,128), (101700,128), (101800,128), (101900,128), (102000,128), (102100,128), (102200,128), (102300,128), (102400,128), (102500,128)] | multi_tensor_apply_kernel |
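
The Params column of rows 739 to 745 is easier to read once it is split up. Each row lists 78 shapes, which is the 26 embedding table shapes repeated three times. One plausible reading, assuming the op is Apex's `multi_tensor_axpby` (which computes `out = a*x + b*y` over parallel tensor lists), is that the three groups are the `x`, `y` and `out` lists. The sketch below rebuilds that list in plain Python so the pattern can be checked; the helper name `axpby_param_shapes` is made up for illustration and is not part of PyProf or Apex.

```python
# Vocabulary sizes used in the profiled run: one embedding table per categorical feature.
EMBEDDING_SIZE = 128
vocab_sizes = [100000 + 100 * x for x in range(26)]


def axpby_param_shapes(vocab_sizes, embedding_size, num_lists=3):
    """Return the flat shape list as it appears in the Params column.

    Assumption: the three groups are the x, y and out tensor lists of an
    axpby-style op (out = a*x + b*y). The table itself does not label them.
    """
    one_list = [(v, embedding_size) for v in vocab_sizes]
    return one_list * num_lists


shapes = axpby_param_shapes(vocab_sizes, EMBEDDING_SIZE)
print(len(shapes))                        # number of shapes listed per row
print(shapes[0], shapes[25], shapes[26])  # first, last of group 1, first of group 2
```

Because every embedding table has a distinct vocabulary size, the point where the list wraps back to `(100000,128)` marks the start of the next tensor list.
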
Idx | Direction | Layer | Op | Params | GPU Kernel |
---|---|---|---|---|---|
746 | bprop | - | add_ | T=[(100000,128), (100000,128)] | modern::elementwise_kernel |
747 | bprop | - | add_ | T=[(100100,128), (100100,128)] | modern::elementwise_kernel |
748 | bprop | - | add_ | T=[(100200,128), (100200,128)] | modern::elementwise_kernel |
749 | bprop | - | add_ | T=[(100300,128), (100300,128)] | modern::elementwise_kernel |
750 | bprop | - | add_ | T=[(100400,128), (100400,128)] | modern::elementwise_kernel |
751 | bprop | - | add_ | T=[(100500,128), (100500,128)] | modern::elementwise_kernel |
752 | bprop | - | add_ | T=[(100600,128), (100600,128)] | modern::elementwise_kernel |
753 | bprop | - | add_ | T=[(100700,128), (100700,128)] | modern::elementwise_kernel |
754 | bprop | - | add_ | T=[(100800,128), (100800,128)] | modern::elementwise_kernel |
755 | bprop | - | add_ | T=[(100900,128), (100900,128)] | modern::elementwise_kernel |
756 | bprop | - | add_ | T=[(101000,128), (101000,128)] | modern::elementwise_kernel |
757 | bprop | - | add_ | T=[(101100,128), (101100,128)] | modern::elementwise_kernel |
758 | bprop | - | add_ | T=[(101200,128), (101200,128)] | modern::elementwise_kernel |
759 | bprop | - | add_ | T=[(101300,128), (101300,128)] | modern::elementwise_kernel |
760 | bprop | - | add_ | T=[(101400,128), (101400,128)] | modern::elementwise_kernel |
761 | bprop | - | add_ | T=[(101500,128), (101500,128)] | modern::elementwise_kernel |
762 | bprop | - | add_ | T=[(101600,128), (101600,128)] | modern::elementwise_kernel |
763 | bprop | - | add_ | T=[(101700,128), (101700,128)] | modern::elementwise_kernel |
764 | bprop | - | add_ | T=[(101800,128), (101800,128)] | modern::elementwise_kernel |
765 | bprop | - | add_ | T=[(101900,128), (101900,128)] | modern::elementwise_kernel |
766 | bprop | - | add_ | T=[(102000,128), (102000,128)] | modern::elementwise_kernel |
767 | bprop | - | add_ | T=[(102100,128), (102100,128)] | modern::elementwise_kernel |
768 | bprop | - | add_ | T=[(102200,128), (102200,128)] | modern::elementwise_kernel |
769 | bprop | - | add_ | T=[(102300,128), (102300,128)] | modern::elementwise_kernel |
770 | bprop | - | add_ | T=[(102400,128), (102400,128)] | modern::elementwise_kernel |
771 | bprop | - | add_ | T=[(102500,128), (102500,128)] | modern::elementwise_kernel |
772 | bprop | - | add_ | T=[(512,13), (512,13)] | modern::elementwise_kernel |
773 | bprop | - | add_ | T=[(512,), (512,)] | modern::elementwise_kernel |
774 | bprop | - | add_ | T=[(256,512), (256,512)] | modern::elementwise_kernel |
775 | bprop | - | add_ | T=[(256,), (256,)] | modern::elementwise_kernel |
776 | bprop | - | add_ | T=[(128,256), (128,256)] | modern::elementwise_kernel |
777 | bprop | - | add_ | T=[(128,), (128,)] | modern::elementwise_kernel |
778 | bprop | - | add_ | T=[(1024,480), (1024,480)] | modern::elementwise_kernel |
779 | bprop | - | add_ | T=[(1024,), (1024,)] | modern::elementwise_kernel |
780 | bprop | - | add_ | T=[(1024,1024), (1024,1024)] | modern::elementwise_kernel |
781 | bprop | - | add_ | T=[(1024,), (1024,)] | modern::elementwise_kernel |
782 | bprop | - | add_ | T=[(512,1024), (512,1024)] | modern::elementwise_kernel |
783 | bprop | - | add_ | T=[(512,), (512,)] | modern::elementwise_kernel |
784 | bprop | - | add_ | T=[(256,512), (256,512)] | modern::elementwise_kernel |
785 | bprop | - | add_ | T=[(256,), (256,)] | modern::elementwise_kernel |
786 | bprop | - | add_ | T=[(1,256), (1,256)] | modern::elementwise_kernel |
787 | bprop | - | add_ | T=[(1,), (1,)] | modern::elementwise_kernel |
788 | bprop | - | multi_tensor_scale | T=[(512,13), (512,), (256,512), (256,), (128,256), (128,), (1024,480), (1024,), (1024,1024), (1024,), (512,1024), (512,), (256,512), (256,), (1,256), (1,), (512,13), (512,), (256,512), (256,), (128,256), (128,), (1024,480), (1024,), (1024,1024), (1024,), (512,1024), (512,), (256,512), (256,), (1,256), (1,)] | multi_tensor_apply_kernel |
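
Rows 746 to 787 contain one in-place `add_` per parameter tensor: 26 embedding tables followed by the weight and bias of each of the 8 Linear layers. The shapes in rows 772 to 787 can be reproduced from the layer widths alone, since a Linear layer mapping `n_in` to `n_out` stores its weight as `(n_out, n_in)` and its bias as `(n_out,)`. The sketch below does this in plain Python. The width lists are read off the table rather than taken from the model source, and `linear_param_shapes` is a hypothetical helper. Row 788 lists the same 16 shapes twice, which would match an input list and an output list if the op is Apex's `multi_tensor_scale`; that reading is an assumption.

```python
# Layer widths inferred from the table rows above (an assumption, not the model source).
bottom_mlp = [13, 512, 256, 128]
top_mlp = [480, 1024, 1024, 512, 256, 1]


def linear_param_shapes(widths):
    """Weight and bias shapes for a chain of Linear layers.

    A Linear layer mapping n_in -> n_out stores its weight as (n_out, n_in)
    and its bias as (n_out,), which is the order seen in the table.
    """
    shapes = []
    for n_in, n_out in zip(widths, widths[1:]):
        shapes.append((n_out, n_in))
        shapes.append((n_out,))
    return shapes


mlp_shapes = linear_param_shapes(bottom_mlp) + linear_param_shapes(top_mlp)
embedding_shapes = [(100000 + 100 * x, 128) for x in range(26)]

print(len(mlp_shapes))                          # MLP weight and bias tensors
print(len(embedding_shapes) + len(mlp_shapes))  # one add_ row per tensor
```

The second count matches the number of `add_` rows in the table (indices 746 through 787).
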